
Data Mining:
Getting to Know and Understanding Data

Getting to Know and Understanding Data

- Data objects and attribute types
- Descriptive statistics of data
- Data visualization
- Measuring similarity and dissimilarity of data

Types of Data Sets

- Record
  - Relational records
  - Data matrix, e.g., numerical matrix
  - Document data: text documents represented as term-frequency vectors
  - Transaction data
- Graph and network
  - World Wide Web
  - Social or information networks
- Ordered
  - Video data: sequence of images
  - Temporal data: time-series
- Spatial, image and multimedia
  - Spatial data: maps
  - Image data
  - Video data

Example of transaction data:

TID | Items
----|--------------------------
1   | Bread, Coke, Milk
2   | Beer, Bread
3   | Beer, Coke, Diaper, Milk
4   | Beer, Bread, Diaper, Milk
5   | Coke, Diaper, Milk

Data Objects

- A data object represents an entity
- Examples:
  - Sales database: customers, items sold, sales
  - Medical database: patients, treatments
  - University database: students, professors, courses
- Data objects are described by attributes
- Database rows -> data objects; columns -> attributes

Attributes

- An attribute (also called a dimension, feature, or variable) describes a characteristic or feature of a data object
  - E.g., customer_ID, name, address
- Types:
  - Nominal
  - Ordinal
  - Binary
  - Numeric:
    - Interval-scaled
    - Ratio-scaled

Attribute Types

- Nominal: categories, states, or names of things
  - E.g., hair color
  - Marital status, zip code, student ID, etc.
- Binary: a nominal attribute with only 2 states (0 and 1)
  - Symmetric binary: both outcomes are equally important
    - E.g., gender
  - Asymmetric binary: the outcomes are not equally important
    - E.g., a medical test (positive or negative)
    - Convention: assign 1 to the more important outcome (e.g., HIV positive)
- Ordinal
  - Values have a meaningful order (ranking), but the magnitude between successive values is not expressed numerically
  - E.g., size = {small, medium, large}, grades, military rank

Numeric Attributes

- Quantity (integer or real-valued)
- Interval-scaled
  - Measured on a scale of equal-sized units
  - Values have order
  - E.g., calendar dates
  - No true zero-point
- Ratio-scaled
  - Has an inherent zero-point
  - E.g., length, weight
  - We can speak of a value as being a multiple of another value
    - E.g., road A is twice as long as road B

Discrete vs. Continuous Attributes

- Discrete attribute
  - Has a finite or countably infinite set of values
  - E.g., zip codes, words in a collection of documents
  - Sometimes represented as integer variables
  - Note: binary attributes are a special case of discrete attributes
- Continuous attribute
  - Has real-number values
  - E.g., temperature, height, weight
  - In practice, continuous attributes are represented as floating-point variables

Basic Statistical Descriptions of Data

- Motivation
  - To better understand the data: central tendency, variation, and spread
- Data dispersion characteristics
  - Median, max, min, quantiles, outliers, variance, etc.

Measuring the Central Tendency

- Note: n is the sample size and N is the population size.
- Mean (algebraic measure) (sample vs. population):

  Sample: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    Population: $\mu = \frac{\sum x}{N}$

  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
  - Trimmed mean: the mean computed after chopping off extreme values
- Median: the middle value of the ordered data (or the average of the two middle values)
  - Estimated by interpolation (for grouped data):

    $\text{median} = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \cdot width$

- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $mean - mode \approx 3 \times (mean - median)$
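The measures above can be sketched with Python's standard library; the sample data and the trimmed_mean helper are illustrative additions, not from the slides:

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 4, 7, 9]

def trimmed_mean(xs, k):
    """Mean after chopping off the k smallest and k largest values."""
    xs = sorted(xs)
    return mean(xs[k:len(xs) - k])

print(mean(data))            # arithmetic mean: 4
print(median(data))          # middle value of the ordered data: 3
print(mode(data))            # most frequent value: 2
print(trimmed_mean(data, 1)) # mean of [2, 2, 3, 4, 7]: 3.6
```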

Symmetric vs. Skewed Data

- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
- (Figures: positively skewed, symmetric, and negatively skewed distributions)

Data Mining: Concepts and Techniques, March 24, 2016

Measuring the Dispersion of Data

- Quartiles, outliers, and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five-number summary: min, Q1, median, Q3, max
  - Outlier: usually a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: $\sigma$)
  - Variance (algebraic, scalable computation):

    Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

    Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{n}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{n} x_i^2 - \mu^2$

  - The standard deviation s (or $\sigma$) is the square root of the variance $s^2$ (or $\sigma^2$)
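A minimal sketch of these dispersion measures; it assumes the common median-of-halves convention for quartiles (conventions differ between textbooks and libraries), and the sample data is made up for illustration:

```python
from statistics import median, stdev

data = [1, 2, 2, 3, 4, 7, 9, 30]

def quartiles(xs):
    """Q1 and Q3 as medians of the lower and upper halves of the data."""
    xs = sorted(xs)
    mid = len(xs) // 2
    q1 = median(xs[:mid])
    q3 = median(xs[mid + (len(xs) % 2):])
    return q1, q3

q1, q3 = quartiles(data)
iqr = q3 - q1
five_num = (min(data), q1, median(data), q3, max(data))
# Flag values more than 1.5 x IQR beyond the quartiles
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(five_num, outliers, stdev(data))
```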

Boxplot Analysis

- Five-number summary of a distribution
  - Minimum, Q1, Median, Q3, Maximum
- Boxplot
  - The data is represented with a box
  - The ends of the box are the first and third quartiles; the height of the box is the IQR
  - The median is marked by a line within the box
  - Outliers: plotted individually outside the box

Properties of the Normal Distribution Curve

- The normal curve
  - From $\mu - \sigma$ to $\mu + \sigma$: contains about 68% of the measurements ($\mu$: mean, $\sigma$: standard deviation)
  - From $\mu - 2\sigma$ to $\mu + 2\sigma$: contains about 95% of the measurements
  - From $\mu - 3\sigma$ to $\mu + 3\sigma$: contains about 99.7% of the measurements

Histogram Analysis

- Histogram: a graph that displays a tabulation of data frequencies

Histograms Often Tell More than Boxplots

- Two different histograms can yield the same boxplot
  - Same values: min, Q1, median, Q3, max
- But their data distributions differ

Scatter Plot

- Displays bivariate data to look for clusters of points, outliers, etc.
- Each point shows a pair of coordinates of a data item

Positively and Negatively Correlated Data

- Upper left: positive correlation
- Upper right: negative correlation

Uncorrelated Data

- (Figure: scatter plots of uncorrelated data)

Data Visualization

- Why data visualization?
  - Gain insight into an information space by mapping data onto graphical primitives
  - Provide a qualitative overview of large data sets
  - Search for patterns, trends, structure, irregularities, and relationships among the data
  - Help find interesting regions and suitable parameters for further quantitative analysis

Geometric Projection Visualization Techniques

- Visualization of geometric transformations and projections of the data
- Methods
  - Scatterplot and scatterplot matrices

Scatterplot Matrices

- Matrix of scatterplots (x-y diagrams) of the k-dimensional data [a total of (k^2 - k)/2 distinct scatterplots]
- Used by permission of M. Ward, Worcester Polytechnic Institute

Similarity and Dissimilarity

- Similarity
  - A numerical measure of how alike two data objects are
  - Higher when the objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity (e.g., distance)
  - A numerical measure of how different two data objects are
  - Lower when the objects are more alike
  - Minimum dissimilarity is often 0

Data Matrix and Dissimilarity Matrix

- Data matrix
  - n data points with p dimensions
  - Two modes (rows are objects, columns are attributes)

    $\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

- Dissimilarity matrix
  - n data points, but it registers only the distance between each pair
  - A triangular matrix
  - Single mode

    $\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$

Proximity Measure for Nominal Attributes

- A nominal attribute can take 2 or more states, e.g., red, yellow, blue, green (a generalization of a binary attribute)
- Method: simple matching
  - m: # of attributes on which i and j match, p: total # of attributes

    $d(i, j) = \frac{p - m}{p}$
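The simple-matching formula above can be sketched in a few lines; the attribute values in the example call are hypothetical:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p, where m is the number of
    matching attributes and p the total number of attributes."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Two objects described by three nominal attributes (2 of 3 match)
d = nominal_dissimilarity(["red", "small", "circle"],
                          ["red", "large", "circle"])
print(d)  # (3 - 2) / 3
```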

Proximity Measure for Binary Attributes

- A contingency table for binary data (q, r, s, t count the attribute combinations between object i and object j):

                Object j
                  1    0
    Object i 1    q    r
             0    s    t

- Distance measure for symmetric binary variables:

  $d(i, j) = \frac{r + s}{q + r + s + t}$

- Distance measure for asymmetric binary variables:

  $d(i, j) = \frac{r + s}{q + r + s}$

- Jaccard coefficient (similarity measure for asymmetric binary variables):

  $sim_{Jaccard}(i, j) = \frac{q}{q + r + s}$

- Note: the Jaccard coefficient is the same as "coherence"

Dissimilarity between Binary Variables

- Example:

  Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
  Jack |   M    |   Y   |   N   |   P    |   N    |   N    |   N
  Mary |   F    |   Y   |   N   |   P    |   N    |   P    |   N
  Jim  |   M    |   Y   |   P   |   N    |   N    |   N    |   N

- Gender is a symmetric attribute
- The remaining attributes are asymmetric binary
- Let the values Y and P be 1, and the value N be 0

  $d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33$

  $d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67$

  $d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75$
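The worked example above can be reproduced with a short sketch; the 0/1 encodings follow the slide's convention (Y/P -> 1, N -> 0, with the symmetric Gender attribute excluded):

```python
def asym_binary_dissim(x, y):
    """Asymmetric binary dissimilarity d = (r + s) / (q + r + s);
    matching 0-0 pairs (the t cell) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

# Attributes: Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(asym_binary_dissim(jack, mary))  # 1/3
print(asym_binary_dissim(jack, jim))   # 2/3
print(asym_binary_dissim(jim, mary))   # 3/4
```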

Standardizing Numeric Data

- Z-score: $z = \frac{x - \mu}{\sigma}$
  - x: raw score to be standardized, $\mu$: mean of the population, $\sigma$: standard deviation
  - The distance between the raw score and the population mean, in units of the standard deviation
  - Negative when the raw score is below the mean, positive when above
- An alternative way: calculate the mean absolute deviation

  $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$

  where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

  - Standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
- Using the mean absolute deviation is more robust than using the standard deviation
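A sketch of standardization via the mean absolute deviation, as defined above; the input values are illustrative:

```python
def mad_zscores(xs):
    """Standardize with the mean absolute deviation s_f in place of the
    standard deviation: z_if = (x_if - m_f) / s_f."""
    n = len(xs)
    m = sum(xs) / n                         # m_f: the attribute mean
    s = sum(abs(x - m) for x in xs) / n     # s_f: mean absolute deviation
    return [(x - m) / s for x in xs]

z = mad_zscores([2.0, 4.0, 6.0, 8.0])  # m_f = 5.0, s_f = 2.0
print(z)
```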

Example: Data Matrix and Dissimilarity Matrix

- Data matrix (figure)
- Dissimilarity matrix, with Euclidean distance (figure)
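Since the slide's figures did not survive extraction, here is a hypothetical data matrix and the corresponding lower-triangular Euclidean dissimilarity matrix, matching the triangular single-mode layout described earlier:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical data matrix: 4 points with 2 dimensions
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

# Lower-triangular dissimilarity matrix: row i holds d(i, j) for j < i
dissim = [[round(dist(points[i], points[j]), 2) for j in range(i)]
          for i in range(len(points))]

for row in dissim:
    print(row)
```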

Distance on Numeric Data: Minkowski Distance

- Minkowski distance: a popular distance measure

  $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
- Properties
  - d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)

Special Cases of Minkowski Distance

- h = 1: Manhattan (city block, L1 norm) distance
  - E.g., the Hamming distance: the number of bits that are different between two binary vectors

    $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$

- h = 2: Euclidean (L2 norm) distance

    $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

- h -> ∞: supremum (Lmax norm, L∞ norm) distance
  - This is the maximum difference between any component (attribute) of the vectors
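The three special cases can be sketched directly from the formulas; the vectors x and y below are made-up examples:

```python
def minkowski(x, y, h):
    """Minkowski (L-h norm) distance between two equal-length vectors;
    h = 1 gives Manhattan, h = 2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def supremum(x, y):
    """L-infinity (max) distance: the largest per-component difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (1, 0, 3), (4, 4, 3)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 + 0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16)
print(supremum(x, y))      # max(3, 4, 0)
```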

Example: Minkowski Distance

- Dissimilarity matrices (figures): Manhattan (L1), Euclidean (L2), supremum

Ordinal Variables

- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled
  - Replace each $x_{if}$ by its rank $r_{if} \in \{1, ..., M_f\}$
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$

  - Compute the dissimilarity using methods for interval-scaled variables
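The rank-to-[0, 1] mapping above can be sketched as follows; the size levels are taken from the earlier slide's example:

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordinal value onto [0, 1]: z = (r - 1) / (M - 1), where r is
    the value's 1-based rank and M the number of ordered levels."""
    r = ordered_levels.index(value) + 1
    m = len(ordered_levels)
    return (r - 1) / (m - 1)

sizes = ["small", "medium", "large"]
print([ordinal_to_unit_interval(s, sizes) for s in sizes])
```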

Attributes of Mixed Type

- A database may contain all attribute types
  - Nominal, symmetric binary, asymmetric binary, numeric, ordinal
- One may use a weighted formula to combine their effects:

  $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

  - If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
  - If f is numeric: use the normalized distance
  - If f is ordinal:
    - Compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    - Treat $z_{if}$ as interval-scaled

Cosine Similarity

- A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as a keyword) or phrase in the document.
- Other vector objects: gene features in micro-arrays, ...
- Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
- Cosine measure: if d1 and d2 are two vectors (e.g., term-frequency vectors), then

  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),

  where · indicates the vector dot product and ||d|| is the length of vector d

Example: Cosine Similarity

- cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
- Ex: find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123

  cos(d1, d2) = 25 / (6.481 * 4.123) = 0.94
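The worked example can be verified with a short sketch of the cosine measure:

```python
from math import sqrt

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| ||b||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```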

KL Divergence: Comparing Two Probability Distributions

- The Kullback-Leibler (KL) divergence measures the difference between two probability distributions over the same variable x
- From information theory; closely related to relative entropy, information divergence, and information for discrimination
- $D_{KL}(p(x) \| q(x))$: the divergence of q(x) from p(x), measuring the information lost when q(x) is used to approximate p(x)
- Discrete form:

  $D_{KL}(p(x) \| q(x)) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$

- The KL divergence measures the expected number of extra bits required to code samples from p(x) (the true distribution) when using a code based on q(x), which represents a theory, model, description, or approximation of p(x)
- Its continuous form:

  $D_{KL}(p(x) \| q(x)) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$

- The KL divergence is not a distance measure, not a metric: it is asymmetric and does not satisfy the triangle inequality

How to Compute the KL Divergence?

- Based on the formula, $D_{KL}(P \| Q) \geq 0$, and $D_{KL}(P \| Q) = 0$ if and only if P = Q
- What about when p = 0 or q = 0?
  - $\lim_{p \to 0} p \log p = 0$
  - When p != 0 but q = 0, $D_{KL}(p \| q)$ is defined as $\infty$; i.e., if one event e is possible (p(e) > 0) but the other distribution predicts it is absolutely impossible (q(e) = 0), then the two distributions are absolutely different
- However, in practice, P and Q are derived from frequency distributions, which cannot account for unseen events; thus smoothing is needed
- Example: P: (a: 3/5, b: 1/5, c: 1/5), Q: (a: 5/9, b: 3/9, d: 1/9)
  - Introduce a small constant $\epsilon$, e.g., $\epsilon = 10^{-3}$
  - The sample sets observed: $S_P$ = {a, b, c}, $S_Q$ = {a, b, d}, $S_U$ = {a, b, c, d}
  - Smoothing: add the missing symbols to each distribution with probability $\epsilon$
    - P': (a: 3/5 - ε/3, b: 1/5 - ε/3, c: 1/5 - ε/3, d: ε)
    - Q': (a: 5/9 - ε/3, b: 3/9 - ε/3, c: ε, d: 1/9 - ε/3)
  - $D_{KL}(P' \| Q')$ can then be computed easily
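A sketch of the discrete KL divergence and the smoothing recipe above; smooth generalizes the slide's "subtract ε/3 from each observed symbol" step to any number of missing symbols:

```python
from math import log

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x p(x) * log(p(x) / q(x)); terms with p(x) = 0
    contribute 0 (lim p->0 of p log p = 0). Assumes q covers the support
    of p, e.g., after smoothing."""
    return sum(px * log(px / q[x]) for x, px in p.items() if px > 0)

def smooth(p, universe, eps=1e-3):
    """Give each unseen symbol probability eps, subtracting the added mass
    equally from the observed symbols so the total stays 1."""
    missing = [x for x in universe if x not in p]
    k = len(p)
    out = {x: px - eps * len(missing) / k for x, px in p.items()}
    for x in missing:
        out[x] = eps
    return out

P = {"a": 3/5, "b": 1/5, "c": 1/5}
Q = {"a": 5/9, "b": 3/9, "d": 1/9}
universe = set(P) | set(Q)
Pp, Qp = smooth(P, universe), smooth(Q, universe)
dkl = kl_divergence(Pp, Qp)
print(dkl)
```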
