
Classification: Nearest Neighbor

Jeff Howbert, Introduction to Machine Learning, Winter 2012


Instance based classifiers

Set of Stored Cases Store the training samples

Atr1 ... AtrN Class Use


U ttraining
i i samples l to
t
A predict the class label of
unseen samples
B
B
Unseen Case
C
Atr1 ... AtrN
A
C
B

Jeff Howbert Introduction to Machine Learning Winter 2012 2


Instance based classifiers

- Examples:
  - Rote learner: memorize the entire training data; perform classification only if the attributes of a test sample exactly match one of the training samples
  - Nearest neighbor: use the k closest samples (nearest neighbors) to perform classification


Nearest neighbor classifiers

- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck

(Figure: compute the distance from the test sample to the training samples, then choose the k nearest samples.)


Nearest neighbor classifiers

(Figure: an unknown record among labeled training records.)

Requires three inputs:
1. The set of stored samples
2. A distance metric to compute the distance between samples
3. The value of k, the number of nearest neighbors to retrieve


Nearest neighbor classifiers

To classify an unknown record (see the sketch below):
1. Compute its distance to the training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
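The three steps map directly onto a few lines of code. A minimal sketch using NumPy (the function name and the toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Compute Euclidean distance from x_test to every training record.
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # 2. Identify the k nearest neighbors.
    nearest = np.argsort(dists)[:k]
    # 3. Take a majority vote over the neighbors' class labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0], [9.0, 8.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> "A"
```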


Definition of nearest neighbor

(Figure: three panels around a test point x: (a) 1-nearest neighbor, (b) 2-nearest neighbors, (c) 3-nearest neighbors.)

The k-nearest neighbors of a sample x are the data points that have the k smallest distances to x.


1-nearest neighbor

(Figure: Voronoi diagram.) The decision regions of a 1-nearest neighbor classifier form the Voronoi diagram of the training samples: each sample's cell is the set of points closer to it than to any other sample.


Nearest neighbor classification

- Compute the distance between two points, e.g. Euclidean distance:

  d(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }

- Options for determining the class from the nearest neighbor list:
  - Take a majority vote of the class labels among the k nearest neighbors
  - Weight the votes according to distance (example: weight factor w = 1 / d^2); a weighted-vote sketch follows
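A sketch of the distance-weighted option, assuming the w = 1 / d^2 factor from the slide (the epsilon guard is my addition to handle an exact distance of zero):

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_test, k=3):
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + 1e-12)  # weight factor w = 1 / d^2
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    # The class with the largest total weighted vote wins.
    return max(votes, key=votes.get)
```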


Nearest neighbor classification

- Choosing the value of k (a cross-validation sketch follows):
  - If k is too small, the classifier is sensitive to noise points
  - If k is too large, the neighborhood may include points from other classes
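One common way to pick k between those two extremes is cross-validation. A sketch using scikit-learn (not part of the slides; the dataset and range of k are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Score odd values of k by 5-fold cross-validation and keep the best.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 22, 2)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```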


Nearest neighbor classification

- Scaling issues (a standardization sketch follows):
  - Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  - Example:
    - height of a person may vary from 1.5 m to 1.8 m
    - weight of a person may vary from 90 lb to 300 lb
    - income of a person may vary from $10K to $1M
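A minimal sketch of one standard fix, z-score standardization of each attribute (the toy values mirror the slide's ranges; they are not real data):

```python
import numpy as np

# Columns: height (m), weight (lb), income ($). Without scaling, income
# differences of ~10^5 swamp height differences of ~0.1 in Euclidean distance.
X = np.array([[1.6, 120.0,  50_000.0],
              [1.8, 250.0, 900_000.0],
              [1.5,  95.0,  12_000.0]])

# Standardize: zero mean, unit variance per attribute.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```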


Nearest neighbor classification

- Problem with the Euclidean measure: high dimensional data (curse of dimensionality) can produce counter-intuitive results, e.g. for binary vectors:

  111111111110 vs 011111111111: d = 1.4142
  100000000000 vs 000000000001: d = 1.4142

  The first pair shares ten 1s and the second pair shares none, yet both pairs are equally far apart.
- One solution: normalize the vectors to unit length (see the sketch below)
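A sketch reproducing the slide's numbers and the effect of the proposed fix:

```python
import numpy as np

a = np.array([1]*11 + [0], dtype=float)  # 111111111110
b = np.array([0] + [1]*11, dtype=float)  # 011111111111
c = np.array([1] + [0]*11, dtype=float)  # 100000000000
e = np.array([0]*11 + [1], dtype=float)  # 000000000001

unit = lambda v: v / np.linalg.norm(v)

# Raw Euclidean distance treats both pairs the same: 1.4142 each.
print(np.linalg.norm(a - b), np.linalg.norm(c - e))
# After normalizing to unit length, the nearly identical pair (a, b)
# becomes close (~0.426) while the disjoint pair (c, e) stays at 1.4142.
print(np.linalg.norm(unit(a) - unit(b)), np.linalg.norm(unit(c) - unit(e)))
```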
Nearest neighbor classification

- The k-nearest neighbor classifier is a lazy learner:
  - It does not build a model explicitly, unlike eager learners such as decision tree induction and rule-based systems
  - Classifying unknown samples is relatively expensive
- The k-nearest neighbor classifier is a local model, vs. the global model of linear classifiers


Example: PEBLS

- PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
  - Works with both continuous and nominal features
  - For nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM)
  - Each sample is assigned a weight factor
  - Number of nearest neighbors: k = 1


Example: PEBLS

Training data:

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Class counts per nominal value:

Marital Status:
Class | Single | Married | Divorced
Yes   |   2    |    0    |    1
No    |   2    |    4    |    1

Refund:
Class | Yes | No
Yes   |  0  |  3
No    |  3  |  4

Distance between nominal attribute values:

d(V_1, V_2) = \sum_i | n_{1i}/n_1 - n_{2i}/n_2 |

d(Single, Married)       = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)      = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced)     = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
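A sketch of MVDM on the slide's table (the function and variable names are mine); it reproduces the distances above:

```python
# (Refund, Marital Status, Cheat) for the ten training records above.
records = [("Yes", "Single",   "No"),  ("No",  "Married",  "No"),
           ("No",  "Single",   "No"),  ("Yes", "Married",  "No"),
           ("No",  "Divorced", "Yes"), ("No",  "Married",  "No"),
           ("Yes", "Divorced", "No"),  ("No",  "Single",   "Yes"),
           ("No",  "Married",  "No"),  ("No",  "Single",   "Yes")]

def mvdm(attr, v1, v2, records, classes=("Yes", "No")):
    # d(V1, V2) = sum over classes i of | n1i/n1 - n2i/n2 |
    r1 = [r for r in records if r[attr] == v1]
    r2 = [r for r in records if r[attr] == v2]
    return sum(abs(sum(r[-1] == c for r in r1) / len(r1) -
                   sum(r[-1] == c for r in r2) / len(r2)) for c in classes)

print(mvdm(1, "Single", "Married", records))   # 1.0
print(mvdm(1, "Single", "Divorced", records))  # 0.0
print(mvdm(0, "Yes", "No", records))           # 0.857... = 6/7
```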


Example: PEBLS

Tid | Refund | Marital Status | Taxable Income | Cheat
X   | Yes    | Single         | 125K           | No
Y   | No     | Married        | 100K           | No

Distance between record X and record Y:

d(X, Y) = w_X w_Y \sum_{i=1}^{d} d(X_i, Y_i)^2

where:

w_X = (number of times X is used for prediction) / (number of times X predicts correctly)

w_X ≈ 1 if X makes accurate predictions most of the time
w_X > 1 if X is not reliable for making predictions
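Reusing the mvdm function and records list from the previous sketch, the record distance over the two nominal attributes looks as follows (the weights are placeholders set to 1, and the continuous income attribute is omitted for brevity):

```python
def pebls_distance(x, y, w_x, w_y, records):
    # d(X, Y) = w_X * w_Y * sum_i d(X_i, Y_i)^2
    return w_x * w_y * sum(mvdm(i, x[i], y[i], records) ** 2
                           for i in range(len(x) - 1))

X = ("Yes", "Single",  "No")  # record X: Refund=Yes, Marital=Single
Y = ("No",  "Married", "No")  # record Y: Refund=No,  Marital=Married
print(pebls_distance(X, Y, 1.0, 1.0, records))  # (6/7)^2 + 1^2 ≈ 1.735
```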
Decision boundaries in global vs. local models

(Figure: decision boundaries of linear regression, a 15-nearest neighbor classifier, and a 1-nearest neighbor classifier.)

linear regression:  global, stable,   can be inaccurate
1-nearest neighbor: local,  accurate, unstable

What ultimately matters: GENERALIZATION
