Let $\bar{D}_{c,q}(h)$ denote the weighted average distance from $h$ to the pseudopoints $e_i$ of an existing class $c$:

$$\bar{D}_{c,q}(h) = \frac{\sum_{i=1}^{m} weight(e_i)\, D(h, e_i)}{\sum_{i=1}^{m} weight(e_i)}$$

$$\bar{D}_{c_{out},q}(h) = \frac{weight(h)\,\bar{h}.\mu + \sum_{i=1}^{r-1} weight(h_i)\, D(h, h_i)}{weight(h) + \sum_{i=1}^{r-1} weight(h_i)}$$

where $\bar{h}.\mu$ is the mean intracluster distance of the cluster corresponding to the O-pseudopoint $h$. Therefore, q-NSC(h) is given by

$$q\text{-}NSC(h) = \frac{\bar{D}_{c_{min},q}(h) - \bar{D}_{c_{out},q}(h)}{\max\left(\bar{D}_{c_{min},q}(h),\ \bar{D}_{c_{out},q}(h)\right)} \qquad (1)$$
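As a concrete illustration, the following is a minimal Python sketch of these two averages and of q-NSC. The Pseudopoint record, the Euclidean distance, and the selection of the q nearest pseudopoints are assumptions standing in for the paper's actual data structures:

from dataclasses import dataclass
import numpy as np

@dataclass
class Pseudopoint:
    centroid: np.ndarray
    weight: int            # number of instances in the cluster
    mu: float = 0.0        # mean intracluster distance (h.mu in the text)

def euclid(a, b):
    return float(np.linalg.norm(a - b))

def q_nsc(h, existing_by_class, other_outliers, q):
    # D_{c,q}(h): weighted average distance to the q nearest pseudopoints
    # of an existing class c
    def wavg(points):
        pts = sorted(points, key=lambda p: euclid(h.centroid, p.centroid))[:q]
        num = sum(p.weight * euclid(h.centroid, p.centroid) for p in pts)
        return num / sum(p.weight for p in pts)
    # D_{c_min,q}(h): the minimum over all existing classes
    d_min = min(wavg(pts) for pts in existing_by_class.values())
    # D_{c_out,q}(h): h's own spread combined with the r-1 other O-pseudopoints
    num = h.weight * h.mu + sum(p.weight * euclid(h.centroid, p.centroid)
                                for p in other_outliers)
    d_out = num / (h.weight + sum(p.weight for p in other_outliers))
    return (d_min - d_out) / max(d_min, d_out)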
Algorithm 1 Detect-Novel(M, Buf)
Input: M: current ensemble of best L classifiers
Buf: buffer temporarily holding F-outlier instances
Output: identification of the novel class instances
1: K0 <- (K * |Buf| / S) // S = chunk size, K = clusters per chunk
2: K-means(Buf, K0) // create K0 O-pseudopoints
3: for each classifier Mi in M do
4:   tp <- 0
5:   for each cluster h in H do
6:     h.sc <- q-NSC(h) // (equation 1)
7:     if h.sc > 0 then
8:       tp += h.size // total instances in the cluster
9:       for each instance x in h.cluster do x.sc <- max(x.sc, h.sc)
10:    end if
11:  end for
12:  if tp > q then vote++
13: end for
14: if vote == L then // found novel class, identify novel instances
15:   Xnov <- all instances x with x.sc > 0
16:   for all x in Xnov do
17:     x.ns <- Nscore(x) // equation 2
18:     if x.ns > Gini_th then N_list <- N_list U {x}
19:   end for
20:   Detect-Multinovel(N_list) // algorithm 3
21: end if
Let H be the set of O-pseudopoints. For each classifier Mi in M, we compute the q-NSC value of every O-pseudopoint h in H. If the q-NSC value of an O-pseudopoint h is positive, we assign the q-NSC value of h to each instance in the corresponding cluster. Note that the q-NSC value of an instance may be different for different classifiers; we save the maximum value. We also keep a counter (tp) that counts the total number of instances having positive q-NSC. If tp is greater than the threshold q, the corresponding classifier votes in favor of a novel class. If all the classifiers in the ensemble vote for a novel class, we declare a novel class and then identify the novel class instances: we compute the Nscore(x) value of every instance x having positive q-NSC, and if Nscore(x) is greater than a threshold (Gini_th), we tag the instance as a novel class instance and save it in the novel instance list (N_list). Finally, we check whether there is more than one novel class. We assume that the labels of the instances in the current chunk become available when the next chunk of data (i.e., the next S instances) appears in the stream. In addition, any outlier instance in Buf that becomes older than S, the chunk size, is removed.
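The following sketch shows how the voting logic of Algorithm 1 might look in Python. Here kmeans, nscore, and detect_multinovel are hypothetical helpers, and the per-classifier q-NSC computation is folded into a model.q_nsc method for brevity:

def detect_novel(ensemble, buf, S, K, q, gini_th):
    # Sketch of Algorithm 1: K0 O-pseudopoints are built from the buffered
    # F-outliers; each classifier votes if more than q instances fall in
    # clusters with positive q-NSC.
    K0 = max(1, K * len(buf) // S)
    clusters = kmeans(buf, K0)                 # hypothetical K-means helper
    scores = {id(x): 0.0 for x in buf}         # x.sc, one slot per instance
    votes = 0
    for model in ensemble:
        tp = 0                                 # instances with positive q-NSC
        for h in clusters:
            sc = model.q_nsc(h)                # equation (1), per classifier
            if sc > 0:
                tp += h.size
                for x in h.instances:          # keep the maximum score seen
                    scores[id(x)] = max(scores[id(x)], sc)
        if tp > q:
            votes += 1                         # this classifier votes "novel"
    if votes == len(ensemble):                 # unanimous: novel class found
        x_nov = [x for x in buf if scores[id(x)] > 0]
        n_list = [x for x in x_nov if nscore(x) > gini_th]   # equation (2)
        detect_multinovel(n_list)              # Algorithm 3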
C. OUTLIER DETECTION USING ADAPTIVE
THRESHOLD
The algorithm takes the latest labeled instance x and the current OUTTH as input. It checks whether x was a false-novel instance, meaning that x belongs to an existing class but was falsely detected as a novel class instance. If x is false-novel, then it must have been an F-outlier, and therefore inst_weight(x) < OUTTH. If the difference OUTTH - inst_weight(x) is less than a small constant ε, we call x a marginal false-novel instance. If x is found to be a marginal false-novel instance, OUTTH is lowered so that future instances like it do not fall outside the decision boundary: OUTTH is decreased by the small value ε, which increases the slack space beyond the surface of a hypersphere. On the other hand, if x is a marginal false-existing instance, i.e., x is a novel class instance that was wrongly identified as an existing class instance by a narrow margin, then the slack space must be decreased (OUTTH increased). This is done by increasing OUTTH by ε. The marginal constraint is applied to avoid drastic changes in the OUTTH value: if the test instance is neither a marginal false-novel nor a marginal false-existing instance, OUTTH is left unchanged.
Algorithm 2 Adjust-threshold(x, OUTTH)
Input: x: most recent labeled instance
OUTTH: current outlier threshold
Output: OUTTH: new outlier threshold
1: if false-novel(x) && OUTTH - inst_weight(x) < ε then
2:   OUTTH -= ε // increase slack space
3: else if false-existing(x) && inst_weight(x) - OUTTH < ε then
4:   OUTTH += ε // decrease slack space
5: end if
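A direct transcription of Algorithm 2 into Python, assuming hypothetical helpers is_false_novel, is_false_existing, and inst_weight; the value of ε is a tuning parameter, not one fixed by the paper:

EPS = 0.01  # the small constant ε; an assumed placeholder value

def adjust_threshold(x, outth, eps=EPS):
    # Sketch of Algorithm 2: nudge OUTTH only for *marginal* mistakes,
    # to avoid drastic changes in the threshold.
    if is_false_novel(x) and outth - inst_weight(x) < eps:
        outth -= eps   # increase slack space beyond the decision boundary
    elif is_false_existing(x) and inst_weight(x) - outth < eps:
        outth += eps   # decrease slack space
    return outth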
D. NOVEL CLASS DETECTION USING GINI
COEFFICIENT
After detecting the F-outlier instances using the OUTTH value discussed in the previous section, we compute the q-NSC(x) value of each F-outlier instance x. If the q-NSC(x) value is negative, x is removed from consideration, i.e., x is regarded as an existing class instance. For the remaining F-outliers, q-NSC(.) lies within the range [0, 1]. We then compute a compound measure for each such F-outlier, called the Novelty score or Nscore:

$$N_{score}(x) = \frac{1 - inst\_weight(x)}{1 - minweight} \cdot q\text{-}NSC(x) \qquad (2)$$

where minweight is the minimum inst_weight among all F-outliers having positive q-NSC. Nscore contains two parts: the first part measures how far the outlier is from its nearest existing class pseudopoint, and the second part measures the cohesion of the F-outlier with other F-outliers and its separation from the existing class instances.

We divide the Nscore(x) values into n equal intervals (or bins) and construct a cumulative distribution function (CDF) of Nscore. Let $y_i$ be the value of the CDF for the $i$th interval. We compute the discrete Gini coefficient G(s) for the sample $y_1, \dots, y_n$ in the standard form

$$G(s) = \frac{1}{n}\left(n + 1 - 2\,\frac{\sum_{i=1}^{n} (n + 1 - i)\, y_i}{\sum_{i=1}^{n} y_i}\right)$$

and decide as follows:
. If G(s) > (n-1)/(3n), declare a novel class and tag the F-outliers as novel class instances.
. If G(s) = 0, classify the F-outliers as existing class instances.
. If G(s) is in (0, (n-1)/(3n)], filter out the F-outliers falling in the first interval, and consider the rest of the F-outliers as novel class.
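A sketch of this decision procedure, assuming each F-outlier is a mutable record carrying its inst_weight and a precomputed q-NSC value; the binning range [0, 1] follows from Nscore lying in that interval:

import numpy as np

def novelty_decision(f_outliers, n_bins=10):
    # Sketch of the Gini-coefficient test described above.
    kept = [x for x in f_outliers if x.qnsc > 0]       # drop negative q-NSC
    if not kept:
        return []
    minweight = min(x.inst_weight for x in kept)
    for x in kept:                                     # equation (2)
        x.nscore = (1 - x.inst_weight) / (1 - minweight) * x.qnsc

    # CDF of Nscore over n equal-width bins
    hist, edges = np.histogram([x.nscore for x in kept],
                               bins=n_bins, range=(0.0, 1.0))
    y = np.cumsum(hist) / len(kept)                    # y_i = CDF of bin i
    n = len(y)
    gini = (n + 1 - 2 * np.sum((n - np.arange(n)) * y) / np.sum(y)) / n

    if gini > (n - 1) / (3 * n):
        return kept                                    # all tagged as novel
    if gini == 0:
        return []                                      # all existing class
    # otherwise drop the F-outliers falling in the first interval
    return [x for x in kept if x.nscore > edges[1]]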
Multiple Novel Class Detection
To detect multiple novel classes, we construct a graph and identify its connected components. The number of novel classes is determined by the number of connected components.
Algorithm 3 Detect-Multinovel(N_list)
Input: N_list: list of novel class instances
Output: N_type: predicted class labels of the novel instances
// Separation phase
1: G = (V, E) <- empty // initialize graph
2: NP_list <- K-means(N_list, Kv) // clustering
3: for h in NP_list do
4:   h.nn <- Nearest-neighbor(NP_list - {h})
5:   h.sc <- Compute-SC(h, h.nn) // silhouette coefficient
6:   V <- V U {h} // add these nodes
7:   V <- V U {h.nn}
8:   if h.sc < th_sc then // relatively closer to the nearest neighbor
9:     E <- E U {(h, h.nn)} // add this directed edge
10:  end if
11: end for
12: count <- Con-Components(G) // find connected components
// Merging phase
13: for each pair of components (g1, g2) in G do
14:   μ1 <- mean-dist(g1), μ2 <- mean-dist(g2)
15:   if (μ1 + μ2) / (2 * centroid_dist(g1, g2)) > 1 then g1 <- Merge(g1, g2)
16: end for
// Now assign the class labels
17: N_type <- empty
18: for x in N_list do
19:   h <- PseudopointOf(x) // find the corresponding pseudopoint
20:   N_type <- N_type U {(x, h.componentno)}
21: end for
Each cluster is saved as a pseudopoint, and these pseudopoints are stored in NP_list. The graph G is created in the for loop (lines 3-11). For each pseudopoint h in NP_list, we find the nearest pseudopoint of h (line 4) and compute the silhouette coefficient of h using the formula

$$h.sc = \frac{dist(h, h.nn) - h.\mu}{\max\left(dist(h, h.nn),\ h.\mu\right)}$$

where dist(h, h.nn) is the distance between the centroids of h and h.nn, the nearest neighbor of h, and h.μ is the mean distance from the centroid of h to all instances belonging to h. In other words, h.μ is the intracluster dispersion, and dist(h, h.nn) is the intercluster distance to the nearest cluster. Therefore, h.sc is a measure of how tight the cluster is, and it ranges from +1 to -1. If h.sc is high (close to 1), h is a tight cluster far from its nearest cluster; if h.sc is low, h is not so tight and lies close to its nearest cluster.
Returning to the algorithm: in lines 6-7 we add both h and h.nn to the vertex list V. We then check whether h.sc is less than a certain threshold (th_sc), and if so we add the directed edge (h, h.nn) to the edge list E. In other words, we add an edge only if h.sc is lower than the threshold, meaning h is less tight and closer to its neighbor (lines 8-10). We use th_sc = 0.8 in all experiments. Once we have the graph G, we find the connected components (line 12) and mark each pseudopoint with the corresponding component number. For example, if there are two connected components, all pseudopoints belonging to the first component are tagged 1 and all pseudopoints belonging to the second component are tagged 2. In the merging phase (lines 13-16), we examine the different components of the graph to see whether they can be merged. For each pair of components (g1, g2), we first find the mean distance of each component's pseudopoints from the component's global centroid (line 14), and then merge the pair if the sum of these mean distances is greater than twice the distance between the global centroids of g1 and g2 (line 15). In other words, two components are merged if the mean intracomponent distance is higher than the intercomponent distance, i.e., the components are less dense and not well separated from each other.
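A self-contained sketch of both phases follows. It uses a union-find in place of an explicit edge list to extract (weakly) connected components; the Pseudopoint fields match the earlier sketch, and th_sc = 0.8 follows the experiments:

import numpy as np

TH_SC = 0.8  # silhouette threshold used in all experiments

def compute_sc(h, nn):
    # Silhouette coefficient of pseudopoint h against its nearest neighbor.
    d = float(np.linalg.norm(h.centroid - nn.centroid))
    return (d - h.mu) / max(d, h.mu)

def detect_multinovel(np_list):
    # Sketch of the separation and merging phases of Algorithm 3.
    parent = {id(h): id(h) for h in np_list}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    def union(a, b):
        parent[find(a)] = find(b)

    # Separation phase: connect h to h.nn only when h is loose/close to it
    for h in np_list:
        nn = min((p for p in np_list if p is not h),
                 key=lambda p: np.linalg.norm(h.centroid - p.centroid))
        if compute_sc(h, nn) < TH_SC:
            union(id(h), id(nn))

    # Merging phase: merge a pair of components when the sum of their mean
    # intracomponent distances exceeds twice the distance between centroids
    comps = {}
    for h in np_list:
        comps.setdefault(find(id(h)), []).append(h)
    comps = list(comps.values())
    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            g1, g2 = comps[i], comps[j]
            c1 = np.mean([h.centroid for h in g1], axis=0)
            c2 = np.mean([h.centroid for h in g2], axis=0)
            mu1 = np.mean([np.linalg.norm(h.centroid - c1) for h in g1])
            mu2 = np.mean([np.linalg.norm(h.centroid - c2) for h in g2])
            if (mu1 + mu2) / (2 * np.linalg.norm(c1 - c2)) > 1:
                union(id(g1[0]), id(g2[0]))

    # Each pseudopoint's component id is its predicted novel-class label
    return {id(h): find(id(h)) for h in np_list}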
E. A CLUSTERING ALGORITHM BASED ON
RANDOMIZED SEARCH FOR OUTLIER DETECTION
AND MCM
CLARANS (Clustering Large Applications based upon RANdomized Search) is a clustering method that draws a sample of neighbors dynamically. The clustering process is modeled as a search of a graph in which each node is a potential solution, i.e., a set of k medoids. When a local optimum is found, the search restarts from a new randomly selected node to look for a new local optimum. This method is more efficient and scalable than both PAM and CLARA.

The clustering process in Fig. 1 searches a graph in which every node is a potential solution, i.e., a set of k medoids. Two nodes are neighbors if their medoid sets differ by exactly one medoid. Each node is assigned a cost, defined as the total dissimilarity between every object and the medoid of its cluster, so the problem corresponds to searching for a minimum on the graph. In every step, a sample of the neighbors of the current node is examined, and a neighbor that decreases the cost is chosen as the next solution. Whereas CLARA draws a sample of nodes once at the beginning of the search, CLARANS draws a sample of neighbors dynamically at each step, and therefore has the benefit of not confining the search to a restricted area. When a local optimum is found, CLARANS restarts with a new randomly selected node in search of a new local optimum. The number of local optima to search for is a parameter (numlocal).
In summary, CLARANS is similar in spirit to PAM and CLARA. It starts with a randomly selected set of medoids and draws neighbors dynamically, checking up to maxneighbor candidate swaps. If a swap yields a negative cost difference, it moves to that medoid set; otherwise it declares the current selection of medoids a local optimum and restarts with a new random selection of medoids. The process stops after the requested number of local optima and returns the best one found:
1. Input the parameters numlocal and maxneighbor.
2. Randomly select k objects from the database D as medoids.
3. Mark these k objects as selected (Si) and all others as non-selected (Sh).
4. Calculate the cost T of swapping a selected object Si with a non-selected object Sh.
5. If T is negative, update the medoid set. Otherwise, record the current selection of medoids as a local optimum.
6. Restart with another random selection of medoids and search for another local optimum.
7. CLARANS stops after numlocal local optima have been found and returns the best.
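A minimal CLARANS sketch along these lines, using Euclidean distance as the dissimilarity; the cost recomputation is deliberately naive (the original algorithm evaluates swap costs incrementally), so this is an illustration rather than an efficient implementation:

import random
import numpy as np

def total_cost(data, medoids):
    # Cost of a node: total dissimilarity of every object to its nearest medoid.
    return sum(min(np.linalg.norm(x - data[m]) for m in medoids) for x in data)

def clarans(data, k, numlocal=2, maxneighbor=50):
    # Restart numlocal times; from each random node, examine up to
    # maxneighbor random neighbors (one-medoid swaps) and move whenever
    # the swap reduces the cost.
    best_cost, best_node = float("inf"), None
    n = len(data)
    for _ in range(numlocal):
        current = random.sample(range(n), k)            # random node of G_{n,k}
        cost = total_cost(data, current)
        j = 0
        while j < maxneighbor:
            # a neighbor differs from current in exactly one medoid
            out = random.randrange(k)
            candidate = random.choice([i for i in range(n) if i not in current])
            neighbor = current[:out] + [candidate] + current[out + 1:]
            new_cost = total_cost(data, neighbor)
            if new_cost < cost:                         # downhill move: reset j
                current, cost, j = neighbor, new_cost, 0
            else:
                j += 1
        if cost < best_cost:                            # current is a local optimum
            best_cost, best_node = cost, current
    return best_node, best_cost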
[Figure 1 flowchart: input maxNeighbors and numLocal; set i = 1, minCost = ∞, bestNode = -1; set current to a random node of G_{n,k} and j = 1; pick a random neighbor S of current; if TC_S < TC_current then set current = S, otherwise j++ while j < maxNeighbors; if TC_current < minCost then set minCost = TC_current and bestNode = current; i++ and repeat until i > numLocal; output bestNode and stop.]
Figure 1: Proposed Architecture Diagram
V. EXPERIMENTAL RESULTS
A. DATA SET INFORMATION
Predicting forest cover type from cartographic variables requires the forest cover type of each observation (a 30 x 30 meter cell), which was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS sources. The data is in raw form (not scaled) and contains binary (0 or 1) columns for the qualitative independent variables (wilderness areas and soil types). As for the primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4).
1. Attribute Information:
Given below are the attribute name, attribute type, measurement unit, and a brief description of each attribute. The forest cover type is the classification target. The order of this listing corresponds to the order of the columns in the database.
Name / Data Type / Measurement / Description
Elevation / quantitative /meters / Elevation in meters
Aspect / quantitative / azimuth / Aspect in degrees azimuth
Slope / quantitative / degrees / Slope in degrees
Horizontal_Distance_To_Hydrology / quantitative / meters /
Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology / quantitative / meters /
Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways / quantitative / meters /
Horz Dist to nearest roadway
Hillshade_9am / quantitative / 0 to 255 index / Hillshade
index at 9am, summer solstice
Hillshade_Noon / quantitative / 0 to 255 index / Hillshade
index at noon, summer solstice
Hillshade_3pm / quantitative / 0 to 255 index / Hillshade
index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points / quantitative / meters /
Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns) / qualitative / 0
(absence) or 1 (presence) / Wilderness area designation
Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1
(presence) / Soil Type designation
Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type
designation
2. Forest Cover Type Classes:
1 -- Spruce/Fir
2 -- Lodgepole Pine
3 -- Ponderosa Pine
4 -- Cottonwood/Willow
5 -- Aspen
6 -- Douglas-fir
7 -- Krummholz
The data set contains geospatial descriptions of different types
of forests. It contains 7 classes, 54 attributes, and around
581,000 instances. We normalize the data set, and arrange the
data so that new classes appear randomly.
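A sketch of this preprocessing step, assuming the standard UCI covtype.data layout (54 feature columns followed by the class label) and min-max normalization; the paper does not state which normalization it uses:

import numpy as np

def load_and_normalize(path="covtype.data"):
    # Load the Forest CoverType data and min-max normalize each feature
    # to [0, 1]; the last column is the class label (1-7).
    raw = np.loadtxt(path, delimiter=",")
    X, y = raw[:, :-1], raw[:, -1].astype(int)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X - lo) / span, y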
Figure 2: Cumulative average error rate
Figure 2 shows the cumulative average error (ERR) rate of each approach throughout the stream. For example, at X = 6, the Y values show the average ERR of each approach from the beginning of the stream to chunk 6 of the Forest CoverType data set. MCM and EMCM are measured at the same data points; the cumulative average error rate of EMCM is lower than that of MCM throughout.
Cumulative average error rate at each data point:

Data point   MCM    EMCM
1            0.5    0.25
1.5          0.6    0.3
2            0.85   0.35
2.5          1.2    0.4
3            1.5    0.5
3.5          1.75   0.63
4            2      0.75
4.5          2.5    0.87
5            3      1.0
5.5          3.5    1.5
6            4      2.0
Figure 3: Number of missed instance classifications
Figure 3 shows the number of missed instance classifications for each approach throughout the stream. For example, at X = 5, the Y values show the number of missed classifications of each approach from the beginning of the stream to chunk 5 of the Forest CoverType data set. At every data point, EMCM misses fewer instances than MCM.
Number of missed instance classifications at each data point:

Data point   MCM   EMCM
1            65    55
1.5          90    75
2            125   100
2.5          224   150
3            315   220
3.5          440   335
4            660   450
4.5          775   615
5            990   790
Figure 4: True novel class detection rate
Figure 4 shows the true novel class detection rate of each approach throughout the stream. For example, at X = 5, the Y values show the true novel class detection rate of each approach from the beginning of the stream to chunk 5 of the Forest CoverType data set. At every data point, the true novel class detection rate of EMCM is higher than that of MCM.
True novel class detection rate at each data point:

Data point   MCM    EMCM
1            0.1    0.3
1.5          0.24   0.39
2            0.3    0.46
2.5          0.32   0.49
3            0.35   0.5
3.5          0.37   0.59
4            0.4    0.67
4.5          0.42   0.79
5            0.44   0.9
VI. CONCLUSION AND FUTURE WORK
We present a classification and novel class detection technique for concept-drifting data streams that addresses four major challenges: infinite length, concept-evolution, concept-drift, and feature-evolution. Related novel class detection techniques for data streams either do not address the feature-evolution problem or suffer from high false alarm and false detection rates in many cases. To solve this problem, we propose an improved outlier detection technique that clusters similar data together and builds a slack space outside the decision boundary of each classification model; this slack space is adapted to the characteristics of the evolving data. In data mining, clustering is the process of grouping data that have high similarity to one another, and the outlier detection process uses the decision boundary of the ensemble of models to decide whether or not an instance is an outlier. Using CLARANS improves detection accuracy and reduces time complexity compared with other algorithms. An interesting and relevant question here is what happens if one class splits into several classes. If the classes produced by the split occupy the same feature space, i.e., the feature space covered before the split is the same as the union of the feature spaces covered after the split, then none of the new classes will be detected as novel, because our novel class detection technique detects a class as novel only if it is found in previously unused (unoccupied) feature space.

However, if part of one or more of the new classes occupies new feature space, those parts will be detected as novel. Future work will be to identify this special case more precisely, to distinguish it from the actual arrival of a novel class.