Let $\bar{D}_{c,q}(h)$ denote the weighted average distance from $h$ to the pseudopoints $e_i$ of an existing class $c$:

$$\bar{D}_{c,q}(h) = \frac{\sum_{i=1}^{m} weight(e_i)\, D(h, e_i)}{\sum_{i=1}^{m} weight(e_i)}$$

$$\bar{D}_{c_{out},q}(h) = \frac{weight(h)\,\bar{h}.\mu + \sum_{i=1}^{r-1} weight(h_i)\, D(h, h_i)}{weight(h) + \sum_{i=1}^{r-1} weight(h_i)}$$

where $\bar{h}.\mu$ is the mean intracluster distance of the cluster corresponding to the O-pseudopoint $h$. Therefore, q-NSC(h) is given by

$$q\text{-}NSC(h) = \frac{\bar{D}_{c_{min},q}(h) - \bar{D}_{c_{out},q}(h)}{\max\left(\bar{D}_{c_{min},q}(h),\ \bar{D}_{c_{out},q}(h)\right)} \qquad (1)$$
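As a concrete illustration, the following is a minimal Python sketch of these two averages and of q-NSC. The Pseudopoint record, the Euclidean distance, and the selection of the q nearest pseudopoints are assumptions standing in for the paper's actual data structures:

from dataclasses import dataclass
import numpy as np

@dataclass
class Pseudopoint:
    centroid: np.ndarray
    weight: int            # number of instances in the cluster
    mu: float = 0.0        # mean intracluster distance (h.mu in the text)

def euclid(a, b):
    return float(np.linalg.norm(a - b))

def q_nsc(h, existing_by_class, other_outliers, q):
    # D_{c,q}(h): weighted average distance to the q nearest pseudopoints
    # of an existing class c
    def wavg(points):
        pts = sorted(points, key=lambda p: euclid(h.centroid, p.centroid))[:q]
        num = sum(p.weight * euclid(h.centroid, p.centroid) for p in pts)
        return num / sum(p.weight for p in pts)
    # D_{c_min,q}(h): the minimum over all existing classes
    d_min = min(wavg(pts) for pts in existing_by_class.values())
    # D_{c_out,q}(h): h's own spread combined with the r-1 other O-pseudopoints
    num = h.weight * h.mu + sum(p.weight * euclid(h.centroid, p.centroid)
                                for p in other_outliers)
    d_out = num / (h.weight + sum(p.weight for p in other_outliers))
    return (d_min - d_out) / max(d_min, d_out)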
Algorithm 1 Detect-Novel(M, Buf)
Input: M: current ensemble of best L classifiers
Buf: buffer temporarily holding F-outlier instances
Output: identification of the novel class instances
1: K0 <- (K * |Buf| / S) // S = chunk size, K = clusters per chunk
2: K-means(Buf, K0) // create K0 O-pseudopoints
3: for each classifier Mi in M do
4:   tp <- 0
5:   for each cluster h in H do
6:     h.sc <- q-NSC(h) // (equation 1)
7:     if h.sc > 0 then
8:       tp += h.size // total instances in the cluster
9:       for each instance x in h.cluster do x.sc <- max(x.sc, h.sc)
10:    end if
11:  end for
12:  if tp > q then vote++
13: end for
14: if vote == L then // found novel class, identify novel instances
15:   Xnov <- all instances x with x.sc > 0
16:   for all x in Xnov do
17:     x.ns <- Nscore(x) // equation 2
18:     if x.ns > Gini_th then N_list <- N_list U {x}
19:   end for
20:   Detect-Multinovel(N_list) // algorithm 3
21: end if
Let H be the set of O-pseudopoints. For each classifier Mi in M, we compute the q-NSC value of every O-pseudopoint h in H. If the q-NSC value of an O-pseudopoint h is positive, we assign the q-NSC value of h to each instance in the corresponding cluster. Note that the q-NSC value of an instance may be different for different classifiers; we save the maximum value. We also keep a counter (tp) that counts the total number of instances having positive q-NSC. If tp is greater than the threshold q, the corresponding classifier votes in favor of a novel class. If all the classifiers in the ensemble vote for a novel class, we declare a novel class and then identify the novel class instances: we compute the Nscore(x) value of every instance x having positive q-NSC, and if Nscore(x) is greater than a threshold (Gini_th), we tag the instance as a novel class instance and save it in the novel instance list (N_list). Finally, we check whether there is more than one novel class. We assume that the labels of the instances in the current chunk become available when the next chunk of data (i.e., the next S instances) appears in the stream. In addition, any outlier instance in Buf that becomes older than S, the chunk size, is removed.
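The following sketch shows how the voting logic of Algorithm 1 might look in Python. Here kmeans, nscore, and detect_multinovel are hypothetical helpers, and the per-classifier q-NSC computation is folded into a model.q_nsc method for brevity:

def detect_novel(ensemble, buf, S, K, q, gini_th):
    # Sketch of Algorithm 1: K0 O-pseudopoints are built from the buffered
    # F-outliers; each classifier votes if more than q instances fall in
    # clusters with positive q-NSC.
    K0 = max(1, K * len(buf) // S)
    clusters = kmeans(buf, K0)                 # hypothetical K-means helper
    scores = {id(x): 0.0 for x in buf}         # x.sc, one slot per instance
    votes = 0
    for model in ensemble:
        tp = 0                                 # instances with positive q-NSC
        for h in clusters:
            sc = model.q_nsc(h)                # equation (1), per classifier
            if sc > 0:
                tp += h.size
                for x in h.instances:          # keep the maximum score seen
                    scores[id(x)] = max(scores[id(x)], sc)
        if tp > q:
            votes += 1                         # this classifier votes "novel"
    if votes == len(ensemble):                 # unanimous: novel class found
        x_nov = [x for x in buf if scores[id(x)] > 0]
        n_list = [x for x in x_nov if nscore(x) > gini_th]   # equation (2)
        detect_multinovel(n_list)              # Algorithm 3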
C. OUTLIER DETECTION USING ADAPTIVE
THRESHOLD
The algorithm takes the latest labeled instance x and the current OUTTH as input. It checks whether x was a false-novel instance, meaning that x belongs to an existing class but was falsely detected as a novel class instance. If x is false-novel, then it must have been an F-outlier, and therefore inst_weight(x) < OUTTH. If the difference OUTTH - inst_weight(x) is less than a small constant ε, we call x a marginal false-novel instance. If x is found to be a marginal false-novel instance, OUTTH is lowered so that future instances like it do not fall outside the decision boundary: OUTTH is decreased by the small value ε, which increases the slack space beyond the surface of a hypersphere. On the other hand, if x is a marginal false-existing instance, i.e., x is a novel class instance that was wrongly identified as an existing class instance by a narrow margin, then the slack space must be decreased (OUTTH increased). This is done by increasing OUTTH by ε. The marginal constraint is applied to avoid drastic changes in the OUTTH value: if the test instance is neither a marginal false-novel nor a marginal false-existing instance, OUTTH is left unchanged.
Algorithm 2 Adjust-threshold(x, OUTTH)
Input: x: most recent labeled instance
OUTTH: current outlier threshold
Output: OUTTH: new outlier threshold
1: if false-novel(x) && OUTTH - inst_weight(x) < ε then
2:   OUTTH -= ε // increase slack space
3: else if false-existing(x) && inst_weight(x) - OUTTH < ε then
4:   OUTTH += ε // decrease slack space
5: end if
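A direct transcription of Algorithm 2 into Python, assuming hypothetical helpers is_false_novel, is_false_existing, and inst_weight; the value of ε is a tuning parameter, not one fixed by the paper:

EPS = 0.01  # the small constant ε; an assumed placeholder value

def adjust_threshold(x, outth, eps=EPS):
    # Sketch of Algorithm 2: nudge OUTTH only for *marginal* mistakes,
    # to avoid drastic changes in the threshold.
    if is_false_novel(x) and outth - inst_weight(x) < eps:
        outth -= eps   # increase slack space beyond the decision boundary
    elif is_false_existing(x) and inst_weight(x) - outth < eps:
        outth += eps   # decrease slack space
    return outth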
D. NOVEL CLASS DETECTION USING GINI
COEFFICIENT
After detecting the F-outlier instances using the OUTTH value discussed in the previous section, we compute the q-NSC(x) value of each F-outlier instance x. If the q-NSC(x) value is negative, x is removed from consideration, i.e., x is regarded as an existing class instance. For the remaining F-outliers, q-NSC(.) lies within the range [0, 1]. We then compute a compound measure for each such F-outlier, called the Novelty score or Nscore:

$$N_{score}(x) = \frac{1 - inst\_weight(x)}{1 - minweight} \cdot q\text{-}NSC(x) \qquad (2)$$

where minweight is the minimum inst_weight among all F-outliers having positive q-NSC. Nscore contains two parts: the first part measures how far the outlier is from its nearest existing class pseudopoint, and the second part measures the cohesion of the F-outlier with other F-outliers and its separation from the existing class instances.

We divide the Nscore(x) values into n equal intervals (or bins) and construct a cumulative distribution function (CDF) of Nscore. Let $y_i$ be the value of the CDF for the $i$th interval. We compute the discrete Gini coefficient G(s) for the sample $y_1, \dots, y_n$ in the standard form

$$G(s) = \frac{1}{n}\left(n + 1 - 2\,\frac{\sum_{i=1}^{n} (n + 1 - i)\, y_i}{\sum_{i=1}^{n} y_i}\right)$$

and decide as follows:
. If G(s) > (n-1)/(3n), declare a novel class and tag the F-outliers as novel class instances.
. If G(s) = 0, classify the F-outliers as existing class instances.
. If G(s) is in (0, (n-1)/(3n)], filter out the F-outliers falling in the first interval, and consider the rest of the F-outliers as novel class.
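A sketch of this decision procedure, assuming each F-outlier is a mutable record carrying its inst_weight and a precomputed q-NSC value; the binning range [0, 1] follows from Nscore lying in that interval:

import numpy as np

def novelty_decision(f_outliers, n_bins=10):
    # Sketch of the Gini-coefficient test described above.
    kept = [x for x in f_outliers if x.qnsc > 0]       # drop negative q-NSC
    if not kept:
        return []
    minweight = min(x.inst_weight for x in kept)
    for x in kept:                                     # equation (2)
        x.nscore = (1 - x.inst_weight) / (1 - minweight) * x.qnsc

    # CDF of Nscore over n equal-width bins
    hist, edges = np.histogram([x.nscore for x in kept],
                               bins=n_bins, range=(0.0, 1.0))
    y = np.cumsum(hist) / len(kept)                    # y_i = CDF of bin i
    n = len(y)
    gini = (n + 1 - 2 * np.sum((n - np.arange(n)) * y) / np.sum(y)) / n

    if gini > (n - 1) / (3 * n):
        return kept                                    # all tagged as novel
    if gini == 0:
        return []                                      # all existing class
    # otherwise drop the F-outliers falling in the first interval
    return [x for x in kept if x.nscore > edges[1]]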
Multiple Novel Class Detection
To detect multiple novel classes, we construct a graph and identify its connected components. The number of novel classes is determined by the number of connected components.
Algorithm 3 Detect-Multinovel(N_list)
Input: N_list: list of novel class instances
Output: N_type: predicted class labels of the novel instances
// Separation phase
1: G = (V, E) <- empty // initialize graph
2: NP_list <- K-means(N_list, Kv) // clustering
3: for h in NP_list do
4:   h.nn <- Nearest-neighbor(NP_list - {h})
5:   h.sc <- Compute-SC(h, h.nn) // silhouette coefficient
6:   V <- V U {h} // add these nodes
7:   V <- V U {h.nn}
8:   if h.sc < th_sc then // relatively closer to the nearest neighbor
9:     E <- E U {(h, h.nn)} // add this directed edge
10:  end if
11: end for
12: count <- Con-Components(G) // find connected components
// Merging phase
13: for each pair of components (g1, g2) in G do
14:   μ1 <- mean-dist(g1), μ2 <- mean-dist(g2)
15:   if (μ1 + μ2) / (2 * centroid_dist(g1, g2)) > 1 then g1 <- Merge(g1, g2)
16: end for
// Now assign the class labels
17: N_type <- empty
18: for x in N_list do
19:   h <- PseudopointOf(x) // find the corresponding pseudopoint
20:   N_type <- N_type U {(x, h.componentno)}
21: end for
Each cluster is saved as a pseudopoint, and these pseudopoints are stored in NP_list. The graph G is created in the for loop (lines 3-11). For each pseudopoint h in NP_list, we find the nearest pseudopoint of h (line 4) and compute the silhouette coefficient of h using the formula

$$h.sc = \frac{dist(h, h.nn) - h.\mu}{\max\left(dist(h, h.nn),\ h.\mu\right)}$$

where dist(h, h.nn) is the distance between the centroids of h and h.nn, the nearest neighbor of h, and h.μ is the mean distance from the centroid of h to all instances belonging to h. In other words, h.μ is the intracluster dispersion, and dist(h, h.nn) is the intercluster distance to the nearest cluster. Therefore, h.sc is a measure of how tight the cluster is, and it ranges from +1 to -1. If h.sc is high (close to 1), h is a tight cluster far from its nearest cluster; if h.sc is low, h is not so tight and lies close to its nearest cluster.
Returning to the algorithm: in lines 6-7 we add both h and h.nn to the vertex list V. We then check whether h.sc is less than a certain threshold (th_sc), and if so we add the directed edge (h, h.nn) to the edge list E. In other words, we add an edge only if h.sc is lower than the threshold, meaning h is less tight and closer to its neighbor (lines 8-10). We use th_sc = 0.8 in all experiments. Once we have the graph G, we find the connected components (line 12) and mark each pseudopoint with the corresponding component number. For example, if there are two connected components, all pseudopoints belonging to the first component are tagged 1 and all pseudopoints belonging to the second component are tagged 2. In the merging phase (lines 13-16), we examine the different components of the graph to see whether they can be merged. For each pair of components (g1, g2), we first find the mean distance of each component's pseudopoints from the component's global centroid (line 14), and then merge the pair if the sum of these mean distances is greater than twice the distance between the global centroids of g1 and g2 (line 15). In other words, two components are merged if the mean intracomponent distance is higher than the intercomponent distance, i.e., the components are less dense and not well separated from each other.
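A self-contained sketch of both phases follows. It uses a union-find in place of an explicit edge list to extract (weakly) connected components; the Pseudopoint fields match the earlier sketch, and th_sc = 0.8 follows the experiments:

import numpy as np

TH_SC = 0.8  # silhouette threshold used in all experiments

def compute_sc(h, nn):
    # Silhouette coefficient of pseudopoint h against its nearest neighbor.
    d = float(np.linalg.norm(h.centroid - nn.centroid))
    return (d - h.mu) / max(d, h.mu)

def detect_multinovel(np_list):
    # Sketch of the separation and merging phases of Algorithm 3.
    parent = {id(h): id(h) for h in np_list}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    def union(a, b):
        parent[find(a)] = find(b)

    # Separation phase: connect h to h.nn only when h is loose/close to it
    for h in np_list:
        nn = min((p for p in np_list if p is not h),
                 key=lambda p: np.linalg.norm(h.centroid - p.centroid))
        if compute_sc(h, nn) < TH_SC:
            union(id(h), id(nn))

    # Merging phase: merge a pair of components when the sum of their mean
    # intracomponent distances exceeds twice the distance between centroids
    comps = {}
    for h in np_list:
        comps.setdefault(find(id(h)), []).append(h)
    comps = list(comps.values())
    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):
            g1, g2 = comps[i], comps[j]
            c1 = np.mean([h.centroid for h in g1], axis=0)
            c2 = np.mean([h.centroid for h in g2], axis=0)
            mu1 = np.mean([np.linalg.norm(h.centroid - c1) for h in g1])
            mu2 = np.mean([np.linalg.norm(h.centroid - c2) for h in g2])
            if (mu1 + mu2) / (2 * np.linalg.norm(c1 - c2)) > 1:
                union(id(g1[0]), id(g2[0]))

    # Each pseudopoint's component id is its predicted novel-class label
    return {id(h): find(id(h)) for h in np_list}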
E. A CLUSTERING ALGORITHM BASED ON
RANDOMIZED SEARCH FOR OUTLIER DETECTION
AND MCM
CLARANS (Clustering Large Applications based upon RANdomized Search) is a clustering method that draws a sample of neighbors dynamically. The clustering process is modeled as a search of a graph in which each node is a potential solution, i.e., a set of k medoids. When a local optimum is found, the search restarts from a new randomly selected node to look for a new local optimum. This method is more efficient and scalable than both PAM and CLARA.

The clustering process in Fig. 1 searches a graph in which every node is a potential solution, i.e., a set of k medoids. Two nodes are neighbors if their medoid sets differ by exactly one medoid. Each node is assigned a cost, defined as the total dissimilarity between every object and the medoid of its cluster, so the problem corresponds to searching for a minimum on the graph. In every step, a sample of the neighbors of the current node is examined, and a neighbor that decreases the cost is chosen as the next solution. Whereas CLARA draws a sample of nodes once at the beginning of the search, CLARANS draws a sample of neighbors dynamically at each step, and therefore has the benefit of not confining the search to a restricted area. When a local optimum is found, CLARANS restarts with a new randomly selected node in search of a new local optimum. The number of local optima to search for is a parameter (numlocal).
In summary, CLARANS is similar in spirit to PAM and CLARA. It starts with a randomly selected set of medoids and draws neighbors dynamically, checking up to maxneighbor candidate swaps. If a swap yields a negative cost difference, it moves to that medoid set; otherwise it declares the current selection of medoids a local optimum and restarts with a new random selection of medoids. The process stops after the requested number of local optima and returns the best one found:
1. Input the parameters numlocal and maxneighbor.
2. Randomly select k objects from the database D as medoids.
3. Mark these k objects as selected (Si) and all others as non-selected (Sh).
4. Calculate the cost T of swapping a selected object Si with a non-selected object Sh.
5. If T is negative, update the medoid set. Otherwise, record the current selection of medoids as a local optimum.
6. Restart with another random selection of medoids and search for another local optimum.
7. CLARANS stops after numlocal local optima have been found and returns the best.
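A minimal CLARANS sketch along these lines, using Euclidean distance as the dissimilarity; the cost recomputation is deliberately naive (the original algorithm evaluates swap costs incrementally), so this is an illustration rather than an efficient implementation:

import random
import numpy as np

def total_cost(data, medoids):
    # Cost of a node: total dissimilarity of every object to its nearest medoid.
    return sum(min(np.linalg.norm(x - data[m]) for m in medoids) for x in data)

def clarans(data, k, numlocal=2, maxneighbor=50):
    # Restart numlocal times; from each random node, examine up to
    # maxneighbor random neighbors (one-medoid swaps) and move whenever
    # the swap reduces the cost.
    best_cost, best_node = float("inf"), None
    n = len(data)
    for _ in range(numlocal):
        current = random.sample(range(n), k)            # random node of G_{n,k}
        cost = total_cost(data, current)
        j = 0
        while j < maxneighbor:
            # a neighbor differs from current in exactly one medoid
            out = random.randrange(k)
            candidate = random.choice([i for i in range(n) if i not in current])
            neighbor = current[:out] + [candidate] + current[out + 1:]
            new_cost = total_cost(data, neighbor)
            if new_cost < cost:                         # downhill move: reset j
                current, cost, j = neighbor, new_cost, 0
            else:
                j += 1
        if cost < best_cost:                            # current is a local optimum
            best_cost, best_node = cost, current
    return best_node, best_cost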
[Figure 1 flowchart: input maxNeighbors and numLocal; set i = 1, minCost = ∞, bestNode = -1; set current to a random node of G_{n,k} and j = 1; pick a random neighbor S of current; if TC_S < TC_current then set current = S, otherwise j++ while j < maxNeighbors; if TC_current < minCost then set minCost = TC_current and bestNode = current; i++ and repeat until i > numLocal; output bestNode and stop.]
Figure 1: Proposed Architecture Diagram
V. EXPERIMENTAL RESULTS
A. DATA SET INFORMATION
Predicting forest cover type from cartographic variables requires the forest cover type of each observation (a 30 x 30 meter cell), which was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from US Geological Survey (USGS) and USFS sources. The data is in raw form (not scaled) and contains binary (0 or 1) columns for the qualitative independent variables (wilderness areas and soil types). As for the primary major tree species in these areas, Neota would have spruce/fir (type 1), while Rawah and Comanche Peak would probably have lodgepole pine (type 2) as their primary species, followed by spruce/fir and aspen (type 5). Cache la Poudre would tend to have ponderosa pine (type 3), Douglas-fir (type 6), and cottonwood/willow (type 4).
1. Attribute Information:
Given below are the attribute name, attribute type, measurement unit, and a brief description of each attribute. The forest cover type is the classification target. The order of this listing corresponds to the order of the columns in the database.
Name / Data Type / Measurement / Description
Elevation / quantitative /meters / Elevation in meters
Aspect / quantitative / azimuth / Aspect in degrees azimuth
Slope / quantitative / degrees / Slope in degrees
Horizontal_Distance_To_Hydrology / quantitative / meters /
Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology / quantitative / meters /
Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways / quantitative / meters /
Horz Dist to nearest roadway
Hillshade_9am / quantitative / 0 to 255 index / Hillshade
index at 9am, summer solstice
Hillshade_Noon / quantitative / 0 to 255 index / Hillshade
index at noon, summer solstice
Hillshade_3pm / quantitative / 0 to 255 index / Hillshade
index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points / quantitative / meters /
Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns) / qualitative / 0
(absence) or 1 (presence) / Wilderness area designation
Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1
(presence) / Soil Type designation
Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type
designation
2. Forest Cover Type Classes:
1 -- Spruce/Fir
2 -- Lodgepole Pine
3 -- Ponderosa Pine
4 -- Cottonwood/Willow
5 -- Aspen
6 -- Douglas-fir
7 -- Krummholz
The data set contains geospatial descriptions of different types
of forests. It contains 7 classes, 54 attributes, and around
581,000 instances. We normalize the data set, and arrange the
data so that new classes appear randomly.
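A sketch of this preprocessing step, assuming the standard UCI covtype.data layout (54 feature columns followed by the class label) and min-max normalization; the paper does not state which normalization it uses:

import numpy as np

def load_and_normalize(path="covtype.data"):
    # Load the Forest CoverType data and min-max normalize each feature
    # to [0, 1]; the last column is the class label (1-7).
    raw = np.loadtxt(path, delimiter=",")
    X, y = raw[:, :-1], raw[:, -1].astype(int)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard constant columns
    return (X - lo) / span, y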
Figure 2: Cumulative average error rate
Figure 2 shows the cumulative average error (ERR) rate of each approach throughout the stream. For example, at X = 6, the Y values show the average ERR of each approach from the beginning of the stream to chunk 6 of the Forest CoverType data set. MCM and EMCM are measured at the same data points; the cumulative average error rate of EMCM is lower than that of MCM throughout.
Cumulative average error rate at each data point:

Data point   MCM    EMCM
1            0.5    0.25
1.5          0.6    0.3
2            0.85   0.35
2.5          1.2    0.4
3            1.5    0.5
3.5          1.75   0.63
4            2      0.75
4.5          2.5    0.87
5            3      1.0
5.5          3.5    1.5
6            4      2.0
Figure 3: Number of missed instance classifications
Figure 3 shows the number of missed instance classifications for each approach throughout the stream. For example, at X = 5, the Y values show the number of missed classifications of each approach from the beginning of the stream to chunk 5 of the Forest CoverType data set. At every data point, EMCM misses fewer instances than MCM.
Number of missed instance classifications at each data point:

Data point   MCM   EMCM
1            65    55
1.5          90    75
2            125   100
2.5          224   150
3            315   220
3.5          440   335
4            660   450
4.5          775   615
5            990   790
Figure 4: True novel class detection rate
Figure 4 shows the true novel class detection rate of each approach throughout the stream. For example, at X = 5, the Y values show the true novel class detection rate of each approach from the beginning of the stream to chunk 5 of the Forest CoverType data set. At every data point, the true novel class detection rate of EMCM is higher than that of MCM.
True novel class detection rate at each data point:

Data point   MCM    EMCM
1            0.1    0.3
1.5          0.24   0.39
2            0.3    0.46
2.5          0.32   0.49
3            0.35   0.5
3.5          0.37   0.59
4            0.4    0.67
4.5          0.42   0.79
5            0.44   0.9
VI. CONCLUSION AND FUTURE WORK
We present a classification and novel class detection technique for concept-drifting data streams that addresses four major challenges: infinite length, concept-evolution, concept-drift, and feature-evolution. Related novel class detection techniques for data streams either do not address the feature-evolution problem or suffer from high false alarm and false detection rates in many cases. To solve this problem, we propose an improved outlier detection technique that clusters similar data together and builds a slack space outside the decision boundary of each classification model; this slack space is adapted to the characteristics of the evolving data. In data mining, clustering is the process of grouping data that have high similarity to one another, and the outlier detection process uses the decision boundary of the ensemble of models to decide whether or not an instance is an outlier. Using CLARANS improves detection accuracy and reduces time complexity compared with other algorithms. An interesting and relevant question here is what happens if one class splits into several classes. If the classes produced by the split occupy the same feature space, i.e., the feature space covered before the split is the same as the union of the feature spaces covered after the split, then none of the new classes will be detected as novel, because our novel class detection technique detects a class as novel only if it is found in previously unused (unoccupied) feature space.

However, if part of one or more of the new classes occupies new feature space, those parts will be detected as novel. Future work will be to identify this special case more precisely, to distinguish it from the actual arrival of a novel class.