R2L (i.e., unauthorized access from a remote machine), U2R (i.e., unauthorized access to local superuser privileges), and PROBING (i.e., surveillance and other probing). As a result, the data contains a total of five clusters, including the class for “normal connections”. The attack types are further classified into one of 24 types, such as buffer-overflow, guess-passwd, neptune, portsweep, rootkit, smurf, warezclient, spy, and so on. It is evident that each specific attack type can be treated as a sub-cluster. Most of the connections in this dataset are normal, but occasionally there could be a burst of attacks at certain times. Also, each connection record in this dataset contains 42 attributes, such as the duration of the connection, the number of data bytes transmitted from source to destination (and vice versa), the percentage of connections that have “SYN” errors, the number of “root” accesses, etc. As in [13], all 34 continuous attributes will be used for clustering and one outlier point has been removed.

Figure 1: Quality comparison (Network Intrusion dataset, horizon=1, stream speed=2000). Axes: Average SSQ (log scale) versus Stream (in time units).

Second, besides testing on the rapidly evolving network intrusion data stream, we also test our method over relatively stable streams. Since previously reported stream clustering algorithms work on the entire history of stream data, we believe that they should perform effectively for some datasets with a relatively stable distribution over time. An example of such a data set is the KDD-CUP’98 Charitable Donation data set. We will show that even for such datasets, CluStream can consistently outperform the STREAM algorithm.

Figure 2: Quality comparison (Network Intrusion dataset, horizon=256, stream speed=200). Axes: Average SSQ (log scale) versus Stream (in time units); series: CluStream, STREAM.

The KDD-CUP’98 Charitable Donation data set
has also been used in evaluating several one-scan clustering algorithms, such as [7]. This dataset contains 95412 records of information about people who have made charitable donations in response to direct mailing requests, and clustering can be used to group donors showing similar donation behavior. As in [7], we will only use 56 fields which can be extracted from the total 481 fields of each record. This data set is converted into a data stream by taking the data input order as the order of streaming and assuming that the records flow in at a uniform speed.

Synthetic datasets. To test the scalability of CluStream, we generate some synthetic datasets by varying the base size from 100K to 1000K points, the number of clusters from 4 to 64, and the dimensionality in the range of 10 to 100. Because we know the true cluster distribution a priori, we can compare the clusters found with the true clusters. The data points of each synthetic dataset will follow a series of Gaussian distributions. In order to reflect the evolution of the stream data over time, we change the mean and variance of the current Gaussian distribution every 10K points in the synthetic data generation.

The quality of clustering on the real data sets was measured using the sum of square distance (SSQ), defined as follows. Assume that there are a total of $n_h$ points in the past horizon at current time $T_c$. For each point $p_i$ in this horizon, we find the centroid $C_{p_i}$ of its closest macro-cluster, and compute $d(p_i, C_{p_i})$, the distance between $p_i$ and $C_{p_i}$. Then the SSQ at time $T_c$ with horizon $H$ (denoted as $SSQ(T_c, H)$) is equal to the sum of $d^2(p_i, C_{p_i})$ for all the $n_h$ points within the previous horizon $H$. Unless otherwise mentioned, the algorithm parameters were set at $\alpha = 2$, $l = 10$, $InitNumber = 2000$, $\delta = 512$, and $t = 2$.

6.2 Clustering Evaluation

One novel feature of CluStream is that it can create a set of macro-clusters for any user-specified horizon at any time upon demand. Furthermore, we expect CluStream to be more effective than current algorithms at clustering rapidly evolving data streams. We will first show the effectiveness and high quality of CluStream in detecting network intrusions.

We compare the clustering quality of CluStream with that of STREAM for different horizons at different times using the Network Intrusion dataset. For

[Plot, caption not recovered in this excerpt: CluStream versus STREAM; x-axis: Stream (in time units).]

[Plot, caption not recovered in this excerpt: CluStream; x-axis: Elapsed time (in seconds).]

Figure 4: Quality comparison (Charitable Donation dataset, horizon=16, stream speed=200). Axes: Average SSQ versus Stream (in time units); series: CluStream, STREAM.

Figure 7: Scalability with Data Dimensionality (stream speed=2000). Axes: runtime (in seconds) versus Number of dimensions; series: B400C20, B200C10, B100C5.
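
The SSQ measure behind the quality comparisons can be illustrated with a short sketch. It assumes that the distance $d$ is Euclidean (the text only says "distance") and that the macro-cluster centroids for the horizon are already available; the function names `squared_distance` and `ssq` are hypothetical and chosen only for this illustration.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(p: Point, c: Point) -> float:
    """Squared distance d^2(p, c); Euclidean distance is an assumption."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def ssq(points_in_horizon: Sequence[Point], centroids: Sequence[Point]) -> float:
    """Sum, over all points in the horizon, of the squared distance
    from each point to the centroid of its closest macro-cluster."""
    if not centroids:
        raise ValueError("need at least one macro-cluster centroid")
    return sum(
        min(squared_distance(p, c) for c in centroids)
        for p in points_in_horizon
    )


# Toy usage: three points in the horizon, two macro-cluster centroids.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 9.0)]
print(ssq(points, centroids))
```

A lower value means the points in the horizon lie closer to their assigned centroids. How the plotted "Average SSQ" is averaged is not stated in this excerpt, so the sketch stops at the plain sum.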

[Plot, caption not recovered in this excerpt: Number of points processed per second versus Elapsed time (in seconds); series: CluStream, STREAM.]

[Plot, caption not recovered in this excerpt: runtime (in seconds) versus Number of clusters; series: B400D40, B200D20, B100D10.]