
Lightweight Hierarchical Network Traffic Clustering
Abdulrahman Hijazi (1), Hajime Inoue (2), Anil Somayaji (1)

(1) Carleton Computer Security Laboratory, Carleton University, Ottawa, ON, Canada
(2) ATC-NY, Ithaca, NY
{ahijazi, hinoue, soma}@ccsl.carleton.ca

Abstract: We summarize our work with ADHIC (Approximate Divisive HIerarchical Clusterer),
a lightweight, online, divisive hierarchical clustering algorithm tailored to the domain of network
traffic clustering. We then briefly describe our implementation of ADHIC, NetADHICT, which
serves as a tool to system administrators. The key innovation is that it can identify and present
a hierarchical decomposition of traffic based upon the learned structure of whole packets without
prior knowledge of protocol structures. ADHIC needs only a small fraction of packets to generate
the cluster decision tree, and the generated tree can be used to cluster packets at wire speeds. Our
experiments show NetADHICT can appropriately segregate well-known protocols, cluster traffic of
the same protocol together even if it is running on multiple ports, and segregate p2p traffic that
uses non-standard ports. We believe that ADHIC and NetADHICT are a useful complement to
critical applications used for performance analysis, identification of worms and flash crowds, and
Denial-of-Service resistant bandwidth management.

1. Introduction and Related Work


Analyzing and understanding the behavior of network traffic is very challenging. Two common approaches are the
use of classifiers based on ports and IP addresses and protocol dissectors which operate on reconstructed streams.
Neither performs well for many applications, particularly in the face of adversaries. The first strategy is high
performance, but fails with unknown or unusually configured protocols. The second is very slow and requires a
dissector for every protocol.
Our strategy is not to solve the entire problem. Instead, we obtain a higher-level view by clustering packets into
equivalence classes. ADHIC (3) does not use features calculated from packet characteristics; it relies on
the raw packet data. We divide clusters by calculating the frequencies of substrings at fixed offsets within packets,
which we call (p, n)-grams (6), where p is the offset and n is the substring length. This allows us to continually
monitor traffic and adjust the clusterer to best match it. The resulting clusters can then be presented to system
administrators to examine or passed to application-specific tools for further analysis.
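To make the (p, n)-gram idea concrete, the following is a minimal Python sketch of counting how often each fixed-offset substring occurs in a packet sample. The function name `pn_gram_frequencies`, the `max_offset` bound, and the toy payloads are assumptions made for illustration; they are not taken from the ADHIC implementation.

```python
from collections import Counter


def pn_gram_frequencies(packets, n, max_offset):
    """Return the fraction of packets containing each (p, n)-gram.

    A (p, n)-gram is modeled here as the pair (p, substring), where
    substring is the n bytes starting at offset p of a packet.
    Offsets 0..max_offset are examined (hypothetical parameter).
    """
    counts = Counter()
    for pkt in packets:
        # Only offsets where a full n-byte substring fits are counted.
        last = min(max_offset, len(pkt) - n)
        for p in range(last + 1):
            counts[(p, pkt[p:p + n])] += 1
    total = len(packets)
    return {gram: c / total for gram, c in counts.items()}


# Toy sample: raw byte strings standing in for whole packets.
packets = [
    b"GET /index.html",
    b"GET /about.html",
    b"POST /login",
    b"\x00\x01binary",
]
freqs = pn_gram_frequencies(packets, n=3, max_offset=4)
print(freqs[(0, b"GET")])  # fraction of packets with "GET" at offset 0
```

Each packet contributes at most once per offset, so the values are per-packet frequencies rather than raw substring counts.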
Machine learning techniques have been applied to network traffic analysis before, most notably by Estan, Savage,
and Varghese with their AutoFocus cluster, which uses IP header fields. More recently, Erman used K-means to
cluster and then labelled the results to build a classifier (2). Williams, Zander, and Armitage tested various
classifiers on flows (7).
Our approach differs from most of these ML methods in several ways. ADHIC is unsupervised in every stage of
operation: it does not rely on prior knowledge of protocols, nor do we bias our selection of packet substrings. Also,
we do not attempt to use ADHIC as a first step in building a classifier, nor do we require flow reconstruction.
Moreover, most traditional machine learning cluster algorithms assume offline analysis that involves a quadratic
or greater number of comparisons (1). Our clustering algorithm uses only a small number of packets during each
monitoring period to construct a cluster decision tree. Because of this, ADHIC is able to assign packets to clusters
and adapt the cluster decision tree to changing traffic patterns simultaneously.

2. ADHIC and NetADHICT


ADHIC clusters traffic by recursively subdividing network traffic into binary classes. Division stops when the
traffic assigned to the resulting clusters falls below some configurable volume threshold, or when that traffic
becomes too similar or too dissimilar to split further. Our approach is Approximate because we use a sampled measure of similarity. This method produces a
binary decision tree that consists of internal decision nodes and terminal clusters. Each internal node is associated
with a (p, n)-gram. Packets are assigned to clusters depending on whether they contain the substring. We say
that packets that match are directed to the left child, and ones that do not are directed to the right child. The
rightmost child is called the default node, because default packets have not matched any (p, n)-gram.
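The routing of a packet through such a tree can be sketched as follows. This is an illustrative Python sketch only: the class names `Decision` and `Cluster`, the function `assign`, and the example (p, n)-grams are hypothetical and do not come from NetADHICT.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Cluster:
    """Terminal node: a cluster of packets."""
    name: str


@dataclass
class Decision:
    """Internal node holding one (p, n)-gram."""
    offset: int      # p: byte offset within the packet
    gram: bytes      # the n-byte substring expected at that offset
    match: "Node"    # left child: packets containing the gram
    miss: "Node"     # right child: packets that do not


Node = Union[Cluster, Decision]


def assign(node, packet):
    """Walk from the root to a terminal cluster for one packet."""
    while isinstance(node, Decision):
        end = node.offset + len(node.gram)
        if packet[node.offset:end] == node.gram:
            node = node.match
        else:
            node = node.miss
    return node


# Hypothetical two-level tree; the rightmost leaf is the default node.
tree = Decision(0, b"GET", Cluster("A"),
                Decision(0, b"POST", Cluster("B"), Cluster("default")))

print(assign(tree, b"GET /index.html").name)
print(assign(tree, b"\x16\x03\x01").name)
```

Because each step is a single fixed-offset comparison, the cost of assigning a packet is bounded by the depth of the tree.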
The tree is generated and adapts through two operators: split and delete. Splitting occurs when a leaf cluster
matches more than some threshold of traffic and its traffic is neither too similar nor too dissimilar. Similarity in ADHIC
is measured by searching for a (p, n)-gram that occurs in roughly half of the packets. These statistics are measured
over a period called the maturation window. Nodes which have been modified over the last maturation window
cannot be split or deleted. Deletion occurs when a node has not been assigned traffic over the maturation window.
The child receiving all the traffic replaces the parent of the ignored node.
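The two operators can be sketched in Python as below, under simplifying assumptions. The `tolerance` band around one half, the dictionary-based tree layout, and the names `choose_split` and `prune` are all hypothetical. The sketch also omits the rule that recently modified nodes are exempt for one maturation window.

```python
from collections import Counter


def choose_split(sample, n, max_offset, tolerance=0.2):
    """Pick the (p, n)-gram whose frequency is closest to one half.

    Returns None when no gram falls within `tolerance` of 0.5, i.e.
    the sampled traffic is too similar or too dissimilar to split.
    """
    counts = Counter()
    for pkt in sample:
        for p in range(min(max_offset, len(pkt) - n) + 1):
            counts[(p, pkt[p:p + n])] += 1
    if not counts:
        return None
    total = len(sample)
    gram, count = min(counts.items(),
                      key=lambda kv: abs(kv[1] / total - 0.5))
    if abs(count / total - 0.5) > tolerance:
        return None
    return gram


def prune(node):
    """Delete leaves that saw no traffic in the maturation window.

    Leaves are dicts with a "packets" count; decision nodes are dicts
    with "gram", "match", and "miss". The sibling that received all
    the traffic takes the place of the parent decision node.
    """
    if "gram" not in node:
        return node
    match, miss = prune(node["match"]), prune(node["miss"])
    if "gram" not in match and match["packets"] == 0:
        return miss
    if "gram" not in miss and miss["packets"] == 0:
        return match
    return {**node, "match": match, "miss": miss}


sample = [b"GET /a", b"GET /b", b"POST /", b"POST /"]
print(choose_split(sample, n=3, max_offset=0))

tree = {"gram": (0, b"GET"),
        "match": {"name": "web", "packets": 0},
        "miss": {"name": "default", "packets": 12}}
print(prune(tree))
```

When several grams are equally close to one half, this sketch keeps the first one encountered; a real implementation may break ties differently.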

3. Experiments and Results


In earlier work, we investigated ADHIC’s ability to mitigate denial-of-service attacks. When network bandwidth
is divided by cluster, only a small percentage of traffic is affected by an attack, assuming attacks can be segregated
(4).
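A small numeric illustration of this effect follows, assuming the simplest possible policy in which every cluster is capped at an equal share of the link. The equal-share policy, the function name `delivered`, and all numbers are assumptions for illustration; the cited report may allocate bandwidth differently.

```python
def delivered(capacity, demand):
    """Cap each cluster's traffic at an equal share of the capacity.

    `demand` maps a cluster name to its offered load, in the same
    units as `capacity`. Equal static shares are an assumption made
    for this sketch.
    """
    share = capacity / len(demand)
    return {cluster: min(load, share) for cluster, load in demand.items()}


# Hypothetical loads in Mbit/s; an attack floods the "default" cluster.
demand = {"web": 20.0, "mail": 5.0, "dns": 1.0, "default": 500.0}
out = delivered(100.0, demand)
for cluster, rate in out.items():
    print(cluster, rate)
```

In this toy case the flood is confined to its own 25 Mbit/s share, and the three other clusters still receive their full demand.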
To evaluate ADHIC in a broader context, we implemented a packet analysis tool called NetADHICT (pronounced
“net addict”) (5). NetADHICT can analyze data as it is received by a network interface, or it can analyze capture
files offline. Sampled packets are used to generate and update an ADHIC decision tree. At any given point in time,
NetADHICT has a cluster tree that embodies the high-level structure of current network traffic.
In experiments using four independent week-long datasets from our lab’s production network, we have found
that ADHIC can quickly capture the overall structure of traffic. The inferred structure corresponds to typical
divisions of network traffic (e.g., TCP vs. UDP, web vs. non-web traffic), arrived at using (p, n)-grams
that are generally meaningful only within the context of a given environment. This structure directly reflects
the relative popularity of different uses of the network’s bandwidth. Further, the use of non-standard ports for
protocols has little effect on how packets are clustered (i.e., they are grouped as if they used the standard port
for the application). Even applications that purposely disguise themselves by using reserved port numbers (e.g.
80) or non-standard changing port numbers, such as the BitTorrent peer-to-peer file sharing protocol, can also be
clustered appropriately, all without requiring any protocol-specific information. Similarly, encrypted traffic is
often clustered appropriately because all other traffic is segregated away from it.
An adversary wishing to manipulate the ADHIC decision tree faces several obstacles. First, the adversary must guess the
(p, n)-grams within the nodes. This is not impossible, but it is difficult because the (p, n)-grams are often idiosyncratic to
the network’s specific traffic composition, particularly in deeper nodes. Furthermore, because ADHIC is volume-based,
adversaries must send many packets to considerably influence splits. This becomes easier for nodes deeper
in the tree, but deeper nodes are also less likely to affect general behavior. Finally, the decision tree is easily
comprehended by system administrators and can be manually manipulated, allowing for direct interventions when
the algorithm fails.
Thus, ADHIC, through its implementation NetADHICT, provides a powerful and efficient means for clustering
network traffic. ADHIC is useful both for investigating network behavior and for mitigating some attacks.

References
[1] Duda, R. O., Hart, P. E., and Stork, D. G. Pattern Classification, 2 ed. Wiley, 2001, ch. Unsupervised
Learning and Clustering, pp. 517–599.
[2] Erman, J., Mahanti, A., and Arlitt, M. Internet traffic identification using machine. In Proceedings of
IEEE GlobeCom (2006).
[3] Hijazi, A., Inoue, H., Matrawy, A., van Oorschot, P., and Somayaji, A. Towards understanding
network traffic through whole packet analysis. Tech. Rep. TR-07-06, School of Computer Science, Carleton
University, 2007.
[4] Hijazi, A., Inoue, H., van Oorschot, P., and Somayaji, A. Diversity-based traffic management.
Tech. rep., Carleton University - prepared for the Communications Security Establishment, 2006.
[5] Inoue, H., Jansens, D., Hijazi, A., and Somayaji, A. NetADHICT: A tool for understanding network traffic.
In Proceedings of the 21st Large Installation System Administration Conference (LISA’07) (Nov 2007).
[6] Matrawy, A., van Oorschot, P., and Somayaji, A. Mitigating network denial-of-service through
diversity-based traffic management. In Applied Cryptography and Network Security (ACNS’05) (2005), Springer
Science+Business Media, pp. 104–121.
[7] Williams, N., Zander, S., and Armitage, G. A Preliminary Performance Comparison of Five Machine
Learning Algorithms for Practical IP Traffic Flow Classification. ACM SIGCOMM Computer Communications
Review (October 2006).
