1. INTRODUCTION
A log file is a file that records the events that take place while an operating system or
other software runs. Event logs record events taking place in the execution of a system in
order to provide an audit trail that can be used to understand the activity of the system and
to diagnose problems. They are essential for understanding the activities of complex systems,
particularly in the case of applications with little user interaction (such as server
applications). System administrators can trawl through event logs to detect critical alerts and
failures and fix them.[2]
1.1 Supercomputer Logs
Figure 1.1: Sample entries of a log file.
Supercomputers generate event logs with millions of entries. This makes it difficult for an
administrator to identify possible alerts manually. Hence, there is a need to streamline the
process for convenient detection of alerts. To this end, extraction of message types is to be done.
A message type is basically a cluster of similar event descriptions. Message type descriptions are
the templates on which the individual unstructured messages in any event log are built.
Extraction of message types serves several purposes. Message types can serve as abstractions for
compression of log files, visualization of alerts occurring in the system and simplifying the
search procedure in log files.
1.2 Message Type Extraction
Message types can be defined by understanding the following example:
Consider the following events of a log file
Figure 1.2: Entries of an event log.
The entries can be grouped under a cluster "generating *", as among all the entries the token
"generating" remains constant while the other token varies and is hence generalized using *. At
the outset, message type extraction is a classic clustering problem. Many existing tools
do accomplish this objective but with either little accuracy or poor performance. To this end, a
novel multi-level clustering technique is proposed which scales well on large datasets and
provides accurate message type descriptions [1]. This model is extended to show the
visualization of the alerts in the file.
Event logs generated by applications that run on a system consist of independent lines of text
data, which contain information that pertains to events that occur within a system. This makes
them an important source of information to system administrators in fault management and for
intrusion detection and prevention. With regard to autonomic systems, these two tasks are
important cornerstones for self-healing and self-protection, respectively. Therefore, to move
towards the goal of building systems that are capable of self-healing and self-protection, an
important step would be to build systems that are capable of automatically analyzing the contents
of their log files, in addition to measured system metrics, to provide useful information to the
system administrators.
1.3 Motivation
Extraction of message types makes it possible to abstract the unstructured content of event
logs, which constitutes a key challenge to achieving fully automatic analysis of system logs.
Message type descriptions are the templates on which the individual unstructured messages in
any event log are built. Message types can abstract the contents of system logs. We can therefore
use them to obtain more concise and compact representations of log entries. This leads to
memory and space savings. Each unique message type can be assigned an Identifier Index (ID),
which in turn can be used to index historical system logs leading to faster searches. The building
of computational models on the log data, which usually requires the input of structured data, can
be facilitated by the initial extraction of message type information. Message types are used to
impose structure on the unstructured messages in the log data before they are used as input into
the model building algorithm. Visualization is an important component of the analysis of large
data sets. Visualization of the contents of system logs can be made more meaningful to a human
observer by using message types as a feature of the visualization. For the visualization to be
meaningful to a human observer, the message types must be interpretable. This fact provides a
strong incentive for the production of message types that have meaning to a human observer.
1.4 Objective
The objective is to create a tool that takes a log file as input, performs preprocessing on it,
and then performs the message type extraction. The results obtained are presented using
statistical tools such as pie charts and runtime curves. Among the message types, those which are
alerts are selected and monitored by viewing the messages in those clusters.
1.5 Scope
• Works on 3 datasets (as of now) but can be extended to other log files after making
  minor modifications.
• Displays runtime curves for a given dataset and also pie charts showing the alerts.
1.6 Literature Survey
Chapter 1 gives an overview of the project problem statement. It includes an introduction to the
domain, and the motivation, scope and objective of the project.
Chapter 2 describes the literature survey of the project, where the terminologies involved in
the project, various vulnerabilities and information about the third party tools used are
mentioned.
Chapter 3 gives an overview of the system required for the project, where the problem
specification, various modules and their functionalities, and software and hardware requirements
are mentioned.
Chapter 4 gives the design of the project, where the introduction and UML diagrams involved in
the project are specified.
Chapter 5 describes the implementation of the project. This includes classes and methods used,
and algorithms implemented with examples.
Chapter 6 includes the results of the project execution and its analysis.
Chapter 7 gives the conclusions and future work related to the project.
2. LITERATURE SURVEY
2.1 Previous Work
Data clustering as a technique in data mining or machine learning is a process whereby
entities are sorted into groups called clusters, where members of each cluster are similar to each
other and dissimilar from members of other groups. Clustering can be useful in the interpretation
and classification of data sets too large to analyze manually. Clustering therefore can be a useful
first step in the automatic analysis of event logs. If each textual line in an event log is considered
a data point and its individual words considered attributes, then the clustering task reduces to one
in which similar log messages are grouped together.
2.1.1 Existing Techniques
While several algorithms like CLIQUE[3], CURE, and MAFIA[5] have been designed for
clustering high dimensional data, these algorithms are still not quite suitable for log files because
an algorithm suitable for clustering event logs needs to not just be able to deal with
high-dimensional data, but it also needs to be able to deal with data with different attribute types.
On the other hand, SLCT and Loghound are two algorithms which were designed specifically for
automatically clustering log files and discovering event formats. Because both SLCT and
Loghound are similar to the Apriori algorithm, they require the user to provide a support
threshold value as input.
2.1.2 SLCT
SLCT works through a three step process. It first identifies the frequent words (words that
occur more frequently than a support threshold value) or 1-item sets from the data. It then
extracts the combinations of these 1-item sets that occur in each line in the data set. These 1-item
set combinations are cluster candidates. Finally, those cluster candidates that occur more
frequently than the support value are selected as the clusters in the data set. Risto Vaarandi's
SLCT uses an algorithm specifically designed to detect word clusters in log messages. It makes
three passes through the data to accomplish this objective. A hash counting all words and their
position in the line is generated on the first pass through the data ("the dog ran" is hashed into
three keys: 1.the, 2.dog, 3.ran). Words having a support less than s are then pruned from the
hash, and a new hash of message word clusters is generated during a second pass through the
data (the messages "the dog ran" and "the deer ran" would generate a key of 1.the 2.* 3.ran for
s=2; the second word is the wild card "*" since dog and deer only appear once). An optional
third pass can be performed in which wild card positions are refined with constant heads or tails
if possible (in our example, 2.* becomes 2.d* because both dog and deer begin with d). The
resulting word clusters and their support are output, and any lines not matching any word cluster
are saved to a separate file for review ("outlier" lines).[4]
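The first two passes described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Vaarandi's implementation: the function name `slct_clusters` is hypothetical, tokens are split on whitespace, the optional third refinement pass is omitted, and lines containing no frequent word are simply skipped rather than written to an outlier file.

```python
from collections import Counter


def slct_clusters(lines, support):
    """Simplified two-pass, SLCT-style word clustering (illustrative only)."""
    messages = [line.split() for line in lines]

    # Pass 1: count every (position, word) pair, e.g. (1, "the"), (2, "dog").
    word_counts = Counter(
        (pos, word)
        for tokens in messages
        for pos, word in enumerate(tokens, start=1)
    )

    # Pass 2: build one cluster candidate per line, replacing words whose
    # support is below the threshold with the wildcard "*".
    candidates = Counter()
    for tokens in messages:
        key = tuple(
            word if word_counts[(pos, word)] >= support else "*"
            for pos, word in enumerate(tokens, start=1)
        )
        if all(word == "*" for word in key):
            continue  # no frequent word at all: not a candidate in this sketch
        candidates[key] += 1

    # Keep only the candidates that themselves meet the support threshold.
    return {" ".join(key): n for key, n in candidates.items() if n >= support}


clusters = slct_clusters(["the dog ran", "the deer ran"], support=2)
```

With the two example messages from the text and s=2, the only surviving candidate is the pattern with a wildcard in the second position.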
2.1.3 Loghound
Loghound on the other hand discovers frequent patterns from event logs by utilizing a
frequent item set mining algorithm, which mirrors the Apriori algorithm more closely than SLCT
because it works by finding item sets which may contain more than 1 word up to a maximum
value provided by the user. With both SLCT and Loghound, lines that do not match any of the
frequent patterns discovered are classified as outliers. The shortcomings of Loghound and SLCT
are twofold. Firstly, they both focus on finding only frequent message patterns in log data but
not infrequent patterns. While this might suffice most times, it may sometimes be necessary to
also find infrequent patterns for analysis. Infrequent patterns may be more interesting to find in
applications such as anomaly detection. Secondly comes the issue of semantics. Patterns found
by Loghound and SLCT are all valid but may not necessarily make sense to a human observer.
This observation becomes relevant if the patterns found will be used in a visualization tool such
as LogView. It is therefore important to extend the work of tools like Loghound and SLCT by
designing an algorithm that will allow the discovery of infrequent patterns and also patterns that
are meaningful to a human observer.
2.2 Datasets
In this work, log files of 3 different supercomputers (LA HPC, Blue Gene/P and Blue Gene/L) are
used for evaluation.
2.2.1 Blue Gene
Blue Gene is an IBM project aimed at designing supercomputers that can reach operating
speeds in the PFLOPS (petaFLOPS) range, with low power consumption. The project created
three generations of supercomputers, Blue Gene/L, Blue Gene/P, and Blue Gene/Q. Blue Gene
systems have often led the TOP500 and Green500 rankings of the most powerful and most power
efficient supercomputers, respectively. Blue Gene systems have also consistently scored top
positions in the Graph500 list. The project was awarded the 2009 National Medal of Technology
and Innovation. The Blue Gene/P dataset[6] used in the current work has 4.7 million entries,
while the Blue Gene/L dataset has 1.7 million entries. The entries have been collected over a six
month period. The dataset size is large enough to pose a data mining problem. The data consists
of RAS log messages collected over a period of 6 months on the Blue Gene/P Intrepid system.
Each message in the log contains 15 fields as follows: RECID, MSG_ID, COMPONENT,
SUBCOMPONENT, ERRCODE, SEVERITY, EVENT_TIME, FLAG, PROCESSOR, NODE,
BLOCK, LOCATION, SERIALNUMBER, ECID, MESSAGE.
2.2.2 LA HPC-1
The HPC-1 is a supercomputer located at the Los Alamos National Laboratory. Its dataset
has a total of 0.4 million entries.
2.3 C#
C# (pronounced as see sharp) is a multi-paradigm programming language encompassing
strong typing, imperative, declarative, functional, procedural, generic, object-oriented
(class-based), and component-oriented programming disciplines. It was developed by Microsoft
within its .NET initiative and later approved as a standard by Ecma (ECMA-334) and ISO (ISO/IEC
23270:2006). C# is one of the programming languages designed for the Common Language
Infrastructure. C# is built on the syntax and semantics of C++, allowing C programmers to take
advantage of .NET and the common language runtime. C# is intended to be a simple, modern,
general-purpose, object-oriented programming language. Its development team is led by Anders
Hejlsberg. At the time of writing, the most recent version is C# 5.0, which was released on
August 15, 2012. C# is the programming language that most directly reflects the underlying
Common Language Infrastructure (CLI). Most of its intrinsic types correspond to value-types
implemented by the CLI framework. However, the language specification does not state the code
generation requirements of the compiler: that is, it does not state that a C# compiler must target a
Common Language Runtime, or generate Common Intermediate Language (CIL), or generate any
other specific format. Theoretically, a C# compiler could generate machine code like traditional
compilers of C++ or Fortran. Some notable features of C# that distinguish it from C and C++
(and Java, where noted) are:
• C# supports strongly typed implicit variable declarations with the keyword var, and
  implicitly typed arrays with the keyword new[] followed by a collection initializer.
• Meta programming via C# attributes is part of the language. Many of these attributes
  duplicate the functionality of GCC's and VisualC++'s platform-dependent preprocessor
  directives.
• Like C++, and unlike Java, C# programmers must use the keyword virtual to allow
  methods to be overridden by subclasses.
• Extension methods in C# allow programmers to use static methods as if they were
  methods from a class's method table, allowing programmers to add methods to an object
  that they feel should exist on that object and its derivatives.
• The type dynamic allows for run-time method binding, allowing for JavaScript-like
  method calls and run-time object composition.
• C# has support for strongly-typed function pointers via the keyword delegate.
• Like the Qt framework's pseudo-C++ signal and slot, C# has semantics specifically
  surrounding publish-subscribe style events, though C# uses delegates to do so.
• C# offers Java-like synchronized method calls, via the attribute
  [MethodImpl(MethodImplOptions.Synchronized)], and has support for mutually-exclusive
  locks via the keyword lock.
• The C# language does not allow for global variables or functions. All methods and
  members must be declared within classes. Static members of public classes can
  substitute for global variables and functions.
• Local variables cannot shadow variables of the enclosing block, unlike C and C++.
• A C# namespace provides the same level of code isolation as a Java package or a C++
  namespace, with very similar rules and features to a package.
• C# supports a strict Boolean data type, bool. Statements that take conditions, such as
  while and if, require an expression of a type that implements the true operator, such as
  the boolean type. While C++ also has a boolean type, it can be freely converted to and
  from integers, and expressions such as if(a) require only that a is convertible to bool,
  allowing a to be an int, or a pointer. C# disallows this "integer meaning true or false"
  approach, on the grounds that forcing programmers to use expressions that return
  exactly bool can prevent certain types of programming mistakes common in C or C++
  such as if (a = b) (use of assignment = instead of equality ==).
• In C#, memory address pointers can only be used within blocks specifically marked as
  unsafe, and programs with unsafe code need appropriate permissions to run. Most
  object access is done through safe object references, which always either point to a
  "live" object or have the well-defined null value; it is impossible to obtain a reference to
  a "dead" object (one that has been garbage collected), or to a random block of memory.
  An unsafe pointer can point to an instance of a value-type, array, string, or a block of
  memory allocated on a stack. Code that is not marked as unsafe can still store and
  manipulate pointers through the System.IntPtr type, but it cannot dereference them.
• Managed memory cannot be explicitly freed; instead, it is automatically garbage
  collected. Garbage collection addresses the problem of memory leaks by freeing the
  programmer of responsibility for releasing memory that is no longer needed.
2.4 WPF
Windows Presentation Foundation (or WPF) is a graphical subsystem for rendering user
interfaces in Windows-based applications by Microsoft. WPF, previously known as "Avalon",
was initially released as part of .NET Framework 3.0. Rather than relying on the older GDI
subsystem, WPF uses DirectX. WPF attempts to provide a consistent programming model for
building applications and separates the user interface from business logic. It resembles similar
XML-oriented object models, such as those implemented in XUL and SVG. WPF employs
XAML, an XML-based language, to define and link various interface elements. WPF
applications can also be deployed as standalone desktop programs, or hosted as an embedded
object in a website. WPF aims to unify a number of common user interface elements, such as
2D/3D rendering, fixed and adaptive documents, typography, vector graphics, runtime
animation, and pre-rendered media. These elements can then be linked and manipulated based on
various events, user interactions, and data bindings. WPF runtime libraries are included with all
versions of Microsoft Windows since Windows Vista and Windows Server 2008. Users of
Windows XP SP2/SP3 and Windows Server 2003 can optionally install the necessary libraries.
2.5 XAML
Extensible Application Markup Language (XAML) is a declarative XML-based language
developed by Microsoft that is used for initializing structured values and objects. It is available
under Microsoft's Open Specification Promise. The acronym originally stood for Extensible
Avalon Markup Language, Avalon being the code-name for Windows Presentation Foundation
(WPF). XAML is used extensively in .NET Framework 3.0 and .NET Framework 4.0
technologies, particularly Windows Presentation Foundation (WPF), Silverlight, Windows
Workflow Foundation (WF) and Windows Runtime XAML Framework and Windows Store
apps. In WPF, XAML forms a user interface markup language to define UI elements, data
binding, eventing, and other features. In WF, workflows can be defined using XAML. XAML
can also be used in Silverlight applications, Windows Phone apps and Windows Store apps.
XAML elements map directly to Common Language Runtime object instances, while XAML
attributes map to Common Language Runtime properties and events on those objects. XAML
files can be created and edited with visual design tools like Microsoft Expression Blend,
Microsoft Visual Studio, and the hostable Windows Workflow Foundation visual designer. They
can also be created and edited with a standard text editor, a code editor like XAMLPad, or a
graphical editor like Vector Architect. Anything that is created or implemented in XAML can be
expressed using a more traditional .NET language, such as C# or Visual Basic.NET. However, a
key aspect of the technology is the reduced complexity needed for tools to process XAML,
because it is based on XML. Consequently, a variety of products are emerging, particularly in the
WPF space, which create XAML-based applications. As XAML is simply based on XML,
developers and designers are able to share and edit content freely amongst themselves without
requiring compilation. Since it is strongly linked to the .NET Framework 3.0 technologies, the
only fully compliant implementation at present is Microsoft's.
2.6 WPF Toolkit
WPF Toolkit is a collection of WPF controls, components and utilities for creating Windows
applications. It provides controls for creating pie charts, bar charts, histograms etc. Like other
WPF controls, they use XAML for creation and for specifying their properties.
3. SYSTEM ANALYSIS
3.1 Terminology
To understand the problem at hand, it is important to define the following terms.
3.1.1 Event log
A text-based audit trail of events that occur within the system or application processes on a
computer system.
3.1.2 Event
An independent line of text within an event log which details a single occurrence on the
system. An event typically contains not only a message but other fields of information like a
Date, Source, and Tag. For message type extraction, we are only interested in the message field
of the event. This is why events are sometimes referred to in the literature as messages. In Figure
3.1, the first five fields (delimited by whitespace) represent the Timestamp, Host, Class, Facility,
and Severity of each event. We omit these types of fields from the message type extraction
process as they are already sufficiently structured. However, they are still useful for further log
analysis, e.g., the Timestamp and Host fields for time series analysis of the unique message types
extracted.
Figure 3.1: General structure of a log entry.
3.1.3 Token
A single word delimited by white space within the message field of an event. The tokens and
the relationships between them are considered while clustering the events.
3.1.4 Event size
The number of individual tokens in the "message" field of an event. The event size is one of
the heuristics used while clustering the logs.
3.1.5 Message Type
These are message fields of entries within an event log produced by the same print statement.
Non-overlapping consecutive pairs of lines in the log belong to the same event cluster. Due to the
subjectivity of determining what constitutes a message type, it is possible that a human observer
might consider messages produced by a single print statement as belonging to different message
types or treat messages produced by different print statements as belonging to the same message
type. It is also possible that the same print statement is present in different parts of the code,
producing different message types with the same message type description. However, we
consider these scenarios as relatively rare, so we will use this definition for the sake of
simplicity.
3.1.6 Constant Token
A token within the message field of an event which is not represented by a wildcard value in
its associated message type description.
3.1.7 Variable Token
A token within the message field which is represented by a wildcard ("*") and is part of the
message type description.
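The distinction between constant and variable tokens can be illustrated with a short Python sketch. It assumes that a cluster is given as a list of messages of equal event size and that tokens are split on whitespace; the function name `message_type_description` is hypothetical and this is not the tool's actual code.

```python
def message_type_description(messages):
    """Build a template: constant tokens are kept, variable tokens become '*'."""
    token_rows = [message.split() for message in messages]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("all messages must have the same event size")

    template = []
    # zip(*rows) walks the cluster column by column, i.e. per token position.
    for column in zip(*token_rows):
        is_constant = len(set(column)) == 1
        template.append(column[0] if is_constant else "*")
    return " ".join(template)


description = message_type_description(
    ["Connection from 255.255.255.255", "Connection from 0.0.0.0"]
)
```

Here "Connection" and "from" are constant tokens, while the address in the third position is a variable token and is replaced by the wildcard.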
3.2 Problem
The problem can be defined as follows: given a log file $L$ consisting of entries
$\{L_1, L_2, L_3, \ldots\}$, we need to extract message types $M = \{M_1, M_2, M_3, \ldots\}$,
where each $M_i$ represents a unique non-empty subset of $L$. Among all such subsets, we need
to find the largest and represent them visually using charts.
3.3 Modules
There are mainly 3 modules in the application.
3.3.1 Pre-processing Module
In this module, the user can select the log file as input and perform pre-processing.
Pre-processing removes all the irrelevant attributes and only keeps the message part of the entries.
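A minimal Python sketch of this pre-processing step is shown below. It assumes a log layout like the one described for Figure 3.1, where the first five whitespace-delimited fields are structured attributes and the rest of the line is the message; the function names, the `header_fields` parameter and the sample entry are hypothetical, and real datasets with a different field layout would need a different rule.

```python
def extract_message(entry, header_fields=5):
    """Drop the leading structured fields and return the free-text message."""
    # Split at most `header_fields` times so the message itself stays intact.
    parts = entry.split(None, header_fields)
    return parts[header_fields].strip() if len(parts) > header_fields else ""


def preprocess(entries, header_fields=5):
    """Keep only the non-empty message part of every log entry."""
    messages = (extract_message(entry, header_fields) for entry in entries)
    return [message for message in messages if message]


# Made-up sample entry: Timestamp, Host, Class, Facility, Severity, message.
sample = ["1117838570 R02-M1 RAS KERNEL INFO instruction cache parity error corrected\n"]
messages = preprocess(sample)
```

Entries that have no message part after the structured fields are dropped rather than passed on to the clustering step.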
3.3.2 Clustering Module
In this module, the pre-processed dataset is given to the mining algorithms. There are 4
algorithms which are called sequentially one after another: Partition By Event Size, Partition
By Token Position, Partition By Bijection, and Message Type Extraction. Each algorithm generates a
series of partitions that are given to the next algorithm. At the end of all the algorithms, the
required message types are generated. This module is space and time intensive and involves the
use of several data structures.
3.3.3 Visualization Module
This module involves taking the message types extracted by the Clustering module and
applying visualization techniques to better understand the results. This involves using pie charts
and curves for runtime analysis of operations.
3.4 Requirements
Requirements are categorized into two types: Hardware and Software requirements.
3.4.1 Hardware Requirements
• Pentium Dual-Core Processor
• 2 GB RAM
• 5 GB storage space
3.4.2 Software Requirements
• .NET 4.5 SDK and runtime
• WPF toolkit (for visualization)
• Visual Studio 2013
• Windows 8
• StarUML (for design)
• LTFViewer (for viewing the datasets)
4. SYSTEM DESIGN
4.1 Design
Design can be described with the help of UML diagrams.
4.1.1 Use Case Diagram
Figure 4.1: Use case diagram of the system
The only actor involved here is the administrator, who is interested in accessing the message
types so that he can get to the root of the problems in the system.
The use cases involved in Figure 4.1 are:
• Preprocess - calls the pre-processing module to remove irrelevant attributes from the
  selected dataset.
• Extract Message Types - calls the clustering module to perform IPLoM on the
  pre-processed dataset.
• View Clusters - the extracted message types and messages belonging to each type can be
  viewed in a hierarchical manner.
• View Graphs - the administrator can view the most frequently occurring alerts using a pie
  chart and also see the run-time performance of the algorithms.
4.1.2 Class Diagram
The class diagram shows the relationships and abstractions involved in mining the log
files. The various classes in Figure 4.2 are:
• MainWindow - this is the main GUI where the user selects the dataset and starts the
  operations.
• Partition - this encompasses the algorithms which preprocess and then cluster the entries
  in the dataset. It is invoked in the MainWindow class and hence has an association
  relationship.
• TokenPosition - each object of this class stores the tokens occurring in the corresponding
  token position and calculates the cardinality required for the Partition By Token algorithm.
• Pair - objects of this class are also used in the Partition By Token algorithm. Each object
  stores the token position chosen and the number of unique tokens in that position.
• BijectionPair - objects of this class are used in Partition By Search for Bijection. Each
  object considers two token positions, stores the tokens occurring in those positions and
  determines whether there exists a bijective relationship.
• Bijection - this object stores the bijective relationship between two positions (if it exists)
  along with the set of unique tokens in each position.
• NonBijection - if there exists no bijective relationship in a partition, then this object is
  used to create an outlier partition and place the entries in it.
• MessageType - each object stores the final value for each token position after clustering.
• Window1 - this generates the pie charts which show the classification of messages
  according to event size and the most frequently occurring alerts.
• Window2 - this window shows each message type and the messages belonging to it.
Figure 4.2: Class diagram
4.1.3 Activity Diagrams
Two activity diagrams are shown in Figure 4.3 and Figure 4.4.
Figure 4.3: Activity diagram for Partition by event
Figure 4.4: Activity diagram for Partition by Token operation
Figure 4.3 shows the activity diagram for the Partition by event operation. Figure 4.4 shows
the activity diagram for the Partition by token operation. The operations are as explained in the
methodology.
4.1.4 Sequence Diagram
Figure 4.5: Sequence diagram
Figure 4.5 shows the sequence diagram. An instance of MainWindow starts the process
by creating the Partition object p, which in turn sequentially invokes its own operations such as
partitionByEventSize(), partitionByTokenPosition() etc. p then indicates the completion of the
clustering. The MainWindow object w then invokes the visualization module and the separate
view of the clusters.
4.1.5 Component Diagram
Figure 4.6: Component diagram
Figure 4.6 shows the component diagram, which shows interaction between groups of
classes. The initial GUI and event handlers are grouped into the MainWindow component. The
Partition component is the group of classes which accomplish the clustering. It consists of
several classes like TokenPosition, Bijection, NonBijection, MessageTypes etc. It realizes the
required interface IPLOM. The Visualization component encompasses the set of classes required
for displaying the statistics and charts. The MainWindow has connectors to the Partition and
Visualization modules.
5. IMPLEMENTATION
5.1 Methodology
The algorithms involved in the clustering module are illustrated below:
5.1.1 Partition by Event Size
The pre-processed data set is read line by line. Messages with the same event size are
grouped into one partition. Thus, several partitions are created. The reasoning behind this is that
messages belonging to the same message type have the same number of tokens. This is
illustrated in Figure 5.1.
Figure 5.1: Illustration of partition by event size, creating 3 partitions for 3 messages
The first step of the partitioning process works on the assumption that log messages that have the
same message type description are likely to have the same event size. For this reason, IPLoM's
first step uses the event size heuristic to partition the log messages. By partition, we mean
non-overlapping groupings of the messages. Additional heuristic criteria are used in the remaining
steps to further divide the initial partitions. The partitioning process induces a hierarchy of
maximum depth 4 on the messages and the number of nodes on each level is data dependent.
Consider the cluster description "Connection from *", which contains three tokens. It can be
intuitively concluded that all the instances of this cluster, e.g., "Connection from
255.255.255.255" and "Connection from 0.0.0.0", would also contain the same number of tokens.
By partitioning our data first by event size, we are taking advantage of the property of most
cluster instances of having the same event size. Therefore, the resultant partitions of this heuristic
are likely to contain the instances of the different clusters which have the same event size.
Sometimes, it is possible that clusters with events of variable size exist in the event log. Since
IPLoM assumes that messages belonging to the same cluster should have the same number of
tokens or event size, this step of the algorithm would separate such clusters. This does not occur
too often, and variable size message types can still be found by postprocessing IPLoM's results.
The process of finding variable size message types can be computationally expensive.
Nevertheless, performing this process on the templates produced by IPLoM rather than on the
complete log would require less computation.
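The event size heuristic can be sketched in a few lines of Python. This is an illustrative sketch only, assuming whitespace tokenization of pre-processed messages; the function name `partition_by_event_size` is hypothetical and does not refer to the project's C# method.

```python
from collections import defaultdict


def partition_by_event_size(messages):
    """Group tokenized messages into partitions keyed by their event size."""
    partitions = defaultdict(list)
    for message in messages:
        tokens = message.split()
        partitions[len(tokens)].append(tokens)  # event size = number of tokens
    return dict(partitions)


partitions = partition_by_event_size(
    [
        "Connection from 255.255.255.255",
        "Connection from 0.0.0.0",
        "Link down",
    ]
)
```

Both "Connection from ..." messages land in the partition for event size 3, while the two-token message forms its own partition. Because every message falls into exactly one partition, the groupings are non-overlapping by construction.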
5.1.2 Partition By Token Position
At this point, each partition of the log data contains log messages which are of the same
size and can therefore be viewed as n-tuples, with n being the event size of the log messages in
the partition. This step of the algorithm works on the assumption that the column with the least
number of variables (unique words) is likely to contain words which are constant in that position
of the message type descriptions that produced them. Our heuristic is therefore to find the token
position with the least number of unique values and further split each partition using the unique
values in this token position, i.e., each resultant partition will contain only one of those unique
values in the token position discovered, as can be seen in the example outlined in Figure 5.2.
Figure 5.2: Selecting the token position for partitioning in partition by token position operation.
The algorithm is elaborated in pseudocode below:
Algorithm 2. IPLoM Step 2: Selects the token position with the lowest cardinality and then
separates the lines in the partition based on the unique values in the token position. Backtracks
on partitions with lines that fall below the partition support threshold.
Input: Collection of log file partitions from Step-1.
Real number PST as partition support threshold. {Range for PST is assumed to be between 0 and
1.}
Output: Collection of log file partitions derived at Step-2.
1: for every log file partition do {Assume lines in each partition have same event size.}
2:   Determine token position P with lowest cardinality with respect to set of unique tokens.
3:   Create a partition for each token value in the set of unique tokens that appear in position P.
4:   Separate contents of partition based on unique token values in token position P into separate
     partitions.
5: end for
6: for each partition derived at Step-2 do
7:   if PSR < PST then
8:     Add lines from partition to Outlier partition
9:   end if
10: end for
11: Return() {Output is collection of pruned new partitions}
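The partitioning part of Algorithm 2 (lines 1 to 5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the C# implementation given in the appendix: the function name `partition_by_token_position` and the sample messages are hypothetical, and ties between token positions are resolved in favour of the leftmost position.

```python
from collections import defaultdict


def partition_by_token_position(partition):
    """Split one partition on its lowest-cardinality token position.

    `partition` is a list of token lists that all have the same length
    (same event size). Returns (0-based split position, child partitions).
    """
    if not partition:
        return None, {}
    size = len(partition[0])
    # Number of unique tokens seen at each token position.
    cardinality = [len({line[pos] for line in partition}) for pos in range(size)]
    # min() keeps the first minimum, so ties go to the leftmost position
    # (an assumption of this sketch).
    split_pos = min(range(size), key=lambda pos: cardinality[pos])
    children = defaultdict(list)
    for line in partition:
        children[line[split_pos]].append(line)
    return split_pos, dict(children)


# Hypothetical messages of event size 4.
sample = [m.split() for m in (
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Error reading block 7",
)]
split_pos, children = partition_by_token_position(sample)
```

Here positions 1, 2 and 4 each hold two unique tokens, so the first of them is chosen and the partition splits into a "Connection" child and an "Error" child.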
The memory requirement of unique token counting is a potential concern with the
algorithm. While the problem of unique token counting is not specific to IPLoM, it is believed
that IPLoM has an advantage in this respect. Since IPLoM partitions the database, only the
contents of the partition being handled need be stored in memory. This greatly reduces the
memory requirements of the algorithm. Moreover, other workarounds can be implemented to
further reduce the memory requirements. For example, in Step 2 of the algorithm, by
determining an upper bound (UB) on the lowest token count in Step 1, we can drastically reduce
the memory requirements of this step: counting of unique tokens can be abandoned for any
token position whose count exceeds the upper bound.
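One way to read the upper-bound workaround is sketched below in Python. How the bound is obtained is not specified in the text, so the sketch simply takes it as a parameter; `lowest_cardinality_position`, `upper_bound` and the sample messages are hypothetical names and data. The result is only meaningful if the bound really is at least as large as the lowest token count.

```python
def lowest_cardinality_position(partition, upper_bound):
    """Return the 0-based position with the fewest unique tokens.

    Tracking stops for any position whose unique-token count exceeds
    `upper_bound`, so its token set no longer has to be kept in memory.
    Returns None if every position exceeded the bound.
    """
    size = len(partition[0])
    seen = [set() for _ in range(size)]  # an entry becomes None once abandoned
    for line in partition:
        for pos, token in enumerate(line):
            if seen[pos] is None:
                continue
            seen[pos].add(token)
            if len(seen[pos]) > upper_bound:
                seen[pos] = None  # cannot be the minimum; release the set
    live = [(len(tokens), pos) for pos, tokens in enumerate(seen) if tokens is not None]
    # Tuples compare by count first, then by position (leftmost wins ties).
    return min(live)[1] if live else None


# Hypothetical messages of event size 5.
sample = [m.split() for m in (
    "job 101 started on node-a",
    "job 102 started on node-b",
    "job 103 failed on node-c",
    "job 104 started on node-d",
)]
best = lowest_cardinality_position(sample, upper_bound=2)
```

With a bound of 2, the job number and node name positions are dropped as soon as a third distinct value appears, while the remaining positions are counted to the end.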
Despite the fact that we use the token position with the least number of unique tokens, it
is still possible that some of the values in the token position might actually be variables in the
original message type descriptions. While an error of this type may have little effect on Recall, it
could adversely affect Precision. To mitigate the effects of this error, a partition support ratio
(PSR) for each partition produced could be introduced. The PSR is calculated with regard to
the original partition that it was derived from (see Figure 5.3). We can then define a partition
support ratio threshold (PST). We group any partition with a PSR that falls below the PST into one
partition (Algorithm 2). The intuition here is that a child partition that is produced using a variable
token value may not have enough lines to exceed a certain percentage (the partition support ratio
threshold) of the log messages in the parent partition. It should be noted that this threshold is not
necessary for the algorithm to function and is only introduced to give the system administrators
the flexibility to influence the partitioning based on expert knowledge they may have and avoid
errors in the partitioning process.
Figure 5.3: Equation for calculating partition support ratio.
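The pruning described above (lines 6 to 10 of Algorithm 2) might look as follows in Python. The equation of Figure 5.3 is not reproduced in the text, so this sketch assumes, following the intuition given above, that the PSR of a child partition is its number of lines divided by the number of lines in its parent partition. The names `prune_by_support`, `parent_size` and `pst` are hypothetical.

```python
def prune_by_support(children, parent_size, pst):
    """Move child partitions with a low support ratio into one outlier partition.

    `children` maps a token value to the lines of that child partition.
    Assumption of this sketch: PSR = lines in child / lines in parent.
    """
    kept, outliers = {}, []
    for token, lines in children.items():
        psr = len(lines) / parent_size
        if psr < pst:
            outliers.extend(lines)  # grouped into a single outlier partition
        else:
            kept[token] = lines
    return kept, outliers


# Hypothetical parent partition of 10 lines split on one token position.
children = {
    "started": [["job", "started"]] * 9,
    "failed": [["job", "failed"]],
}
kept, outliers = prune_by_support(children, parent_size=10, pst=0.2)
```

Setting `pst` to 0 disables the pruning, which matches the remark that the threshold is optional.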
5.1.3 Partition by Search for Bijection
Consider the messages below as a log partition.
Command has completed successfully
Command has been aborted
Command has been aborted
Command has been aborted
Command failed on starting.
This partition has event size equal to 4. We need to select two token positions to perform
the search for bijection on. The first token position has one unique token, {Command}. The
second token position has two unique tokens, {has, failed}. The third token position has three
unique tokens, {completed, been, on}, while the fourth token position has three unique tokens,
{successfully, aborted, starting}. We notice in this example that token count 3 appears most
frequently, twice, once in position 3 and once in position 4. The heuristic would therefore select
token positions 3 and 4 in this example. In the third and final partitioning step, we partition by
searching for bijective relationships between the sets of unique tokens in two token positions
selected using a heuristic, as described in the algorithm below.
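The position-selection heuristic worked through in this example can be sketched in Python as follows. It is an illustration rather than the appendix code: the function name `choose_bijection_positions` is hypothetical, positions are reported 1-based to match the text, and when two token counts are equally frequent the sketch simply takes the one encountered first, which is an assumption.

```python
from collections import Counter


def choose_bijection_positions(partition):
    """Pick the two token positions (1-based) for the bijection search.

    Returns None when no token count greater than 1 is shared by at
    least two positions.
    """
    size = len(partition[0])
    # Unique-token count of every token position.
    counts = [len({line[pos] for line in partition}) for pos in range(size)]
    # Only counts greater than 1 are candidates.
    frequency = Counter(c for c in counts if c > 1)
    if not frequency:
        return None
    best_count, _ = frequency.most_common(1)[0]
    positions = [pos + 1 for pos, c in enumerate(counts) if c == best_count]
    # The first two positions that carry the most frequent count are used.
    return tuple(positions[:2]) if len(positions) >= 2 else None


partition = [m.split() for m in (
    "Command has completed successfully",
    "Command has been aborted",
    "Command has been aborted",
    "Command has been aborted",
    "Command failed on starting.",
)]
selected = choose_bijection_positions(partition)
```

For the example partition the unique-token counts are 1, 2, 3 and 3, so the sketch selects positions 3 and 4, as in the text.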
Algorithm 3. IPLoM Step 3: Selects the two token positions and then separates the lines in the
partition based on the relational mappings of unique values in the token positions. Backtracks on
partitions with lines that fall below the partition support threshold.
Input: Collection of partitions from Step 2. {Partitions of event size 1 or 2 are not processed
here}
Real number CT as cluster goodness threshold. {Range for CT is assumed to be between 0-1.}
Output: Collection of partitions derived at Step-3.
1: for every log file partition do
2:   if CGR >= CT then {See (2)}
3:     Add partition to collection of output partitions
4:     Move to next partition.
5:   end if
6:   Determine token positions using heuristic as P1 and P2. {Heuristic is explained in the text. We
     assume token position P1 occurs before P2.}
7:   Determine mappings of unique token values in P1 with respect to token values in P2 and vice
     versa.
8:   if mapping is 1-1 then
9:     Create partitions for event lines that meet each 1-1 relationship.
10:  else if mapping is 1-M or M-1 then
11:    Determine variable state of M side of relationship.
12:    if variable state of M side is CONSTANT then
13:      Create partitions for event lines that meet relationship.
14:    else {variable state of M side is VARIABLE}
15:      Create new partitions for unique tokens in M side of the relationship.
16:    end if
17:  else {mapping is M-M}
18:    All lines that meet M-M relationships are placed in one partition.
19:  end if
20: end for
21: for each partition derived at Step-3 do
22:   if PSR < PST then
23:     Add lines from partition to Outlier partition
24:   end if
25: end for
26: Return() {Output is collection of pruned new partitions}
To summarize the steps of the heuristic, we first determine the number of unique tokens
in each token position of a partition. We then determine the most frequently occurring token
count among all the token positions. This value must be greater than 1. The token count that
occurs most frequently is likely indicative of the number of message types that exist in the
partition. If this is true, then a bijective relationship should exist between the tokens in the token
positions that have this token count. Once the most frequently occurring token count value is
determined, the token positions chosen will be the first two token positions that have a token
count value equivalent to the most frequently occurring token count. A bijective function is a 1-1
relation that is both injective and surjective. When a bijection exists between two elements in the
sets of tokens, this usually implies that a strong relationship exists between them, and log
messages that have these token values in the corresponding token positions are separated into a
new partition.
Figure 5.4: Searching for bijective relationship between two token positions
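The kinds of mapping that Algorithm 3 distinguishes can be illustrated with a short Python sketch. It is a simplification, not the appendix code: each observed token pair is labelled only by how many partners each of its two tokens has, and the CONSTANT/VARIABLE decision for the M side is left out because it depends on thresholds not described here. The name `classify_mappings` and the 0-based position arguments are hypothetical.

```python
from collections import defaultdict


def classify_mappings(partition, p1, p2):
    """Label every (token at p1, token at p2) pair as 1-1, 1-M, M-1 or M-M.

    Simplification: a pair is judged only by the number of partners of
    its own two tokens.
    """
    forward = defaultdict(set)   # token at p1 -> tokens seen with it at p2
    backward = defaultdict(set)  # token at p2 -> tokens seen with it at p1
    for line in partition:
        forward[line[p1]].add(line[p2])
        backward[line[p2]].add(line[p1])
    kinds = {}
    for first, seconds in forward.items():
        for second in seconds:
            many_right = len(forward[first]) > 1
            many_left = len(backward[second]) > 1
            if not many_right and not many_left:
                kinds[(first, second)] = "1-1"
            elif many_right and not many_left:
                kinds[(first, second)] = "1-M"
            elif many_left and not many_right:
                kinds[(first, second)] = "M-1"
            else:
                kinds[(first, second)] = "M-M"
    return kinds


partition = [m.split() for m in (
    "Command has completed successfully",
    "Command has been aborted",
    "Command has been aborted",
    "Command has been aborted",
    "Command failed on starting.",
)]
third_vs_fourth = classify_mappings(partition, 2, 3)
second_vs_third = classify_mappings(partition, 1, 2)
```

For the example partition of Section 5.1.3, the third and fourth token positions yield three 1-1 pairs, so each pair would form its own partition; comparing the second and third positions instead shows "has" in a 1-M relationship.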
5.1.4 Extraction of Message Types
In this step of the algorithm, partitioning is complete and we assume that each partition
represents a cluster, i.e., every log message in the partition was produced using the same line
format. A message type description or line format consists of a line of text where constant values
are represented literally and variable values are represented using wildcards. This is done by
counting the number of unique tokens in each token position of a partition. If a token position
has only one value then it is considered a constant value in the line format, while if it has more
than one then it is considered a variable. Since our goal is to find all message types that may
exist in an event log, or to ensure that the presence of every message type contained in an event log
is reflected in the message types produced, we are not concerned about the occurrence of
"outliers" interfering with the formats produced at this step. Hence, we set the threshold for
determining a variable token position as any token position with more than one unique token.
Figure 5.5: Extracting the message types by generalizing tokens in each position
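The generalization rule of this step is small enough to show directly. The Python sketch below assumes a partition is held as a list of equal-length token lists and uses `*` as the wildcard; the function name `extract_message_type` and the sample temperature readings are hypothetical.

```python
def extract_message_type(partition, wildcard="*"):
    """Build the line format of a cluster.

    A token position with exactly one unique value is kept literally;
    a position with more than one unique value becomes a wildcard.
    """
    size = len(partition[0])
    template = []
    for pos in range(size):
        values = {line[pos] for line in partition}
        template.append(next(iter(values)) if len(values) == 1 else wildcard)
    return " ".join(template)


cluster = [m.split() for m in (
    "Temperature 41 exceeds warning threshold",
    "Temperature 57 exceeds warning threshold",
    "Temperature 44 exceeds warning threshold",
)]
message_type = extract_message_type(cluster)
```

Only the second token varies across the cluster, so it is the only position replaced by the wildcard.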
5.2 Implementation
5.2.1 Packages
In C#, packages are referred to as namespaces. The default namespace is System. Some
of the namespaces used in this work are:
System - the default namespace, classes like String
System.Collections.Generic - data structures like Dictionary, List
System.Windows.Controls - WPF controls
System.Windows.Documents
System.IO - the classes for reading and writing files
System.Diagnostics - Stopwatch is used for measuring runtime
Microsoft.Windows.Controls
System.ComponentModel - BackgroundWorker is used for memory intensive operations
5.2.2 Classes and Methods
Some of the user-defined classes used here are:
Partition - this class encompasses the methods required for IPLoM. Some of the methods
are:
    preprocess() - removes the irrelevant attributes and keeps only the message part of the
    log file.
    partitionByEvent() - splits the preprocessed log file into several partitions (files) based
    on event size.
    partitionByToken() - splits the partitions generated in the above step into more partitions
    based on the token position heuristic explained earlier.
    partitionByBijection() - splits the partitions generated in the above step into more
    partitions based on the search for bijection heuristic.
    extractMessageTypes() - each partition thus generated is generalized to derive labels,
    which are called message types.
TokenPosition - each object represents a token position and holds a list of all unique
tokens in that position.
BijectionPair - each object is used to trace mappings between two token positions and
determine if a bijective relationship exists.
Bijection - if a bijection exists, then this object stores the properties of that relationship.
NonBijection - if no bijection exists, then this object stores details of that partition to
indicate it as an outlier.
Pair - in the partitionByToken() operation, it stores the position found and the cardinality.
MainWindow - this hosts the GUI which appears when the program launches. It mainly
consists of event handlers and background threads for running the clustering module.
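As an illustration of what the first partitioning method does, the following Python sketch groups messages by event size, i.e. by their number of tokens. It keeps the partitions in memory instead of writing one file per event size as the program does, and it splits on any run of whitespace, which is an assumption; the function name `partition_by_event_size` and the sample messages are hypothetical.

```python
from collections import defaultdict


def partition_by_event_size(messages):
    """Group tokenized messages by their number of tokens (event size)."""
    partitions = defaultdict(list)
    for message in messages:
        tokens = message.split()  # assumption: split on whitespace runs
        if tokens:  # skip empty lines
            partitions[len(tokens)].append(tokens)
    return dict(partitions)


# Hypothetical preprocessed messages.
messages = [
    "Warning",
    "Link errors remain current",
    "Command has been aborted",
    "",
]
by_size = partition_by_event_size(messages)
```

Each key of the result corresponds to one partition file of the first step.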
6. RESULTS AND ANALYSIS
6.1 Data Sets
We use 3 log files of supercomputers as datasets to discover the clusters and analyze
them. Table 6.1 shows an overview of the datasets.
Name of Dataset    No. of entries    Size of file
Bluegene/P         47,13,498         ~725 MB
Bluegene/L         16,95,371         ~1 GB
LA HPC-1           4,33,448          ~32 MB
Table 6.1. Overview of Datasets
6.2 Screenshots
In this section, we'll discuss the clustering process for each data set with screenshots.
6.2.1 Initial GUI
Figure 6.1: GUI of the program
Figure 6.1 shows the initial GUI presented to the user. Any one of the 3 datasets can be
selected and preprocessing is performed. On clicking "Preprocess", the program starts reading
the file related to the dataset. Each line is split into several strings using the space delimiter. Only
the relevant attributes are picked and stored entry-by-entry in a new file.
6.2.2 Preprocessing
Figure 6.2: Snapshot of raw log file
Figure 6.2 shows a snapshot of the HPC raw log file with several attributes. After clicking
on the Preprocess button as shown in Figure 6.1, preprocessing starts. In each HPC log file, each
field in an entry is separated from another by a comma. The first 6 attributes are removed and only
the message is extracted and written to a new file.
Figure 6.3: HPC log file after preprocessing
After preprocessing, the dataset changes from Figure 6.2 to Figure 6.3. It shows the
preprocessed HPC log file with only the event description. The file size is reduced from 32 MB
to 12 MB. This also helps improve the runtime for the mining algorithm.
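The preprocessing of an HPC entry can be sketched in Python as below. The field layout follows the description above (comma-separated, message after the first 6 attributes); the function name `extract_message`, its parameters and the sample line are hypothetical, and the sketch assumes that any commas after the sixth field belong to the message itself.

```python
def extract_message(raw_line, delimiter=",", skip_fields=6):
    """Drop the leading attributes of a raw log entry and keep the message.

    Splitting at most `skip_fields` times leaves the message intact even
    if it contains the delimiter. Returns "" for entries that are too short.
    """
    parts = raw_line.rstrip("\n").split(delimiter, skip_fields)
    return parts[skip_fields].strip() if len(parts) > skip_fields else ""


# Hypothetical raw entry with six attributes followed by the message.
raw = "a,b,c,d,e,f,Link error, retrying\n"
message = extract_message(raw)
```

Writing only the returned message of every entry to a new file gives the reduced log that the later steps read.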
6.2.3 Options for Visualization
Figure 6.4: Options for visualization
As shown in Figure 6.4, after extracting the message types, the following options are available:
Partition By Token, runtime analysis - this gives a graph that plots the time
taken by the Partition By Token Position operation and its variation with event size.
Distribution of messages by event size - this shows a pie chart which classifies messages
according to their event sizes.
Partition By Bijection vs Event Size - this shows a graph that plots the time
taken by the operation with respect to event size.
Most frequently occurring alerts - this is the most important result of the whole process.
It gives a pie chart indicating messages which are alerts and the largest among them.
6.2.4 Bluegene/P
Figure 6.5: Runtime analysis of Partition by Token operation for Bluegene/P
Figure 6.5 shows the variation of runtime of the Partition by Token operation with respect to
event size for the Bluegene/P dataset. The runtime is given in milliseconds along the Y-axis. The
X-axis shows the various event sizes. The maximum time is taken by messages with event size 2.
When ranges are considered, event sizes 1-5 take most of the CPU time. Some event sizes do not
occur in the dataset and hence have 0 as their runtime. The maximum event size is 68.
Figure 6.6: Pie chart of distribution of messages
Figure 6.6 shows a pie chart which gives the distribution of messages in the dataset based
on their event sizes. The statistics are summarized in Table 6.2.
Event size    %        No. of messages
[1-5]         63.78    3006066
[5-10]        19.91    938610
[10-20]       9.36     450607
[20-50]       6.66     287913
>50           0.39     18297
Table 6.2: Distribution of messages - Bluegene/P
As observed in Table 6.2, the majority of the messages are in the 1-5 range and hence consume
more CPU time in clustering.
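A distribution table of this kind can be computed from the event sizes alone. The Python sketch below uses the five ranges of the table; since the printed ranges share their end points, it assumes that a boundary value belongs to the lower range (event size 5 counts towards [1-5]). The function name `size_distribution` is hypothetical.

```python
def size_distribution(event_sizes):
    """Count messages per event-size range and give each range's percentage."""
    labels = ["[1-5]", "[5-10]", "[10-20]", "[20-50]", ">50"]
    upper_bounds = [5, 10, 20, 50]
    counts = dict.fromkeys(labels, 0)
    for size in event_sizes:
        for label, bound in zip(labels, upper_bounds):
            if size <= bound:  # assumption: boundary values go to the lower range
                counts[label] += 1
                break
        else:
            counts[">50"] += 1
    total = len(event_sizes)
    return {
        label: (n, round(100 * n / total, 2) if total else 0.0)
        for label, n in counts.items()
    }


distribution = size_distribution([1, 5, 6, 10, 11, 60])
```

Each value in the result is a pair of message count and percentage, matching the two data columns of the table.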
Figure 6.7: Runtime analysis of Partition by bijection operation
Figure 6.7 shows a similar runtime curve for the Partition by search for bijection operation
for the Bluegene/P dataset. Again, peak values are observed for message sizes [1-5] and 25. This is
because the majority of messages in the dataset belong to the mentioned event sizes. The X-axis
indicates the event sizes and the Y-axis indicates the runtimes in milliseconds. For other event sizes,
the runtime is either small or negligible.
Figure 6.8: Most frequently occurring alerts - Bluegene/P
Figure 6.8 shows the most frequently occurring alerts for the Bluegene/P dataset. They are
summarized in Table 6.3.
Message type                                                          %         No. of messages
Ciod: Error loading *invalid or missing program image. No such file
or directory                                                          27.20%    152735
* total interrupts. * critical input interrupts. * microseconds total
spent on critical input interrupts, * microseconds max time in a
critical input interrupt.                                             24.06%    135092
Program interrupt: * *                                                10.31%    57784
* TLB error interrupt                                                 8.27%     46416
Instruction cache parity error corrected                              18.36%    105924
Data storage interrupt                                                11.31%    63493
Table 6.3: Most frequently occurring alerts for Bluegene/P
By referring to Table 6.3, the system administrator can select the given clusters and
analyze the messages under each of the alert clusters.
Figure 6.9: Window showing clusters
Figure 6.9 shows each cluster in the top list and its corresponding elements in the lists below.
This makes it simple for a system administrator to narrow down the message he/she might be
searching for. The first list box shows the list of message types extracted. On double-clicking on
any of the message types, the corresponding messages of the cluster are shown in a text area
below. The text area can display a maximum of 50 messages at a time. If there are more
messages in the cluster, the administrator can view them by clicking on a Next button which is
provided.
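The paging behaviour behind the Next button amounts to slicing the cluster into blocks of 50 messages. A minimal Python sketch, with the hypothetical helper name `page_of` and a 0-based page index, is:

```python
def page_of(messages, page_index, page_size=50):
    """Return the messages shown on the given page (0-based page index)."""
    start = page_index * page_size
    return messages[start:start + page_size]


# Hypothetical cluster of 120 messages.
cluster = ["message %d" % n for n in range(120)]
first_page = page_of(cluster, 0)
last_page = page_of(cluster, 2)
```

A page index past the end of the cluster yields an empty list, which a viewer can use to disable the Next button.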
6.2.5 Bluegene/L dataset
Figure 6.10: Runtime analysis for partition by token for Bluegene/L
Figure 6.10 shows the variation of runtime versus event size for the partition by token
operation for the Bluegene/L dataset. Unlike the other two datasets, the majority of the time is spent
on messages of event size 13 and 60. This differs from Bluegene/P, which peaks at message
sizes 1-5, and is due to the different message distribution. Again, the X-axis represents event
sizes and the Y-axis represents time in milliseconds.
Figure 6.11: Distribution of messages with event sizes for Bluegene/L dataset
Figure 6.11 shows the distribution of the messages for the Bluegene/L dataset. Most of the
messages lie in the 10-20 size range. Otherwise, the dataset is uniformly distributed over all event
sizes. Each sector represents one of the pre-defined ranges: [1-5], [5-10], [10-20], [20-50]
and >50.
Event size    %        No. of messages
[1-5]         22.83    386995
[5-10]        11.28    191162
[10-20]       48.94    778920
[20-50]       10.7     181434
>50           9.25     156859
Table 6.4: Distribution of messages - Bluegene/L
Table 6.4 shows the fraction of messages present for each of the pre-defined labels.
The maximum is seen in the [10-20] range, which is indicated in green colour in the chart.
Figure 6.12: Runtime analysis of Partition by bijection for Bluegene/L
Figure 6.12 shows the variation of runtime for Partition by bijection. The peak occurs at
event size 60. The runtime increases greatly with event size, as more tokens have to be parsed
and stored in TokenPosition objects and more prospective bijective relationships must be
considered. Therefore, runtimes are higher for larger event sizes, especially if the number of
messages for that event size is larger than for others. Messages with lower event sizes have fewer
prospective bijective relationships that need to be searched, hence the low values.
Figure 6.13: Most frequently occurring alerts for Bluegene/L dataset
Figure 6.13 shows the most frequently occurring alerts for the Bluegene/L dataset, with each
alert indicated by a unique colour. Their statistics are summarized in the table below.
Message type                                                          %         No. of messages
Instruction cache parity error analysis: Tag bit * * *                19.21%    85119
L1 * cache parity error has occurred in TAG bit * * *                 13.24%    58671
Receiver errors: Node * * * had * correctable errors in the
* direction                                                           21.87%    96900
Single symbol error(s): DDR Controller * failing DRAM address *
BPC pin * transfer * bit * BPC module pin * compute trace * DRAM
chip * DRAM pin *                                                     9.29%     41155
Cache parity error analyis: Tag bit * * *                             29.52%    130802
* symbol error count. Controller * chipselect 0                       6.88%     30494
Table 6.5: Most frequently occurring alerts for Bluegene/L dataset
Table 6.5 shows the most frequently occurring alerts for Bluegene/L with statistics. Each
of these labels can be searched in the "View Message Types" window along with its messages.
6.2.6 LA HPC-1
Figure 6.14: Runtime analysis of Partition By Token for LA HPC-1
Figure 6.14 shows the variation of runtime of the Partition by token operation. The peak is seen at
event size 1 and the runtime is negligible for messages with event size > 20. This can be understood
by looking at the message distribution for HPC-1. There are very few messages with event size > 50.
Compared to the other datasets, this one has the fewest entries and hence clustering completes much
quicker, as I/O takes less time. Again, the X-axis denotes event sizes and the Y-axis denotes
milliseconds.
Figure 6.15: Distribution of messages - LA HPC-1
Figure 6.15 shows the distribution of messages for LA HPC-1. Most of the messages lie in the
1-5 range. More can be inferred from Table 6.6. Messages here fall almost entirely under 3 labels:
[1-5], [5-10] and [10-20]. There are very few messages with event sizes > 20.
Event size    %        No. of messages
[1-5]         83.2     360658
[5-10]        13.98    66610
[10-20]       2.72     11811
[20-50]       0        5
>50           0        10
Table 6.6: Distribution of messages - LA HPC-1
Table 6.6 shows that the overwhelming majority of the messages are in the 1-5 range. This has
a major influence on the runtime.
Figure 6.16: Runtime analysis of Partition By Bijection for LA HPC-1
Figure 6.16 shows the runtime performance of Partition by bijection with respect to event
size. As expected, the peaks occur at event sizes 1-5, and for other values the time is negligible,
in fact almost zero. This is because the majority of the messages are found within that range. Again,
the X-axis denotes the event sizes and the Y-axis denotes the runtimes in milliseconds. This operation
takes around 8 seconds in total for the dataset.
Figure 6.17: Most frequently occurring alerts for LA HPC-1
Figure 6.17 shows the most frequently occurring alerts for LA HPC-1. This is
summarized in Table 6.7.
Message type                                %         No. of messages
Linkerror interval expired                  56.26%    84894
Link errors remain current                  9.04%     13645
Temperature * exceeds warning threshold     0.62%     941
Link error on broadcast tree *              0.77%     1167
Warning                                     32.96%    49741
Psu failure\                                0.34%     517
Table 6.7: Most frequently occurring alerts for HPC-1
Table 6.7 shows the most frequently occurring alerts for HPC-1. All the alerts also have
event sizes < 5. These alerts can be checked in another window along with the messages
belonging to each label.
Figure 6.18: Window to view HPC clusters
Figure 6.18 shows a window with cluster labels and their corresponding elements in the
text area below when a user clicks on a label. This allows for more efficient searching. The text area
can show a maximum of 50 messages at a time. This is because the text area control could crash
if thousands of messages are shown at the same time. If more messages are available, they can be
seen by clicking on the Next button provided.
6.2.7 Partitions generated
Figure 6.19: Partitions generated after 1st step
Figure 6.19 shows the partitions (files) generated after the Partition by event size operation.
For each event size, there exists a file. The above image is for the HPC dataset. Since there are a lot
of files, significant time is spent in reading and writing. Files are named using the notation
"partition<eventsize>". These files are subsequently read for the next operation.
Figure 6.20: Partitions generated during 2nd step
Figure 6.20 shows the partitions generated after the Partition by token position operation.
These are larger in number and are spawned from the partitions in the 1st step. The naming notation
for the files here is "partition<event size>_<token position 1>_<token position 2>", where the
two token positions are derived from the algorithm. This notation makes programming easier
and aids general understanding.
Figure 6.21: Partitions generated during 3rd step
Figure 6.21 shows the partitions generated after the Partition by bijection operation. These include
outliers, i.e. non-bijective partitions, as well. The naming notation for the files here is
"partition<event size>_<token position 1>_<token position 2>_<bijection 1>_<bijection
2>_<unique id>". Bijection 1 and Bijection 2 are indications of a bijective mapping between the
two mentioned positions. An additional string is added to outlier files to mark them. These files
are used for message type extraction.
7. CONCLUSIONS & FUTURE WORK
7.1 Conclusions
Due to the size and complexity of sources of information used by system administrators
in fault management, it has become imperative to find ways to manage these sources of
information automatically.
Application logs are one such source. This work is based on a novel algorithm for
message type extraction from log files, IPLoM. So far, there is no standard approach to
tackling this problem in the literature.
Message types are semantic groupings of system log messages. They are important to
system administrators, as they aid their understanding of the contents of log files.
Administrators become familiar with message types over time and through experience.
This work provides a way of finding these message types automatically. In conjunction
with the other fields in an event (host names, severity), message types can be used for
more detailed analysis of log files.
Through a 3-step hierarchical partitioning process, IPLoM partitions log data into its
respective clusters. In its fourth and final stage, IPLoM produces message type
descriptions or line formats for each of the clusters produced.
IPLoM is able to find message clusters whether or not their instances are frequent. We
demonstrate that IPLoM produces cluster descriptions which match human judgement
more closely when compared to SLCT, Loghound, and Teiresias. Our results
show that a specialized algorithm such as IPLoM can significantly improve the
abstraction level of the unstructured message types extracted from the data.
Message types are fundamental units in any application log file. Determining what
message types can be produced by an application accurately and efficiently is therefore a
fundamental step in the automatic analysis of log files.
Message types, once determined, not only provide groupings for categorizing and
summarizing log data, which simplifies further processing steps like visualization or
mathematical modeling, but also a way of labeling the individual terms (distinct word and
position pairs) in the data.
7.2 Future work
Future work on IPLoM will involve using the information derived from the results of
IPLoM in other automatic log analysis tasks which help with fault management. Work can be
done to optimize the time and space complexity of the algorithms. Furthermore, the large log
files can be compressed using the message types as abstractions.
REFERENCES
[1]. Adetokunbo Makanju, Member, IEEE, A. Nur Zincir-Heywood, Member, IEEE, and
Evangelos E. Milios, Senior Member, IEEE, "A Lightweight Algorithm for Message Type
Extraction in System Application Logs," IEEE Transactions on Knowledge and Data Engineering,
Vol. 24, No. 11, pp. 1921-1936, November 2012.
[2]. J. Stearley, "Towards Informatic Analysis of Syslogs," Proc. IEEE Int'l Conf. Cluster
Computing, pp. 309-318, 2004.
[3]. W. Xu, L. Huang, A. Fox, D. Patterson, and M.I. Jordan, "Detecting Large-Scale System
Problems by Mining Console Logs," SOSP '09: Proc. ACM SIGOPS 22nd Symp. Operating
Systems Principles, pp. 117-132, 2009.
[4]. R. Vaarandi, "A Data Clustering Algorithm for Mining Patterns from Event Logs," Proc.
IEEE Workshop IP Operations and Management, pp. 119-126, 2003.
[5]. S. Goil, H. Nagesh, and A. Choudhary, "MAFIA: Efficient and Scalable Subspace Clustering
for Very Large Data Sets," technical report, Northwestern Univ., 1999.
[6]. J. Stearley, "Sisyphus Log Data Mining Toolkit," http://www.cs.sandia.gov/sisyphus, Jan.
2009.
APPENDIX
Code Snippets:
Partition by token position method:
public void PartitionByTokenPosition()
{
    TokenPosition[] tp = new TokenPosition[1000];
    StreamReader[] sr = new StreamReader[maxEventSize + 1];
    p = new Pair[maxEventSize + 1];
    Stopwatch[] stw = new Stopwatch[100];
    try
    {
        int tokenSplitPosition = -1, numberTokens = -1;
        // Open one reader (and one stopwatch) per event-size partition.
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                sr[i] = new StreamReader(new FileStream(
                    destination + @"\PartitionsByEvent\partition" + i + ".log",
                    FileMode.Open, FileAccess.Read));
                stw[i] = new Stopwatch();
            }
        }
        for (int i = 1; i < 100; i++)
            tp[i] = new TokenPosition();
        // Find the token position with the fewest unique tokens in each partition.
        for (int i = 1; i <= maxEventSize; i++)
        {
            int currentMax;
            if (events[i])
            {
                stw[i].Start();
                String s;
                currentMax = i;
                while ((s = sr[i].ReadLine()) != null)
                {
                    if (i == 1)
                    {
                        tp[1].add(s);
                    }
                    else
                    {
                        String[] splitString = s.Split(' ');
                        for (int j = 0; j < splitString.Length; j++)
                            tp[j + 1].add(splitString[j]);
                    }
                }
                tokenSplitPosition = 1;
                numberTokens = tp[1].getNumberOfTokens();
                for (int k = 2; k <= currentMax; k++)
                {
                    if (numberTokens > tp[k].getNumberOfTokens())
                    {
                        numberTokens = tp[k].getNumberOfTokens();
                        tokenSplitPosition = k;
                    }
                }
                p[i] = new Pair();
                p[i].tokenpos = tokenSplitPosition;
                p[i].tokens = tp[tokenSplitPosition].getTokens();
                for (int j = 1; j <= maxEventSize; j++)
                    tp[j].clear();
                stw[i].Stop();
                // runtimes1[i] = stw[i].ElapsedMilliseconds;
            }
        }
        StreamWriter[,] swr = new StreamWriter[100, 5000];
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                if (i == 1)
                {
                    String[] tok = p[i].tokens;
                    stw[i].Start();
                    for (int j = 0; j < tok.Length; j++)
                    {
                        swr[1, j] = new StreamWriter(new FileStream(destination +
                            @"\PartitionByToken\partition_" + i + "_" + 1 + "_" + (j + 1) + ".log",
                            FileMode.Append, FileAccess.Write));
                        swr[1, j].AutoFlush = true;
                    }
                    stw[i].Stop();
                }
                else
                {
                    String[] tok = p[i].tokens;
                    int pos = p[i].tokenpos;
                    stw[i].Start();
                    for (int j = 0; j < tok.Length && tok[j] != ""; j++)
                    {
                        swr[i, j] = new StreamWriter(new FileStream(destination +
                            @"\PartitionByToken\partition_" + i + "_" + pos + "_" + (j + 1) + ".log",
                            FileMode.Append, FileAccess.Write));
                        swr[i, j].AutoFlush = true;
                    }
                    stw[i].Stop();
                }
            }
        }
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                if (i == 1)
                {
                    String[] tok = p[i].tokens;
                    String s;
                    stw[i].Start();
                    StreamReader sre = new StreamReader(new FileStream(
                        destination + @"\PartitionsByEvent\partition1.log",
                        FileMode.Open, FileAccess.Read));
                    while ((s = sre.ReadLine()) != null)
                    {
                        for (int j = 0; j < tok.Length; j++)
                            if (tok[j].Equals(s))
                            {
                                swr[i, j].WriteLine(s);
                                break;
                            }
                    }
                    stw[i].Stop();
                    runtimes1[i] = stw[i].ElapsedMilliseconds;
                }
                else
                {
                    int tokensplitpos = p[i].tokenpos;
                    String[] tokens = p[i].tokens;
                    stw[i].Start();
                    using (StreamReader sr1 = new StreamReader(new FileStream(
                        destination + @"\PartitionsByEvent\partition" + i + ".log",
                        FileMode.Open, FileAccess.Read)))
                    {
                        String s;
                        while ((s = sr1.ReadLine()) != null)
                        {
                            String[] splitString = s.Split(' ');
                            String a = splitString[tokensplitpos - 1];
                            for (int j = 0; j < tokens.Length; j++)
                                if (tokens[j].Equals(a))
                                {
                                    swr[i, j].WriteLine(s);
                                    break;
                                }
                        }
                    }
                    stw[i].Stop();
                    runtimes1[i] = stw[i].ElapsedMilliseconds;
                }
            }
        }
        for (int i = 1; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                for (int j = 0; j < t.Length; j++)
                    swr[i, j].Close();
            }
        }
    }
    catch (Exception e)
    {
        MessageBox.Show(e.ToString());
    }
    // MessageBox.Show(Convert.ToString(d));
}
Code for Partition by Bijection:
public void PartitionByBijection()
{
    Stopwatch[] stw = new Stopwatch[100];
    runtimes2 = new long[100];
    for (int i = 0; i < 100; i++)
        runtimes2[i] = 0;
    try
    {
        StreamReader[,] str = new StreamReader[100, 5000];
        TokenPosition[] tp = new TokenPosition[1000];
        b = new Bijection[100, 5000];
        bp = new BijectionPair[100, 5000];
        NonBijection[,] nb = new NonBijection[100, 5000];
        int[] bijectioncount = new int[100];
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                stw[i] = new Stopwatch();
                stw[i].Start();
                int tokenpos = p[i].tokenpos;
                String[] t = p[i].tokens;
                for (int j = 0; j < t.Length; j++)
                {
                    str[i, j] = new StreamReader(new FileStream(destination +
                        @"\PartitionByToken\partition_" + i + "_" + tokenpos + "_" + (j + 1) +
                        ".log", FileMode.Open, FileAccess.Read));
                    b[i, j] = new Bijection();
                    bp[i, j] = new BijectionPair();
                }
                stw[i].Stop();
            }
        }
        for (int i = 1; i < 100; i++)
            tp[i] = new TokenPosition();
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                stw[i].Start();
                int tokenpos = p[i].tokenpos;
                String[] t = p[i].tokens;
                for (int j = 0; j < t.Length; j++)
                {
                    String s;
                    while ((s = str[i, j].ReadLine()) != null)
                    {
                        String[] split = s.Split(' ');
                        for (int k = 0; k < split.Length; k++)
                            tp[k + 1].add(split[k]);
                    }
                    // Dictionary<int, int> positions = new Dictionary<int, int>();
                    int mostFreqCount = getMostFrequentCount(tp, i);
                    bool flag = false;
                    if (mostFreqCount == -1)
                    {
                        int pos1 = 0;
                        b[i, j].setBijection(false);
                        nb[i, j] = new NonBijection();
                        for (int k = 1; k <= i; k++)
                        {
                            if (pos1 == 2)
                            {
                                flag = true;
                                break;
                            }
                            if (tp[k].getNumberOfTokens() > 1)
                            {
                                if (pos1 == 0)
                                    b[i, j].setFirstTokenPosition(k);
                                else
                                    b[i, j].setSecondTokenPosition(k);
                                pos1++;
                            }
                        }
                    }
                    if (flag == false)
                    {
                        int[] pos = getTokenPositions(mostFreqCount, tp, i);
                        // if (i == 5) MessageBox.Show(Convert.ToString(mostFreqCount));
                        b[i, j].setBijection(true);
                        if (pos != null)
                        {
                            // MessageBox.Show(Convert.ToString(pos[0]) + " " +
                            //     Convert.ToString(pos[1]));
                            b[i, j].setFirstTokenPosition(pos[0]);
                            b[i, j].setSecondTokenPosition(pos[1]);
                            b[i, j].setMostFrequentCount(mostFreqCount);
                        }
                    }
                    for (int l = 1; l <= i; l++)
                        tp[l].clear();
                    stw[i].Stop();
                }
            }
        }
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                stw[i].Start();
                int pos = p[i].tokenpos;
                for (int j = 0; j < t.Length; j++)
                {
                    if (b[i, j].isBijection())
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        if (pos1 != 0 && pos1 != pos2)
                        {
                            String s;
                            str[i, j].BaseStream.Position = 0;
                            while ((s = str[i, j].ReadLine()) != null)
                            {
                                String[] split = s.Split(' ');
                                String first = split[pos1 - 1];
                                String second = split[pos2 - 1];
                                bp[i, j].add(first, second);
                            }
                        }
                    }
                    else
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        String s;
                        str[i, j].BaseStream.Position = 0;
                        while ((s = str[i, j].ReadLine()) != null)
                        {
                            String[] split = s.Split(' ');
                            String first = split[pos1 - 1];
                            String second = split[pos2 - 1];
                            nb[i, j].add(first, second);
                        }
                    }
                    stw[i].Stop();
                }
            }
        }
        Dictionary<String, StreamWriter> dtw = new Dictionary<String, StreamWriter>();
        StreamWriter[] swtr = new StreamWriter[10000];
        dict = new Dictionary<String, int>();
        int e = 0;
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                int pos = p[i].tokenpos;
                stw[i].Start();
                for (int j = 0; j < t.Length; j++)
                {
                    if (b[i, j].isBijection())
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        Dictionary<String, String> dt = bp[i, j].getBijections();
                        Dictionary<String, bool> dtb = bp[i, j].getOutliers();
                        foreach (KeyValuePair<String, bool> kvp in dtb)
                        {
                            if (kvp.Value == true)
                            {
                                String s1 = kvp.Key;
                                //StreamWriter sw;
                                q = s1;
                                String s2;
                                dt.TryGetValue(s1, out s2);
                                if (s2 != null)
                                {
                                    swtr[e] = new StreamWriter(new FileStream(destination +
                                        @"\PartitionByBijection\partition_" + i + "_" + pos + "_" +
                                        pos1 + "_" + pos2 + "_" + e + ".log",
                                        FileMode.Append, FileAccess.Write));
                                    swtr[e].AutoFlush = true;
                                    if (!dict.ContainsKey(s1 + s2 + pos1 + pos2 + pos + i))
                                        dict.Add(s1 + s2 + pos1 + pos2 + pos + i, e);
                                    e++;
                                }
                            }
                        }
                        stw[i].Stop();
                    }
                }
            }
        }
        int f = 0;
        for (int i = 3; i <= maxEventSize; i++)
        {
            if (events[i])
            {
                String[] t = p[i].tokens;
                stw[i].Start();
                int pos = p[i].tokenpos;
                for (int j = 0; j < t.Length; j++)
                {
                    if (b[i, j].isBijection())
                    {
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        if (pos1 > 0 && pos2 > pos1)
                        {
                            str[i, j].BaseStream.Position = 0;
                            Dictionary<String, String> dt = bp[i, j].getBijections();
                            Dictionary<String, bool> dtb = bp[i, j].getOutliers();
                            String s;
                            while ((s = str[i, j].ReadLine()) != null)
                            {
                                String[] split = s.Split(' ');
                                bool w = false;
                                String s1 = split[pos1 - 1];
                                String s2 = split[pos2 - 1];
                                q = split[pos1 - 1] + "_" + split[pos2 - 1] + "_" + i + "_" + pos + "_" +
                                    pos1 + "_" + pos2;
                                if (dtb.ContainsKey(split[pos1 - 1]))
                                    dtb.TryGetValue(split[pos1 - 1], out w);
                                if (w)
                                {
                                    // Route the line to the writer registered for its token pair.
                                    int c;
                                    dict.TryGetValue(s1 + s2 + pos1 + pos2 + pos + i, out c);
                                    swtr[c].WriteLine(s);
                                }
                            }
                        }
                    }
                    else
                    {
                        String s;
                        str[i, j].BaseStream.Position = 0;
                        int pos1 = b[i, j].getFirstTokenPosition();
                        int pos2 = b[i, j].getSecondTokenPosition();
                        while ((s = str[i, j].ReadLine()) != null)
                        {
                            String[] split = s.Split(' ');
                            String first = split[pos1 - 1];
                            Dictionary<String, List<String>> dt = nb[i, j].getMappings();
                            List<String> l;
                            String second = split[pos2 - 1];
                            dt.TryGetValue(first, out l);
                            // Append the line to an outlier file when the second token
                            // is among the values mapped to the first token.
                            foreach (String k in l)
                            {
                                if (second.Equals(k))
                                {
                                    using (StreamWriter sw = new StreamWriter(new FileStream(destination +
                                        @"\PartitionByBijection\partition_" + i + "_" + j + "_" + pos + "_" +
                                        pos1 + "_" + pos2 + "_outlier_" + first.Substring(0, 1) + "_" +
                                        k.Substring(0, 1) + ".log",
                                        FileMode.Append, FileAccess.Write)))
                                        sw.WriteLine(s);
                                    break;
                                }
                            }
                        }
                    }
                    stw[i].Stop();
                }
            }
        }
        for (int i = 3; i <= maxEventSize; i++)
            if (events[i])
                runtimes2[i] = stw[i].ElapsedMilliseconds;
        for (int i = 0; i < e; i++)
            swtr[i].Close();
    }
    catch (Exception e)
    {
        MessageBox.Show(e.ToString());
        MessageBox.Show(q);
    }
}