
Mechanical Systems and Signal Processing 50-51 (2015) 427–436


Novelty detection by multivariate kernel density estimation and growing neural gas algorithm

Olga Fink a,⁎, Enrico Zio b,c, Ulrich Weidmann a

a Institute for Transport Planning and Systems, ETH Zurich, Wolfgang-Pauli-Str. 15, 8093 Zurich, Switzerland
b Chair on Systems Science and the Energetic Challenge, European Foundation for New Energy-Electricité de France (EDF) at École Centrale Paris and SUPELEC, France
c Department of Energy, Politecnico di Milano, Italy

Article history: Received 2 May 2013; Received in revised form 25 March 2014; Accepted 28 April 2014; Available online 29 May 2014

Keywords: Novelty detection; Multivariate kernel density estimation; Growing neural gas; Railway turnout system

Abstract

One of the underlying assumptions when using data-based methods for pattern recognition in diagnostics or prognostics is that the selected data sample used to train and test the algorithm is representative of the entire dataset and covers all combinations of parameters and conditions, and resulting system states. However, in practice, operating and environmental conditions may change, unexpected and previously unanticipated events may occur, and corresponding new anomalous patterns may develop. Therefore, for practical applications, techniques are required to detect novelties in patterns and give confidence to the user on the validity of the performed diagnosis and predictions.

In this paper, the application of two types of novelty detection approaches is compared: a statistical approach based on multivariate kernel density estimation and an approach based on a type of unsupervised artificial neural network, called the growing neural gas (GNG). The comparison is performed on a case study in the field of railway turnout systems. Both approaches demonstrate their suitability for detecting novel patterns. Furthermore, GNG proves to be more flexible, especially with respect to the dimensionality of the input data and suitability for online learning.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Techniques of machine learning have been developed to ensure that the information contained in the available dataset is used optimally and to make algorithms learn patterns efficiently, even when these patterns do not occur very frequently [17].
The main properties sought in the models generated by machine learning techniques are: (i) generalization ability, i.e. the ability to generalize patterns from training data to previously unseen data, and (ii) fault tolerance, i.e. the ability to ignore noise in input data and assure a stable model structure with respect to small changes [10].
Cross validation [10], for example, can be applied in the training mode to determine the model structure and parameters
so as to ensure that the entire dataset is exploited. Then, the robustness of the performance of the algorithms can be judged
with respect to the variability in the results when applied to different subsets of the data.
Bootstrap, a resampling technique [4], can be applied to bias the density of underrepresented patterns in the dataset,
so as to enable the algorithm to learn all patterns from sufficiently frequent occurrences in the training data.

⁎ Corresponding author. Tel.: +41 44 633 27 28.
E-mail address: ofink@ethz.ch (O. Fink).

http://dx.doi.org/10.1016/j.ymssp.2014.04.022
0888-3270/© 2014 Elsevier Ltd. All rights reserved.

One of the underlying assumptions when using data-based methods for pattern recognition in diagnostics or prognostics
is that the selected data sample used to train and test the algorithm is representative of the entire dataset and covers all
combinations of parameters and conditions, and resulting system states. However in practice, operating and environmental
conditions may change, unexpected and previously unanticipated events may occur and corresponding new anomalous
patterns develop. Many machine learning techniques have difficulty in recognizing novel patterns that follow a different
distribution or were recorded in changed operating or environmental conditions, or evolved due to some anomalous
conditions that were not covered by the training dataset. Therefore, techniques are required to detect novel data patterns.
The task of detecting patterns that differ from those on which the applied algorithm was trained and tested arises in problems such as intrusion detection [27], autonomous mobile robots detecting novelty in their environment [21], and monitoring of system conditions to detect anomalous behavior [25,24,11].
For the solution of these problems and others, several approaches have been proposed in the literature for novelty
detection. These can be statistical approaches, in which the input data are analysed based on their statistical properties [19].
The statistical models for the analyses are subsequently used to determine if the new patterns considered are from the same
distribution as the training data used to build them. Statistical approaches can be subdivided into parametric and non-
parametric [19]. Parametric approaches assume an underlying distribution and determine the parameters of the distribution
that best fits the data patterns. In non-parametric approaches, the form of the density function is derived from the data
without a priori assuming a specific distribution [19]. Besides statistical approaches, several soft computing techniques have
been applied to detect novelty in the data patterns, such as different types of neural networks [20], support vector machines
[20] and artificial immune systems [3,26].
Parametric approaches have a narrow field of application, especially for high-dimensional data, where the type of distribution has to be determined not only for one parameter but for several interdependent parameters. Some soft computing approaches also show limitations and are, for example, not suitable for applications in which online updating is required. This is the case, for example, for self-organizing maps (SOMs) [20,25], whose structure is fixed prior to the learning process.
In this paper, two types of novelty detection approaches are compared based on their application to a case study in the
field of railway turnout systems: a statistical approach based on multivariate kernel density estimation and an approach
based on a type of unsupervised artificial neural network, called the growing neural gas (GNG). Contrary to SOM, GNG does
not require an a priori definition of network structure, but the structure evolves during the learning process based on the
presented patterns. As the structure is not fixed but is adaptable, GNG can also learn new evolving patterns in an online
learning process and is able to adapt to dynamically changing operating conditions.
Multivariate statistical process control analyses, monitors and diagnoses process operating performance [18,2]. Some of
the approaches applied to facilitate the processing of multidimensional process parameters are similar to those applied in
this study. For example, principal component analysis is an approach to reduce the dimensionality of the data while retaining the relevant information in the reduced number of dimensions [16].
The remainder of the paper is organized as follows. Section 2 presents the two applied approaches and their theoretical
background. Section 3 describes the case study and the applied data, which are derived from the railway turnout system.
Section 4 presents the evaluation of the approaches applied on the case study. Finally, Section 5 discusses the obtained
results and presents the conclusions of this research.

2. Applied approaches

2.1. Selecting novelty detection algorithms

With respect to the requirements of practical applications, the criteria used to select the algorithms for this research are
applicability to multi-dimensional input data, flexibility, adaptability and the ability to learn novel patterns online as they
evolve. Two different approaches are selected: a statistical approach and a soft computing approach.
As there is no information available on the type and form of the underlying distribution of the input data, a non-
parametric approach is selected based on kernel density estimation. Furthermore, as the dataset is multi-dimensional,
multivariate kernel density estimation is applied.
From the soft computing approaches, a growing neural gas algorithm is selected due to its flexibility in the learning
process and in adapting its structure to new evolving patterns.

2.2. Multivariate kernel density estimation (MVKDE)

Kernel density estimation is a flexible approach for estimating the density of a given data distribution when no information is available on the type of the underlying distribution [22,23]. It is also referred to as the Parzen window or Parzen-Rosenblatt window method [19]. Kernel density estimation has some similarities to histogram building. One of the main differences between the construction principles of the kernel density function and those of a histogram is that the density calculation is based on an interval placed around the observed value x, and not on an interval containing x that is placed around a predefined bin center [9].

For multi-dimensional datasets, multivariate kernel density estimation is applied. In the following, some basic concepts of multivariate kernel density estimation are given, based on [9].

Given a d-dimensional random vector $X = (X_1, \ldots, X_d)^T$, where $X_1, \ldots, X_d$ are one-dimensional random variables, the vector $X_i = (X_{i1}, \ldots, X_{id})^T$, $i = 1, \ldots, n$, represents the $i$th observation of the $d$ variables, where $X_{ij}$ is the $i$th observation of the random variable $X_j$. The probability density function (pdf) of $X$ is given by the joint pdf of the random variables $(X_1, \ldots, X_d)^T$:

$$f(x) = f(x_1, \ldots, x_d) \qquad (1)$$

The kernel functions are applied to the scaled distances $u = (x - X_i)/h$, where $h$ is the smoothing parameter, the so-called bandwidth. Assuming that the bandwidth can be set individually for each dimension, $h = (h_1, \ldots, h_d)^T$, the density estimator can be given as

$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_1 \cdots h_d}\, K\!\left(\frac{x_1 - X_{i1}}{h_1}, \ldots, \frac{x_d - X_{id}}{h_d}\right) \qquad (2)$$

In the multi-dimensional space, there are different approaches to form the multi-dimensional kernel $K(u) = K(u_1, \ldots, u_d)$. For this research, the multiplicative kernel is applied: $K(u) = K(u_1) \cdot \ldots \cdot K(u_d)$, where $K(u_j)$ is a univariate kernel function. Eq. (2) can thereby be rewritten as

$$\hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{h_j}\, K\!\left(\frac{x_j - X_{ij}}{h_j}\right) \qquad (3)$$

The pdf depends strongly on the selection of the bandwidth parameter vector. Several approaches have been proposed in the literature for setting the bandwidths, such as Silverman's rule of thumb [9]. Another approach, applied in this research, is to set the bandwidths through least squares cross-validation [9]. By this approach, a generic bandwidth, h, is selected so as to minimise the integrated mean square error between the estimated and actual distributions:

$$\mathrm{IMSE}(h) = \int \left\{\hat{f}_h(x) - f(x)\right\}^2 \, dx \qquad (4)$$
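To make Eq. (3) concrete, the following is a minimal sketch of the multiplicative-kernel estimator with Gaussian kernel functions; the function name, data and bandwidths are illustrative only:

```python
import numpy as np

def multiplicative_gaussian_kde(x, X, h):
    """Evaluate the estimator of Eq. (3) at a query point x, given the
    n x d sample X and the per-dimension bandwidth vector h."""
    u = (x - X) / h                                   # scaled distances, shape (n, d)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # univariate Gaussian kernel
    return float(np.mean(np.prod(K / h, axis=1)))     # product over j, mean over i

# Illustrative usage on random two-dimensional data
X = np.random.default_rng(0).normal(size=(500, 2))
print(multiplicative_gaussian_kde(np.zeros(2), X, np.array([0.3, 0.3])))
```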

2.3. Growing neural gas

Growing neural gas is a special type of unsupervised artificial neural network based on a self-organizing algorithm that constructs a graph of nodes connected by edges [7]. GNG is an adaptive algorithm that is able to adjust its structure based on the changing distribution of the input vectors [6]. Contrary to the SOM, which is based on similar principles, the structure of the GNG is not predefined within the set-up process but evolves during training as the patterns are presented to the learning algorithm; it is thereby more flexible.
Growing neural gas is a method of Vector Quantization (VQ). VQ comprises a set of methods used to quantize input vectors with a limited set of output vectors [8], also referred to as code-vectors. The set of all possible code-vectors is referred to as the codebook and is generated during the training process.
GNG constructs graphs with the dimensionality of the input data, exploiting information on the underlying data distribution, specifically the distribution of the data patterns in the input space.
Given an unlabeled input dataset $X = \{\vec{x}_1, \ldots, \vec{x}_n\} \subset \mathbb{R}^s$, the dataset is organized into a limited set of representative output vectors that capture the structure of $X$ using a set of reference vectors $W = \{\vec{w}_1, \ldots, \vec{w}_m\}$. As GNG is an adaptive algorithm, it can also be applied to online learning processes with a temporally growing dataset $X^t = \{\vec{x}_1, \ldots\} \subset \mathbb{R}^s$.

The structure of the GNG network is created based on the similarity between the input vectors, whereby the similarity between two vectors is measured by their Euclidean distance

$$d(a, b) = \sqrt{\sum_{i=1}^{s} (a_i - b_i)^2} \qquad (5)$$

with the Euclidean norm

$$\|a\| = \sqrt{\sum_{i=1}^{s} a_i^2} = \sqrt{a \cdot a} \qquad (6)$$

Each of the $k$ nodes in the GNG is described by a reference vector $\vec{w}_k$, a local accumulated error variable $E_k$, which gives the accumulated squared distance to the portion of the input distribution that node $k$ covers, and a set of edges connecting the node to its topological neighbors. The edges are described by their age, which represents the timeliness of the network representation with respect to the recently presented inputs.
In the following, the approach of applying GNG is described, based on [6,7]:

1. The algorithm begins with two randomly positioned nodes that are connected by an edge, with the age of the edge and the local errors both equal to zero.
2. For each randomly selected input vector $\vec{x}_i$, two nodes $s$ and $t$ are identified with corresponding reference vectors $\vec{w}_s$ and $\vec{w}_t$, such that $s$ is the nearest node to $\vec{x}_i$, i.e. $\|\vec{w}_s - \vec{x}_i\|^2$ is the smallest distance value, and node $t$ is the second nearest, i.e. $\|\vec{w}_t - \vec{x}_i\|^2$ is the second smallest distance value.
3. The local error $E_s$ is updated by adding the quantity $\|\vec{w}_s - \vec{x}_i\|^2$:
   $$E_s = E_s + \Delta E, \qquad \Delta E = \|\vec{w}_s - \vec{x}_i\|^2 \qquad (7)$$
4. The reference vectors of the winner (i.e. the closest node $s$) and of its direct topological neighbors, $N_s$, are adapted by fractions of the total distance to the input signal, weighted by $\epsilon_b$, the learning rate of the winner node $s$, and $\epsilon_n$, the learning rate of the neighboring nodes $N_s$ (with $\epsilon_b, \epsilon_n \in [0, 1]$ and $\epsilon_b \gg \epsilon_n$), respectively. The reference vector of the winner node is updated as
   $$\Delta\vec{w}_s = \epsilon_b (\vec{x}_i - \vec{w}_s) \qquad (8)$$
   The reference vectors of the direct topological neighbors of the winner node are updated accordingly as
   $$\Delta\vec{w}_j = \epsilon_n (\vec{x}_i - \vec{w}_j) \quad \forall j \in N_s \qquad (9)$$
5. Subsequently, the age of the edges connecting node $s$ to its topological neighbors is incremented by 1.
6. If the nodes $s$ and $t$ are already connected by an edge, the age of this edge is set to zero. If the nodes are not connected, an edge between them is created.
7. Edges with an age larger than a specified maximum value $a_{\max}$ are removed. After removing the edges, nodes that are no longer connected to other nodes and do not have any edges are also removed from the graph.
8. If the current iteration is an integer multiple of the defined parameter $\lambda$ and the maximum predefined number of nodes in the graph has not been reached, a new node $r$ is inserted according to the following procedure:
   - Find the node $u$ with the largest local error.
   - Among the neighbors of $u$, find the node $v$ with the largest local error.
   - Insert the node $r$ halfway between $u$ and $v$:
     $$\vec{w}_r = \frac{\vec{w}_u + \vec{w}_v}{2} \qquad (10)$$
   - Create edges between $u$ and $r$ and between $v$ and $r$, and then remove the edge between $u$ and $v$.
   - Decrease the error variables of $u$ and $v$ by a factor $\alpha$ and set the error of node $r$:
     $$E_u = \alpha \, E_u \qquad (11)$$
     $$E_v = \alpha \, E_v \qquad (12)$$
     $$E_r = E_u \qquad (13)$$
9. Decrease the error variables of all nodes $\gamma$, of which the current network structure is comprised, by a factor $\beta$:
   $$E_\gamma = E_\gamma - \beta \, E_\gamma \qquad (14)$$

Steps 2-9 are repeated as long as the stopping criterion has not been met, which can be either the maximum number of nodes or a defined performance criterion; a minimal code sketch of this loop is given below.
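The following sketch implements steps 1-9 in Python. All names are illustrative; the default parameter values follow the case-study settings of Section 4.3, and node removal in step 7 is simplified in that isolated nodes are kept rather than deleted:

```python
import numpy as np

class GrowingNeuralGas:
    """Minimal sketch of the GNG loop (steps 1-9); illustrative only."""

    def __init__(self, dim, eps_b=0.2, eps_n=0.006, lam=100, a_max=50,
                 alpha=0.5, beta=0.005, max_nodes=530, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1: two randomly positioned nodes connected by an edge
        self.w = [rng.random(dim), rng.random(dim)]   # reference vectors
        self.error = [0.0, 0.0]                       # local errors E_k
        self.edges = {(0, 1): 0}                      # (i, j) -> edge age
        self.eps_b, self.eps_n, self.lam = eps_b, eps_n, lam
        self.a_max, self.alpha, self.beta = a_max, alpha, beta
        self.max_nodes, self.step = max_nodes, 0

    def _neighbors(self, k):
        return [b if a == k else a for (a, b) in self.edges if k in (a, b)]

    def fit_one(self, x):
        self.step += 1
        # Step 2: nearest node s and second-nearest node t
        d2 = [float(np.sum((w - x) ** 2)) for w in self.w]
        s, t = (int(i) for i in np.argsort(d2)[:2])
        # Step 3: accumulate the squared distance as local error (Eq. (7))
        self.error[s] += d2[s]
        # Step 4: move the winner and its neighbors towards x (Eqs. (8), (9))
        self.w[s] = self.w[s] + self.eps_b * (x - self.w[s])
        for j in self._neighbors(s):
            self.w[j] = self.w[j] + self.eps_n * (x - self.w[j])
        # Step 5: age every edge emanating from s
        for e in self.edges:
            if s in e:
                self.edges[e] += 1
        # Step 6: refresh (or create) the edge between s and t
        self.edges[tuple(sorted((s, t)))] = 0
        # Step 7: remove edges older than a_max (node removal omitted)
        self.edges = {e: a for e, a in self.edges.items() if a <= self.a_max}
        # Step 8: insert a new node every lam steps (Eqs. (10)-(13))
        if self.step % self.lam == 0 and len(self.w) < self.max_nodes:
            u = int(np.argmax(self.error))
            nb = self._neighbors(u)
            if nb:  # u may have lost all edges in step 7
                v = max(nb, key=lambda j: self.error[j])
                self.w.append(0.5 * (self.w[u] + self.w[v]))
                self.error[u] *= self.alpha
                self.error[v] *= self.alpha
                self.error.append(self.error[u])
                r = len(self.w) - 1
                self.edges.pop(tuple(sorted((u, v))), None)
                self.edges[tuple(sorted((u, r)))] = 0
                self.edges[tuple(sorted((v, r)))] = 0
        # Step 9: global error decay (Eq. (14))
        self.error = [e - self.beta * e for e in self.error]
```

Calling `fit_one` repeatedly on randomly drawn training vectors reproduces the loop; the stopping criterion (maximum node count or accumulated error) is left to the caller.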

2.4. Setting-up procedure and parameter selection

2.4.1. Setting-up procedure of MVKDE


For the MVKDE, in the first step, a kernel function, K(u), has to be selected. The kernel function represents the way in
which the values around the observed value x are weighted. There are several possible kernel functions that can be selected,
such as uniform, triangle, Epanechnikov, Gaussian, and Cosine [9]. An overview of possible kernel functions is shown in
Table 1, where IðÞ is the indicator function.
For the multivariate kernel density estimation, additionally, the form of the multidimensional kernel needs to be
selected. One of the widely used forms is the multiplicative kernel.
A further parameter that needs to be selected is the bandwidth, h. There are basically two ways to select the bandwidth: (i) rules of thumb, which are widely used in univariate kernel density estimation, and (ii) cross-validation. In the multidimensional case, it is very difficult to select several bandwidths and to evaluate their influence on the form of the pdf iteratively, which can be performed more easily in the one-dimensional case. Therefore, for the multivariate case, the cross-validation approach is recommended.
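The following is a sketch of the least squares cross-validation criterion of Eq. (4) for Gaussian product kernels, under the simplifying assumption of a single bandwidth shared across all dimensions (per-dimension bandwidths follow the same pattern). The closed-form first term uses the fact that two Gaussian kernels convolve to a Gaussian with doubled variance; the data and bandwidth grid are illustrative:

```python
import numpy as np

def lscv_score(h, X):
    """Least squares cross-validation score for a product Gaussian kernel
    with one bandwidth h shared by all d dimensions."""
    n, d = X.shape
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise sq. dists
    # Closed-form integral of fhat^2: Gaussians convolve to variance 2h^2
    term1 = np.sum(np.exp(-D / (4 * h ** 2))) / (n ** 2 * (2 * h * np.sqrt(np.pi)) ** d)
    # Leave-one-out estimate of fhat at each sample point
    phi = np.exp(-D / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi)) ** d
    np.fill_diagonal(phi, 0.0)
    term2 = 2.0 * np.sum(phi) / (n * (n - 1))
    return term1 - term2

# Select the bandwidth minimising the score over a grid
X = np.random.default_rng(1).normal(size=(200, 2))   # illustrative data
grid = np.linspace(0.05, 1.0, 40)
h_star = grid[np.argmin([lscv_score(h, X) for h in grid])]
```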

Table 1
Feasible kernel functions, derived from [9], where $I(\cdot)$ is the indicator function.

Kernel: $K(u)$
Uniform: $\frac{1}{2}\, I(|u| \le 1)$
Triangle: $(1 - |u|)\, I(|u| \le 1)$
Epanechnikov: $\frac{3}{4}(1 - u^2)\, I(|u| \le 1)$
Quartic (biweight): $\frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$
Triweight: $\frac{35}{32}(1 - u^2)^3\, I(|u| \le 1)$
Gaussian: $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}u^2\right)$
Cosine: $\frac{\pi}{4} \cos\left(\frac{\pi}{2}u\right)\, I(|u| \le 1)$

Generally, the MVKDE is very sensitive to the "curse of dimensionality" phenomenon [1]. This phenomenon is due to the fact that the volume of the space increases disproportionately to the increase of the available data, so that the data become sparse and the estimation error large [1]. Therefore, for high-dimensional input data, dimensionality reduction techniques, such as principal component analysis or similar techniques, are usually applied before the probability density function is determined.

2.4.2. Setting-up procedure of GNG


There are several parameters that have to be selected for the GNG algorithm, prior to training and applying it.
An overview of the parameters is provided in Table 2.
The selection of the specific values for each of the parameters highly depends on the input distribution. Additionally, the
parameters cannot be selected independently, as selection of some parameters depends on the values of other parameters.
The learning rates, $\epsilon_b$ and $\epsilon_n$, are selected in such a way that $\epsilon_b \gg \epsilon_n$. The learning rate of the neighboring nodes, $\epsilon_n$, is usually one or two orders of magnitude smaller than the winner learning rate, $\epsilon_b$.
Small values of λ induce a fast error reduction at the beginning of the learning process, but may require more steps to
converge to the desired accumulated local error. However, large values of λ will induce a slow convergence as new nodes
will be inserted only after a large number of steps. It is generally possible to make λ adaptive, setting it to small values at the
beginning of the learning process, and then increasing it in the process of learning. This is a crucial parameter for online
learning approaches, if the novelty is detected by the changing structure of the network.
The maximum age, amax, ensures that the network structure only represents values that occur frequently. This is,
likewise, an important parameter for online learning approaches, because patterns that have already been presented to the
algorithm but have not occurred for a large number of steps would be detected as novel by the algorithm.
If a desired performance of the GNG is defined, the maximum accumulated local error should be selected as the stopping
criterion. However in some cases, an iterative approach may be required, as very small values for Emax will lead to long
computing time.
The global reduction of the errors of all nodes in the network at a given training step ensures that recent errors have
greater influence and also avoids a disproportional growth of local errors.
The error reduction factor of new nodes, α, is often set to 0.5, which means that the error assigned to the new node is the
average of the errors of the nodes among which the new node is inserted. However, there are different approaches to assign
the error value of the new nodes which do not depend on the errors of “old” nodes [5].

3. Case study

3.1. Selected system and data of the case study

The two approaches for novelty detection are validated on a case study on classifying the degradation states of a railway
turnout system.
Turnouts are critical components within the railway network, particularly at locations with high capacity utilization rates. Turnouts consist of several parts, including the turnout blades, the stock rails, the so-called "frog" and the turnout actuator, which positions the moveable parts of the turnout (Fig. 1).
If the evolution of the system condition occurs gradually and is observable in one or several system performance parameters, these parameters can be used to predict the future evolution of the system condition and the remaining useful life [15].
The degradation process of the turnout system depends on several parameters, including axle loads, train speeds,
conditions of the train wheels, and environmental conditions.

Table 2
Relevant parameters for setting up GNG.

Winner learning rate, $\epsilon_b$: Scaling factor for the distance that the winning node is moved towards the presented input vector
Learning rate of neighboring nodes, $\epsilon_n$: Scaling factor for the distance that the neighboring nodes of the winning node are moved in the direction of the presented input vector
Insertion frequency, $\lambda$: Number of steps after which a new node can be inserted
Maximum age, $a_{\max}$: Number of epochs after which the connecting edge is removed from the network
Stopping criterion, $E_{\max}$ or $ep_{\max}$: Either maximum accumulated local error or maximum number of steps
Error reduction factor of new nodes, $\alpha$: The amount by which the error of every new unit is reduced
Error reduction factor of all nodes, $\beta$: The amount by which the error of all nodes is reduced in each step

Fig. 1. Parts of the turnout system.

Monitoring the condition of critical systems and components, such as turnouts located in network links operated close to
their capacity limits, can help railway operators to anticipate component failures and to implement cost effective maintenance
regimes.
The data used in this research were collected from six force transducers installed along a turnout system located at a railway tunnel portal. Two force measurement bolts measure the positioning forces at the frog actuator systems. Four additional measuring bolts measure the positioning forces of the actuator systems along the turnout blades. The force measurement is activated when the positioning process starts, and the system records the forces applied at each single location along the turnout.
The system measures the applied forces for each millisecond of the positioning process. In the post-processing of the
data, the system also computes the work performed by the actuator system and stores this information for each of the
measurement locations separately. The performed work corresponds to the integral of the force curve. Since the turnout is
positioned in different directions, the applied forces can vary. For this case study, positioning processes for only one
direction were considered. The total observation period considered in this research was about 3.5 years.

3.2. General procedure

In this research, a two-step procedure was applied. In the first step, a supervised learning algorithm, extreme learning machines, was trained to discriminate between the patterns belonging to the two classes, force-curves with an overall high or low level of performed work, based on the shapes of the measured force curves.
In the second step, the two selected approaches for novelty detection are first trained on the training data of the original dataset. Subsequently, the original testing dataset is modified to generate novel patterns, in order to imitate the reaction of the system to changed operating conditions or to faults and failures. The original and the modified testing datasets are both presented to the selected algorithms.
Fig. 2 displays the applied general procedure.

3.3. Classification approach

It can be assumed that the larger the positioning forces applied and the amount of work performed, the higher the level of degradation. Therefore, by identifying states with high levels of performed work, the states with a high level of degradation can also be identified.
In this research, two levels of aggregation were considered. On the disaggregated level, distinct force-curves for one single selected movement mechanism were evaluated. To classify these distinct force-curves, the aggregated work performed at all six monitoring points along the turnout was applied. Two classes were defined: force-curves with an overall high (high-class) or low (low-class) level of performed work. The shape of the curves was used in the classification task. Therefore, each force-curve was normalized to the interval [0, 1] with the distinct value range of each curve.

Fig. 2. Applied general procedure.

Extreme learning machines (ELMs) were applied for the classification task. The ELM is a feedforward network with a single hidden layer and flexible processing units. ELMs combine the strengths of several different machine learning techniques, such as support vector machines and feedforward neural networks with different activation functions, including sigmoidal, polynomial and radial-basis functions [13].
The main advantage of ELMs is that the learning procedure is very fast and computationally efficient [14]. They have shown good precision and generalization ability in several benchmark studies [14]. An additional advantage is that the parameters of the ELM do not have to be set and tuned manually, but are either set randomly or determined within the learning procedure.
Ridge regression [12] with a regularization term of 0.01 was applied. In ridge regression, a regularization term is included
in the minimization of residuals to impose rigidity.
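As an illustration of this set-up, the following is a minimal ELM sketch with a random sigmoidal hidden layer and ridge-regularized output weights; the 0.01 regularization term is the value stated above, while the hidden-layer size and all names are illustrative assumptions:

```python
import numpy as np

def train_elm(X, y, n_hidden=200, reg=0.01, seed=0):
    """Single-hidden-layer ELM: random input weights and biases, sigmoidal
    activations, output weights fitted by ridge regression."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(size=(d, n_hidden))             # random input weights
    b = rng.normal(size=n_hidden)                  # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))         # hidden-layer outputs
    # Ridge regression: solve (H'H + reg*I) beta = H'y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                                # threshold at 0.5 for 2 classes
```

For the two-class task described here, the labels can be encoded as 0/1 and the network output thresholded at 0.5.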
A holdout technique was applied to validate the generalization ability of the algorithm. The training dataset contained
90% of the entire available dataset and the testing dataset the rest of it. In total there were 11,966 samples in the dataset,
6159 of which belonged to the high-class and 5809 to the low-class: 10,769 data samples were used for training and 1197 for
testing.
The algorithm classified 99.63% of the training data samples and 99.31% of the testing data samples correctly. In the
testing dataset, eight data samples in total, out of 1197, were misclassified. There is a very small discrepancy between the
training and testing errors, which is an indication of a very good generalization ability of the algorithm.

4. Applying the selected approaches to the case study and evaluating the results

4.1. Generating novel patterns

In order to test the proposed approaches for novelty detection, novel patterns were created by the following procedure:

$$z_{ij} = \begin{cases} x_{ij} & \text{if } x_{ij} > 0.9 \\ 0.5\, x_{ij} & \text{if } x_{ij} \le 0.9 \end{cases} \qquad (15)$$

where $x_{ij}$ is the $i$th observation of the $j$th measurement in the original dataset and $z_{ij}$ is the $i$th observation of the $j$th measurement of the novel dataset.
As the patterns are still normalized on their specific data range, the maximum value is still equal to 1. To preserve this, the values within the range [0.9, 1.0] are not changed, and the remaining measured values are scaled to half of the value they had in the original dataset (Fig. 3).
As the classification task is to recognize the shapes of the curves, this modification of the patterns induces a significant change in the shape of the curves. The generated patterns mimic outlier behavior. Both the original testing dataset and the modified testing dataset contained 1197 data patterns.
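In vectorized form, Eq. (15) reduces to a single line; the array `X` holding the normalized curves row-wise is a hypothetical placeholder:

```python
import numpy as np

X = np.random.default_rng(0).random((5, 10))   # placeholder for the curves
Z = np.where(X > 0.9, X, 0.5 * X)              # Eq. (15): keep peaks, halve the rest
```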

4.2. Applying the multivariate kernel density estimation

The multivariate kernel density estimation (MVKDE) with multiplicative kernels and Gaussian kernel functions (Eq. (3))
is applied. The bandwidth is set through least squares cross-validation (Eq. (4)).
The MVKDE is performed on the first two principal components of the data in order to enable visualization of the
probability density function. The first two principal components represent 45% of the variation in the original data space.
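A minimal sketch of this projection step is given below; the array names `X_train` and `X_test` are hypothetical, and the principal axes are obtained from an SVD of the centered training data:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))           # placeholder training curves
X_test = rng.normal(size=(100, 10))            # placeholder test curves

# Project onto the first two principal components of the training data
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
P_train = (X_train - mu) @ Vt[:2].T            # 2-D training scores
P_test = (X_test - mu) @ Vt[:2].T              # test data in the same basis
```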
434 O. Fink et al. / Mechanical Systems and Signal Processing 50-51 (2015) 427–436

Fig. 3. Original (a) and modified (b) force-curves of a selected positioning process.

Fig. 4. Multivariate kernel density estimation of first two principal components.

The multivariate density function is estimated based on the training dataset of the original data. The resulting probability
density function of the training dataset is presented in Fig. 4.
After determining the pdf based on the training dataset, the similarity of the testing data patterns to the training data is determined. This is equivalent to the question of whether the ELM algorithm was presented with similar data patterns in the testing process as in the training process. Therefore, for each data pattern in the testing dataset, the probability density based on the multivariate kernel density function derived with the training dataset is determined. The smaller the pdf for a data pattern from the testing dataset, the less frequently a similar data pattern was presented to the ELM algorithm during the training process. The smaller the value of the pdf, the higher the degree of novelty of the specific data pattern and the higher the probability of misclassifying the pattern.
For the original testing dataset, the pdf values were in the range [0.004, 1.24]. For the modified dataset, the pdf values were in the range [7.12 × 10⁻⁶, 1.13]. Even though the maximum values of the probability densities are similar between the testing dataset and the modified dataset, 87.6% of the values in the modified dataset have a probability density of less than 0.05. This shows that these patterns are dissimilar to the patterns contained in the training dataset. On the contrary, 99.2% of the values in the testing dataset have a probability density greater than 0.05. This confirms that the data in the testing dataset are from the same distribution as the training data.
The evaluation of the probability densities confirms that the patterns derived from the same dataset show a high degree
of similarity to the patterns used in the training dataset and therefore also high values of probability density. On the other
hand, the modified data patterns have not been represented by similar data patterns in the training dataset and are
therefore recognized as novel and previously unseen by the kernel density estimation.
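Continuing the sketches given earlier (the estimator from Section 2.2 and the projected scores `P_train` and `P_test` from the projection step above), the thresholding described here reduces to a few lines; the 0.05 cut-off is the value used in this evaluation, while `h` stands for the cross-validated bandwidth:

```python
import numpy as np

# Density of each test pattern under the KDE fitted to the training scores
h = np.array([0.3, 0.3])                      # placeholder for the LSCV bandwidth
dens = np.array([multiplicative_gaussian_kde(p, P_train, h) for p in P_test])
novel = dens < 0.05                           # flag low-density patterns as novel
```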

4.3. Applying the GNG algorithm

Similar to the MVKDE, the structure of the GNG was determined by the training dataset. For the GNG, the following parameters were selected: the maximum age was set to 50 epochs, the learning rate of the winner node to 0.2, and that of the neighboring nodes to 0.006. A new node is inserted every 100 steps. The error of every new unit is reduced by the factor 0.5. The global error reduction factor, β, of all nodes is set to 0.005. The training of the network was stopped after an average error of 0.05 was achieved. The network consisted of 530 nodes when the learning process was stopped.
For each of the data patterns in the testing dataset, the Euclidean distance to the nearest neighboring node in the
network was computed: the smaller the distance, the more similar the node to the patterns used during the training
process; the greater the distance, the more novel the pattern.
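Assuming a GNG trained as sketched in Section 2.3 (the object name `gng` and the array `X_test` are illustrative), this novelty score is a short nearest-neighbor query:

```python
import numpy as np

# Distance from each test pattern to its nearest node in the trained GNG
W = np.array(gng.w)                                  # trained reference vectors
d_nn = np.linalg.norm(X_test[:, None, :] - W[None, :, :], axis=2).min(axis=1)
novel = d_nn > 0.9                                   # threshold reported below
```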
The distances to the nearest neighboring nodes for the testing dataset were within the interval [0.17, 0.92], and for the modified testing data within the interval [0.83, 1.65]. Similar to the MVKDE, there is a variability of distances within the datasets: 99.3% of the patterns of the testing dataset have a Euclidean distance to their nearest neighbor of less than 0.7, and only two patterns have a distance greater than 0.9. On the contrary, 98.3% of the patterns in the modified dataset have a Euclidean distance to their nearest neighbor greater than 0.9.
This evaluation shows that GNG was able to recognize the novelty in the patterns.

4.4. Novelty detection combined with classification

In addition to analyzing the novelty on the modified dataset, both approaches were used to determine whether the misclassification of the eight patterns that were not correctly classified by the ELM algorithm in the testing dataset was caused by a high degree of novelty of the misclassified patterns. Each of the approaches recognizes one of the misclassified patterns as novel. One of the eight patterns showed the largest distance to the nearest GNG node in the entire testing dataset, 0.92. The MVKDE recognized a misclassified pattern with a pdf of 0.027 as novel.
However, the two approaches do not recognize the same patterns as novel. This is due to their different ways of approaching the novelty detection task. When the structure of the GNG is generated in the learning process, new nodes are inserted in the regions with the largest local error, which means that the nodes are very dissimilar in these regions. The learning algorithm is set up in such a way that it strives to represent the training dataset by its network structure as well as possible. The multivariate kernel density estimation, however, only determines the density of the presented data patterns and does not actively learn to represent the distribution of the patterns in the input space.
For practical applications, a two-step approach is proposed: detecting the novelty of the patterns in the first step and performing the actual classification task in the second step. For this procedure, thresholds have to be defined which determine the classification of a pattern as novel, based either on the distance to the nearest neighbor or on the density. The thresholds can be selected based on the consequences and the criticality of misclassification or, respectively, the consequences of discarding the novel pattern and not classifying it. Furthermore, it is possible not only to determine whether a pattern is novel, but also to define its degree of novelty, determined for example by the distance to the nearest node in the network. A fuzzy membership function can be defined for the degree of novelty of a pattern. Moreover, the evolving patterns can be used for online learning to adjust the structure of the GNG network as new patterns appear. In this case, the classification algorithm would also have to be set up for online learning.

5. Discussion and conclusions

This paper presented and compared two different approaches to detecting novel patterns that are dissimilar to those
used for training classification algorithms. The approaches were validated on a case study from a railway turnout system, in
which the shapes of the force-curves were significantly changed. The approaches were applied to the testing dataset and a
modified dataset. Both approaches are able to detect novelty in the modified dataset and to recognize misclassified patterns
as novel.
The MVKDE is very sensitive to the “curse of dimensionality” phenomenon [1]. Therefore, for high dimensional input
data, dimensionality reduction techniques are usually applied, such as principal component analysis or similar techniques.
GNG does not show any limitations in terms of the dimensionality of the input data.
One of the advantages of applying GNG is its ability to adapt its structure to new data. However, this behavior also implies that structures representing "old" data, i.e. data that have not been presented to the GNG for a defined number of recent epochs, become obsolete, and the connections between the corresponding nodes are deleted. This behavior is due to the assumption that data representing the currently valid input space reappear regularly in the input stream. This behavior can be desirable for certain applications. If it is not desirable, the memory of the algorithm can be enlarged by increasing the number of epochs after which the connections become obsolete, or by resorting to ensembles of models.
Even though GNG is suitable for online learning, this ability shows some limitations, as nodes can only be inserted after a predefined number of steps. Additionally, if the novelty of a pattern is evaluated not by the proximity to the nearest neighbor, as in this case study, but by the evolution of a new node in the network structure, the approach may not be sufficiently flexible. In that case, the number of steps after which a new node is inserted into the network structure has to be significantly decreased to ensure flexibility.

The focus of the paper is primarily on comparing the two selected approaches for novelty detection. For this purpose, the
data presented to the algorithms were modified in order to be able to compare the performance of the two approaches on
data patterns with the defined novel character. This procedure is suitable to compare and demonstrate the specific
characteristics of the two approaches. However, in the next step, the two approaches need to be applied to detect novel
patterns in diagnostics tasks in which the novel patterns are not known a priori and their degree of novelty is varying.

Acknowledgments

The authors would like to thank BLS AG for providing the data for this research project.
The participation of Olga Fink to this research is partially supported by the Swiss National Science Foundation (SNF)
under Grant number 205121_147175.
The participation of Enrico Zio to this research is partially supported by the China NSFC under Grant number 71231001.

References

[1] R.E. Bellman, Adaptive Control Processes: A Guided Tour, third ed., Princeton University Press, Princeton, New Jersey, 1966.
[2] S. Bersimis, S. Psarakis, J. Panaretos, Multivariate statistical process control charts: an overview, Quality Reliab. Eng. Int. 23 (2007) 517–543, http://dx.doi.org/10.1002/qre.829.
[3] D. Dasgupta, Artificial Immune Systems and their Applications, Springer, Berlin, 1999.
[4] B. Efron, R. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, 1993.
[5] B. Fritzke, Fast learning with incremental RBF networks, Neural Process. Lett. 1 (1994) 2–5.
[6] B. Fritzke, Growing cell structures: a self-organizing network for unsupervised and supervised learning, Neural Netw. 7 (1994) 1441–1460.
[7] B. Fritzke, A growing neural gas network learns topologies, Adv. Neural Inf. Process. Syst. 7 (1995) 625–632.
[8] A. Gersho, R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Boston, 1992.
[9] W. Haerdle, Nonparametric and Semiparametric Models, Springer-Verlag, Berlin, 2004.
[10] S.S. Haykin, Neural Networks and Learning Machines, 3rd ed. Pearson Education, Upper Saddle River, 2009.
[11] T. Heyns, P.S. Heyns, J.P. de Villiers, Combining synchronous averaging with a Gaussian mixture model novelty detection scheme for vibration-based condition monitoring of a gearbox, Mech. Syst. Signal Process. 32 (2012) 200–215.
[12] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67.
[13] G.B. Huang, D. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2011) 107–122.
[14] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[15] A.K.S. Jardine, D. Lin, D. Banjevic, A review on machinery diagnostics and prognostics implementing condition-based maintenance, Mech. Syst. Signal
Process. 20 (2006) 1483–1510.
[16] I.T. Jolliffe, Principal Component Analysis, second ed., Springer, New York, 2004.
[17] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: International Joint Conference on Artificial
Intelligence (IJCAI), Morgan Kaufmann, 1995, pp. 1137–1145.
[18] J. MacGregor, T. Kourti, Statistical process control of multivariate processes, Control Eng. Pract. 3 (1995) 403–414, http://dx.doi.org/10.1016/0967-0661(95)00014-L.
[19] M. Markou, S. Singh, Novelty detection: a review—part 1: statistical approaches, Signal Process. 83 (2003) 2481–2497.
[20] M. Markou, S. Singh, Novelty detection: a review—part 2: neural network based approaches, Signal Process. 83 (2003) 2499–2521.
[21] S. Marsland, U. Nehmzow, J. Shapiro, On-line novelty detection for autonomous mobile robots, Robot. Auton. Syst. 51 (2005) 191–206.
[22] E. Parzen, On estimation of a probability density function and mode, Ann. Math. Stat. 33 (1962) 1065–1076.
[23] M. Rosenblatt, Remarks on some nonparametric estimates of a density function, Ann. Math. Stat. (1956) 832–837.
[24] C. Surace, K. Worden, Novelty detection in a changing environment: a negative selection approach, Mech. Syst. Signal Process. 24 (2010) 1114–1128.
[25] M.L.D. Wong, L.B. Jack, A.K. Nandi, Modified self-organising map for automated novelty detection applied to vibration signal monitoring, Mech. Syst.
Signal Process. 20 (2006) 593–610.
[26] K. Worden, W.J. Staszewski, J.J. Hensman, Natural computing for mechanical systems research: a tutorial overview, Mech. Syst. Signal Process. 25
(2011) 4–111.
[27] D.Y. Yeung, C. Chow, Parzen-window network intrusion detectors, in: Proceedings of the 16th International Conference on Pattern Recognition, vol. 4,
2002, pp. 385–388.
