
Mechanical Systems and Signal Processing 21 (2007) 1300–1317
Decision tree and PCA-based fault diagnosis
of rotating machinery
Weixiang Sun, Jin Chen, Jiaqing Li
State Key Laboratory of Vibration, Shock & Noise, Shanghai Jiao Tong University, Shanghai 200030, PR China
Received 3 November 2005; received in revised form 22 June 2006; accepted 26 June 2006
Available online 7 September 2006
Abstract
After analysing the flaws of conventional fault diagnosis methods, data mining technology is introduced to the fault
diagnosis field, and a new method based on the C4.5 decision tree and principal component analysis (PCA) is proposed. In this
method, PCA is used to reduce features after data collection, preprocessing and feature extraction. Then, C4.5 is trained on
the samples to generate a decision tree model containing diagnosis knowledge. At last the tree model is used to make the
diagnosis analysis. To validate the proposed method, six kinds of running states (normal or without any defect, unbalance,
rotor radial rub, oil whirl, shaft crack and a simultaneous state of unbalance and radial rub) are simulated on a Bently
Rotor Kit RK4 to test the C4.5 and PCA-based method and a back-propagation neural network (BPNN). The result shows that
the C4.5 and PCA-based diagnosis method has higher accuracy and needs less training time than BPNN.
© 2006 Elsevier Ltd. All rights reserved.
Keywords: Fault diagnosis; Rotating machinery; Decision tree; C4.5; Data mining; Principal component analysis
1. Introduction
Rotating machinery such as turbines and compressors are the key equipment in oil refineries, power plants
and chemical engineering plants. Defects and malfunctions (simply called faults) of these machines will result
in significant economic loss. Therefore, these machines must be under constant surveillance. When a possible
fault is detected, diagnosis is carried out to pinpoint the fault. That is to say, diagnosis is a process of locating
the exact cause(s) of a fault.
Once a fault has been detected, the maintenance engineer has to identify the symptoms, analyse the
symptomatic information, interpret the various error messages and indications, and come up with the right
diagnosis of the situation in terms of which components may have caused the fault and the reasons for the
fault of the components. Since a machine has many components and is highly complex, its fault diagnosis
usually requires technical skill and experience. It also requires extensive understanding of the machine's
structure and operation, and some general concepts of diagnosis. This requires an expert engineer to have
domain-specific knowledge of maintenance and know the ins and outs of the system [1].

Corresponding author. Fax: +86 21 6293 2220x823.
E-mail addresses: wxsun@sjtu.edu.cn (W. Sun), jinchen@mail.sjtu.edu.cn (J. Chen), jqli_vsn@sjtu.edu.cn (J. Li).
In order to better equip a non-expert to carry out the diagnosis operations, it would be wise to present
an approach that defines the cause–symptom relationship for quick comprehension and concise representation.
Presently, many diagnosis methods have been proposed to help maintenance engineers do the diagnosis analysis.
For example, expert systems [2], neural networks [3], soft-computing technologies (fuzzy logic, rough sets and
genetic algorithms) [4,5] and their integrated methods [6] are the popular approaches. Unfortunately, because
of the bottleneck of knowledge acquisition, the application of expert systems is limited. Also, given the
complexity of machinery and how little is known of the fault mechanisms, the diagnosis methods based on neural
networks and soft-computing technology need to be studied further to improve the diagnosis performance,
such as increasing diagnosis accuracy and decreasing running time.
Data mining is the process of discovering knowledge from large amounts of data, which needs no prior
domain knowledge. Data mining is an extension of machine learning. It emerged during the late 1980s, made
great strides during the 1990s and is expected to continue to flourish in the 21st century. Data mining
has many topics such as classification, clustering, association, prediction, etc. Recently, the classification problem
has been a research hotspot, and the decision tree is one of the most widely used classification methods. CLS [7] is the
prototype of the decision tree; many other decision tree algorithms (e.g. ID3 [8], C4.5 [9], CART [10], SLIQ
[11], SPRINT [12], BOAT [13]) have since been proposed. CART and C4.5 (a successor of ID3) are the two best-known
and most widely used algorithms. CART was developed by statisticians, while C4.5 was developed by a computer
scientist in the field of machine learning. The decision tree created by CART is a binary tree in which each split
generates exactly two branches. In the decision tree created by C4.5, each split can generate more than two
branches. What's more, C4.5 can solve classification problems with continuous-valued attributes.
Generally, a good classifier must have the following properties [28]:
(1) Predictive accuracy: The ability of the model to correctly predict the class label of new or previously unseen
data.
(2) Speed: The computational cost involved in generating and using the model.
(3) Robustness: The ability of the model to make correct predictions given noisy data or data with missing
values.
(4) Interpretability: The level of understanding and insight that is provided by the classification
model.
According to the study, the C4.5 model introduced by J.R. Quinlan [9] satisfies the above criteria.
Data mining has been successfully applied to business areas such as market basket analysis, fraud detection and
customer retention [14]. Some data mining algorithms have also been applied to machinery fault diagnosis. For
example, Bayesian statistical learning theory was used to diagnose rotating machines [15], a decision table was
used to diagnose boilers in thermal power plants [16], and a fuzzy clustering method was used to obtain fault patterns
to diagnose transformers [17]. However, there are few reports on the application of decision trees to rotating
machinery fault diagnosis. Therefore, the decision tree algorithm C4.5 is employed in this paper for fault
diagnosis of rotating machinery. At the same time, in order to decrease the dimension of the training samples and
to increase the building efficiency of the decision tree, the principal component analysis (PCA) algorithm [18], an
unsupervised dimension reduction method, is used for feature reduction.
The rest of the paper is organised as follows. In Section 2, the C4.5 algorithm is described. The feature
reduction principle of PCA is depicted in Section 3. The experiment of fault diagnosis based on the decision
tree and PCA is studied in Section 4, followed by conclusions in Section 5.
2. C4.5 algorithm
Most decision tree algorithms consist of two distinct phases, a building (or growing) phase followed by a
pruning phase. The C4.5 algorithm is no exception.
2.1. Building phase
In the building phase, the training sample set with discrete-valued attributes is recursively partitioned until all
the records in a partition have the same class. Initially, the tree has a single root node for the entire training
set. Then, for every partition, a new node is added to the decision tree. For a set of samples in a partition S, a
test attribute X is selected for further partitioning the set into S_1, S_2, ..., S_L. New nodes for S_1, S_2, ..., S_L are
created and added to the decision tree as children of the node for S. Also, the node for S is labelled
with test X, and the partitions S_1, S_2, ..., S_L are then recursively partitioned. A partition in which all the records
have an identical class label is not partitioned further, and the leaf corresponding to it is labelled with the class.
The construction of the decision tree depends very much on how a test attribute X is selected. C4.5 uses an
information entropy evaluation function as the selection criterion [9]. The entropy evaluation function is
calculated in the following way.
Step 1: Calculate Info(S) to identify the class in the training set S:

Info(S) = -\sum_{i=1}^{K} \frac{freq(C_i, S)}{|S|} \log_2\left(\frac{freq(C_i, S)}{|S|}\right),   (1)

where |S| is the number of cases in the training set, C_i is a class, i = 1, 2, ..., K, K is the number of classes and
freq(C_i, S) is the number of cases included in C_i.
Step 2: Calculate the expected information value Info_X(S) for test X to partition S:

Info_X(S) = \sum_{i=1}^{L} \frac{|S_i|}{|S|} \, Info(S_i),   (2)

where L is the number of outputs for test X, S_i is the subset of S corresponding to the ith output and |S_i| is the
number of cases of subset S_i.
Step 3: Calculate the information gain after the partition according to test X:

Gain(X) = Info(S) - Info_X(S).   (3)
Step 4: Calculate the partition information value SplitInfo(X) acquired when S is partitioned into L subsets:

SplitInfo(X) = -\frac{1}{2} \sum_{i=1}^{L} \left[ \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} + \left(1 - \frac{|S_i|}{|S|}\right) \log_2 \left(1 - \frac{|S_i|}{|S|}\right) \right].   (4)
Step 5: Calculate the gain ratio of Gain(X) over SplitInfo(X):

GainRatio(X) = Gain(X) / SplitInfo(X).   (5)
The GainRatio(X) compensates for the weak point of Gain(X) which represents the quantity of information
provided by X in the training set. Therefore, an attribute with the highest GainRatio(X) is taken as the root of
the decision tree.
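As a concrete illustration of Steps 1–5, the following Python sketch computes the class entropy, the expected information after a split, the split information as defined in Eq. (4), and the gain ratio. The function names are ours, not part of the original C4.5 implementation; the same helpers are reused in the sketches of Sections 2.3 and 2.4.

```python
import math
from collections import Counter

def info(labels):
    """Info(S): class entropy of a list of class labels, Eq. (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_x(subsets):
    """Info_X(S): expected information after splitting S into subsets, Eq. (2)."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * info(s) for s in subsets)

def split_info(subsets):
    """SplitInfo(X) as written in Eq. (4) (symmetric form with the 1/2 factor)."""
    n = sum(len(s) for s in subsets)
    total = 0.0
    for s in subsets:
        p = len(s) / n
        for q in (p, 1.0 - p):
            if q > 0:
                total -= q * math.log2(q)
    return total / 2.0

def gain_ratio(labels, subsets):
    """GainRatio(X) = Gain(X) / SplitInfo(X), Eqs. (3) and (5)."""
    gain = info(labels) - info_x(subsets)
    return gain / split_info(subsets)
```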
2.2. Pruning phase
Researchers have found that a large decision tree constructed from a training set usually does not retain its
accuracy over the whole sample space because of over-training or over-fitting. Therefore, a fully grown decision tree
needs to be pruned by removing the less reliable branches to obtain better classification performance over the
whole instance space, even though it may have a higher error over the training set.
A number of empirical methods have been proposed for pruning decision trees. They can be divided into two
types, construction-time pruning (or pre-pruning) and pruning after building a fully grown tree (or post-
pruning). Pre-pruning methods (e.g. the threshold method and the χ² test method [19]) are used to decide when to stop
expanding a decision tree. But the stopping criterion of pre-pruning methods is often based on local
information. On the contrary, the post-pruning methods (e.g. cost-complexity [20], critical value [21] and
reduced error [22]) often use global information.
The C4.5 algorithm applies an error-based post-pruning strategy to deal with the over-training problem, which
is a pessimistic error pruning method. As a matter of fact, for each classification node C4.5 calculates a kind of
predicted error rate based on the total aggregate of misclassifications at that particular node. The error rate is
calculated as the upper limit of the a% (25% is the default value for C4.5) confidence interval for the mean E/N
of a binomial distribution B(E/N), where E/N is the proportion of misclassifications at the node at issue. The
error-based pruning technique essentially boils down to the replacement of vast subtrees in the classification
structure by singleton nodes or simple branch collections if these actions contribute to a drop in the overall
error rate of the root node.
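A short sketch of the pessimistic error estimate follows. It computes the binomial upper confidence limit via the beta quantile (a Clopper–Pearson style bound); C4.5's own closed-form approximation differs in detail, so this is only an illustrative assumption, not the exact code used inside C4.5.

```python
from scipy.stats import beta

def upper_error_limit(errors, n, cf=0.25):
    """Upper limit of the binomial confidence interval for the error rate at a
    node with n cases and `errors` misclassifications (CF = 0.25 is the C4.5
    default).  Solves P(X <= errors | n, p) = cf for p."""
    if errors >= n:
        return 1.0
    # For errors == 0 this reduces to 1 - cf ** (1 / n),
    # the special case quoted by Quinlan.
    return beta.ppf(1.0 - cf, errors + 1, n - errors)

def predicted_errors(errors, n, cf=0.25):
    """Predicted error count used when comparing a subtree with a single leaf."""
    return n * upper_error_limit(errors, n, cf)
```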
2.3. Discretisation of continuous-valued attribute
Because most of the signals in the fault diagnosis field have continuous values, we must know how C4.5
solves classification problems with continuous attributes. In fact, the discretisation of continuous-valued
attributes in the C4.5 algorithm is a process of selecting the optimal threshold.
For a continuous-valued attribute X, suppose it has m values in the training set and the values are sorted in
ascending order, i.e. {a_1, a_2, ..., a_m} with a_1 ≤ a_2 ≤ ... ≤ a_m. A particular value a_i partitions the samples into
two groups, {a_1, a_2, ..., a_i} and {a_{i+1}, a_{i+2}, ..., a_m}: one has X values up to a_i, the other has X values greater
than a_i. Thus a_i is a candidate threshold for discretisation, so there are m - 1 possible partitions, i.e. m - 1
candidate thresholds. For each of these partitions, compute the information gain (see Section 2.1) and choose
the partition (say the jth partition) that maximises the gain. Accordingly, the boundary value a_j of the optimal
partition is selected as the optimal threshold.
This dynamic discretisation is carried out for each candidate attribute every time the best test attribute is
selected.
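A minimal sketch of this threshold search, reusing the info and info_x helpers from the Section 2.1 sketch (the function name is ours):

```python
def best_threshold(values, labels):
    """Pick the cut point of a continuous attribute that maximises the
    information gain, as described above.  `values` and `labels` are parallel
    lists; every distinct sorted value except the largest is a candidate."""
    base = info(labels)
    best = (None, -1.0)
    for t in sorted(set(values))[:-1]:          # the m - 1 candidate thresholds
        low = [c for v, c in zip(values, labels) if v <= t]
        high = [c for v, c in zip(values, labels) if v > t]
        gain = base - info_x([low, high])
        if gain > best[1]:
            best = (t, gain)
    return best                                  # (optimal threshold, its gain)
```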
2.4. An example of how to build a decision tree
Given the dataset S shown in Table 1, it can be found that nine samples of S belong to class 1 (C1) and five
samples belong to class 2 (C2). So Info(S) of dataset S can be calculated:

Info(S) = -(9/14) log_2(9/14) - (5/14) log_2(5/14) = 0.940 bit.
Table 1
A two-class dataset S with three attributes
Att 1 Att 2 Att 3 ClsID
A 70 T C1
A 90 T C2
A 85 F C2
A 95 F C2
A 70 F C1
B 90 T C1
B 78 F C1
B 65 T C1
B 75 F C1
C 80 T C2
C 70 T C2
C 80 F C1
C 80 F C1
C 96 F C1
Attribute 1 (X1 for short) divides the dataset S into three subsets, therefore

Info_X1(S) = (5/14)[-(2/5) log_2(2/5) - (3/5) log_2(3/5)] + (4/14)[-(4/4) log_2(4/4)]
           + (5/14)[-(3/5) log_2(3/5) - (2/5) log_2(2/5)] = 0.694 bit.

The information gain after the partition using test X1 can be calculated:

Gain(X1) = 0.940 - 0.694 = 0.246 bit.
The partition information SplitInfo(X1) of attribute X1 is calculated as below:

SplitInfo(X1) = -1/2 {(5/14) log_2(5/14) + (1 - 5/14) log_2(1 - 5/14)
              + (4/14) log_2(4/14) + (1 - 4/14) log_2(1 - 4/14)
              + (5/14) log_2(5/14) + (1 - 5/14) log_2(1 - 5/14)} = 1.372 bit.

Therefore, the information gain ratio obtained by using attribute X1 to partition the dataset S can be computed:

GainRatio(X1) = 0.246/1.372 = 0.179.
Likewise, for attribute 3 (X3 for short), the information gain ratio can be calculated:

Info_X3(S) = (6/14)[-(3/6) log_2(3/6) - (3/6) log_2(3/6)] + (8/14)[-(6/8) log_2(6/8) - (2/8) log_2(2/8)] = 0.892 bit,

Gain(X3) = 0.940 - 0.892 = 0.048 bit,

SplitInfo(X3) = -1/2 {(6/14) log_2(6/14) + (1 - 6/14) log_2(1 - 6/14)
              + (8/14) log_2(8/14) + (1 - 8/14) log_2(1 - 8/14)} = 0.985 bit,

GainRatio(X3) = 0.048/0.985 = 0.049.
The second attribute (X2 for short) is a continuous one. Before using X2 to partition the dataset, the best
threshold must be found to discretise X2. The value set of X2 is {65, 70, 75, 78, 80, 85, 90, 95, 96}, so
the candidate threshold set is {65, 70, 75, 78, 80, 85, 90, 95}. For every candidate threshold, the
information gain can be calculated; the values are 0.048, 0.015, 0.045, 0.09, 0.102, 0.025, 0.01 and 0.048. Hence
the optimal threshold is selected as z = 80. By using X2 with the optimal threshold to partition the dataset, the
information gain ratio can be calculated:
Info_X2(S) = (9/14)[-(7/9) log_2(7/9) - (2/9) log_2(2/9)] + (5/14)[-(2/5) log_2(2/5) - (3/5) log_2(3/5)] = 0.838 bit,

Gain(X2) = 0.940 - 0.838 = 0.102 bit,

SplitInfo(X2) = -1/2 {(9/14) log_2(9/14) + (1 - 9/14) log_2(1 - 9/14)
              + (5/14) log_2(5/14) + (1 - 5/14) log_2(1 - 5/14)} = 0.940 bit,

GainRatio(X2) = 0.102/0.940 = 0.108.
Among the three information gain ratios, GainRatio(X1) is the largest one. Therefore, attribute 1 (X1) is
selected first to partition the dataset. Each attribute value of X1 generates a branch corresponding to a
subset such as S1, S2 and S3, so three branches are generated. The decision tree built after the first-level
partition is displayed in Fig. 1.
Fig. 1 shows that the four samples of the middle branch (X1 = B) belong to the same class C1. Therefore,
unlike the left branch (X1 = A) and the right branch (X1 = C), the middle branch is a leaf node which does not
need to be partitioned any further.
For subset S1 (samples of the left branch), the information gain ratio according to attribute 3 can be
computed; it is 0.021. Because the second attribute is continuous, the optimal threshold must also be
found first. The three information gains for the three candidate thresholds (70, 85, 90) are 0.971,
0.420 and 0.171. Hence the optimal threshold is determined as z = 70. Using X2 with the optimal threshold to
partition the subset S1, the information gain ratio obtained is 1. Therefore, attribute 2 (X2) is selected to
partition the subset S1.
For subset S3 (samples of the right branch), the information gain ratio by using X3 to partition S3 is 1. The
two information gains for the candidate discretising thresholds (70, 80) of X2 are 0.322 and 0.171, so
the optimal threshold is z = 70. The information gain ratio by using X2 with the optimal threshold to
partition S3 is then 0.446. Therefore, attribute 3 (X3) is selected to partition subset S3.
After the second-level partition, the samples of each branch of the decision tree belong to just one class, so the
partition procedure stops. The decision tree generated through two-level partitioning is displayed in Fig. 2.
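Using the helper functions sketched in Sections 2.1 and 2.3, the first-level figures of this example can be reproduced; tiny differences in the last decimal place may appear because the text above rounds intermediate values.

```python
# Dataset S of Table 1: (Att1, Att2, Att3, class)
S = [("A", 70, "T", "C1"), ("A", 90, "T", "C2"), ("A", 85, "F", "C2"),
     ("A", 95, "F", "C2"), ("A", 70, "F", "C1"), ("B", 90, "T", "C1"),
     ("B", 78, "F", "C1"), ("B", 65, "T", "C1"), ("B", 75, "F", "C1"),
     ("C", 80, "T", "C2"), ("C", 70, "T", "C2"), ("C", 80, "F", "C1"),
     ("C", 80, "F", "C1"), ("C", 96, "F", "C1")]
labels = [r[3] for r in S]

# Info(S): about 0.940 bit
print(info(labels))

# Gain ratio of the discrete attribute X1: about 0.18 (0.179 above)
x1_subsets = [[r[3] for r in S if r[0] == v] for v in ("A", "B", "C")]
print(gain_ratio(labels, x1_subsets))

# Continuous attribute X2: optimal threshold 80, gain ratio about 0.11 (0.108 above)
t, _ = best_threshold([r[1] for r in S], labels)
x2_subsets = [[r[3] for r in S if r[1] <= t], [r[3] for r in S if r[1] > t]]
print(t, gain_ratio(labels, x2_subsets))
```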
3. Feature reduction using PCA
Given a set of n-dimensional feature vectors x_t (t = 1, 2, ..., m), generally n < m.
Fig. 1. Decision tree model built after the first-level partition.
Fig. 2. The complete decision tree built based on dataset S.
Let

\mu = \frac{1}{m} \sum_{t=1}^{m} x_t.   (6)

Then, the covariance matrix of the feature vectors is

C = \frac{1}{m} \sum_{t=1}^{m} (x_t - \mu)(x_t - \mu)^T.   (7)
The principal components (PCs) are computed by solving the eigenvalue problem of the covariance matrix C,

C v_i = \lambda_i v_i,   (8)

where \lambda_i (i = 1, 2, ..., n) are the eigenvalues, sorted in descending order, and v_i (i = 1, 2, ..., n) are the
corresponding eigenvectors.
To represent the raw feature vectors with low-dimensional ones, what needs to be done is to compute the
first k eigenvectors (k ≤ n) which correspond to the k largest eigenvalues. In order to select the number k, a
threshold \theta is introduced to denote the approximation precision of the k largest eigenvectors:

\sum_{i=1}^{k} \lambda_i \Big/ \sum_{i=1}^{n} \lambda_i \ge \theta.   (9)

Given the precision parameter \theta, the number of eigenvectors k can be decided.
Let

V = [v_1, v_2, ..., v_k],   \Lambda = diag(\lambda_1, \lambda_2, ..., \lambda_k).

After the matrix V is decided, the low-dimensional feature vectors (the PCs) of the raw ones are determined as
follows:

P = V^T x_t.   (10)
The PCs of PCA have three properties [23]: (1) they are uncorrelated; (2) they have sequentially maximum
variances; (3) the mean-squared approximation error in the representation of the original vectors by the first
several PCs is minimal.
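A compact NumPy sketch of this procedure follows. The function name is ours, and the data matrix is assumed to hold one feature vector per column, as in the text; Eq. (10) projects the raw vectors, so the projection below does the same.

```python
import numpy as np

def pca_reduce(X, theta=0.98):
    """Reduce n-dimensional feature vectors to k dimensions with PCA.
    X is an (n, m) matrix whose m columns are the feature vectors x_t.
    Returns the projected (k, m) matrix P and the projection matrix V."""
    mu = X.mean(axis=1, keepdims=True)                 # Eq. (6)
    Xc = X - mu
    C = (Xc @ Xc.T) / X.shape[1]                       # Eq. (7)
    eigvals, eigvecs = np.linalg.eigh(C)               # Eq. (8), ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, theta) + 1)         # smallest k satisfying Eq. (9)
    V = eigvecs[:, :k]
    P = V.T @ X                                        # Eq. (10)
    return P, V
```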
4. Experimental analysis
A simulation experiment of rotor faults is carried out on the Bently Rotor Kit RK4. Then, through data collection,
preprocessing, feature extraction and feature reduction, training samples and test samples are obtained. After
that, the training samples are used to train C4.5 and generate a decision tree. At last the decision tree model is
employed to make a diagnosis test on the test samples, and the diagnosis result is compared with the result of a
neural network. The experimental flow chart is shown in Fig. 3.
4.1. Data collection
Six running states of a rotor are simulated on Bently Rotor Kit RK4. They are normal (without any fault/
defect), unbalance, rotor radial rub, oil whirl, shaft crack and a simultaneous state of unbalance and radial
rub. The six states are abbreviated to NORM, UBAL, RRUB, OILW, CRAK, UBRR, which are listed in
Table 2.
In the experiment [33], the diameter of the shaft is 10 mm and the length of the shaft is 560 mm. The rotor
mass wheel is 800 g and its diameter is 75 mm. In order to simulate the unbalance state, a 1.2 g setscrew
tightened in the mass wheel is used as the unbalance mass. The radial rubbing state is simulated by using a rub
screw to hit the radial surface of the shaft. The rub screw is secured in the mounting block with a locknut and
is adjusted to obtain different rub degrees. The simultaneous state of unbalance and radial rub is simulated by
adjusting the rub screw to touch the shaft of the already unbalanced rotor kit. To simulate the oil whirl state, a
journal, a fluid film bearing, an oil pump, a main pressure valve, two mass wheels and a preload frame are used. The
assembly of these components is described in Ref. [33]. Because a transverse crack usually appears near the
connection point of the mass wheel and the shaft in practice, a transverse crack is created near the joint of the
wheel and shaft to simulate the shaft crack state. The depth of the crack is about one fourth of the shaft
diameter.
Since vibration is the most useful monitoring signal in mechanical engineering and can well reflect the
condition of a running machine, vibration is selected as the monitoring signal in this paper. Eddy current
sensors are used to collect the radial vibration of the rotor. The typical waveforms and spectra of the vibration
signals in the six states are shown in Fig. 4.
Fig. 5 is a sketch of the data acquisition system, and Fig. 6 is a photo of the experimental system, which
includes the Bently Rotor Kit RK4 test bed (comprising motor speed controller, sensors and signal conditioner),
an anti-aliasing filter, a data acquisition computer, an oscilloscope and a CF350 double-channel dynamic analyser.
Rotor displacement signals are probed by eddy current sensors. Then, after signal conditioning and anti-aliasing
filtering, the displacement information is collected by the computer. The oscilloscope and CF350 analyser are
employed to check the validity of the signals.
According to Ref. [24], the frequency range of general abnormal vibration lies in the low-frequency band [0, 5f_0]
(f_0 is the rotating frequency), the frequency range of gear faults lies in the middle-frequency band [5f_0, 1000 Hz] and
the frequency range of bearing faults lies in the high-frequency band above 1000 Hz. Therefore, the analysis bandwidth
is set to 1000 Hz and the sample frequency is set to 2560 Hz. The number of sample points is 4096.
Fig. 3. The flow chart of the experiment.
Table 2
Simulated running states of a rotor

Class ID   Abbreviation of state type   State description
1          NORM                         Without any defects
2          UBAL                         Unbalance
3          OILW                         Oil whirl
4          RRUB                         Rotor radial rub
5          UBRR                         Unbalance and radial rub
6          CRAK                         Shaft crack
The selected constant rotating speeds are 600, 1200, 3000 and 3600 rpm for the NORM, UBAL, RRUB, UBRR and
CRAK states, and speeds of 1640, 1727 and 1901 rpm are used for the OILW state. In the horizontal and the
vertical directions, 50 groups of sample data are acquired for each running state.
Fig. 4. Waveforms and spectra of the six vibration signals: (a) waveform of vibration signal in NORM state, (b) spectrum of vibration
signal in NORM state, (c) waveform of vibration signal in UBAL state, (d) spectrum of vibration signal in UBAL state, (e) waveform of
vibration signal in OILW state, (f) spectrum of vibration signal in OILW state, (g) waveform of vibration signal in RRUB state, (h)
spectrum of vibration signal in RRUB state, (i) waveform of vibration signal in UBRR state, (j) spectrum of vibration signal in UBRR
state, (k) waveform of vibration signal in CRAK state, (l) spectrum of vibration signal in CRAK state.
4.2. Data preprocessing
As shown in Fig. 7, it is the amplitude A_p (a relative displacement with respect to the equilibrium position), not the
absolute displacement x_p, that reflects the intensity of vibration. Hence the sample data are preprocessed by
removing the mean value.
Given the data series x_1, x_2, ..., x_m, the mean \mu is calculated as in formula (11):

\mu = \frac{1}{m} \sum_{i=1}^{m} x_i.   (11)

Hence, the mean-removed data series are as follows:

x_i' = x_i - \mu,   i = 1, 2, ..., m.   (12)
Fig. 7. A classical vibration waveform.
Fig. 5. Sketch of data acquisition system.
Fig. 6. Experimental system.
4.3. Feature extraction
The diagnostic task for machinery is actually a problem of pattern classification and pattern recognition, of
which the crucial step is feature extraction. In the machinery fault diagnosis field, features are often extracted
in the time domain and the frequency domain, and features from both domains have been applied
successfully [25,26]. In this paper, we extract features in both the time and frequency domains and make full use of
the information from the two kinds of features.
Features in the time domain: There are many feature parameters of a signal in the time domain, including
dimensional parameters such as the mean, root mean square (RMS), etc. and non-dimensional parameters
such as the waveform index, pulse index, etc. In this paper, one dimensional parameter, the peak–peak value, and six
non-dimensional statistical parameters are selected as the time-domain features. The six non-dimensional
parameters are the waveform index, impulsion index, peak index, tolerance index, skewness index and kurtosis index.
Given a periodic signal x(t) with period T, the seven parameters can be calculated as follows.

(1) Peak–peak value:

PP = \max x(t) - \min x(t),   (13)

(2) Waveform index:

S = x_{RMS} / \bar{x}_p,   (14)

(3) Impulsion index:

I = \max |x(t)| / \bar{x}_p,   (15)

(4) Peak index:

C = \max |x(t)| / x_{RMS},   (16)

(5) Tolerance index:

L = \max |x(t)| / x_r,   (17)

(6) Skewness index:

Skewness = \frac{(1/T)\int_0^T (x(t) - \bar{x})^3\,dt}{\left[(1/T)\int_0^T (x(t) - \bar{x})^2\,dt\right]^{3/2}},   (18)

(7) Kurtosis index:

Kurtosis = \frac{(1/T)\int_0^T (x(t) - \bar{x})^4\,dt}{\left[(1/T)\int_0^T (x(t) - \bar{x})^2\,dt\right]^{2}} - 3,   (19)

where \bar{x}_p = (1/T)\int_0^T |x(t)|\,dt, x_r = \left[(1/T)\int_0^T |x(t)|^{1/2}\,dt\right]^2, \bar{x} = (1/T)\int_0^T x(t)\,dt and
x_{RMS} = \left[(1/T)\int_0^T x^2(t)\,dt\right]^{1/2}.
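A sketch of these seven indicators for a sampled (discrete) signal, with the integrals over one period replaced by averages over the sample array; the dictionary keys and function name are ours.

```python
import numpy as np

def time_domain_features(x):
    """Seven time-domain features of formulas (13)-(19) for a sampled signal x
    (integrals over the period are approximated by sample averages)."""
    x = np.asarray(x, dtype=float)
    abs_mean = np.abs(x).mean()                        # x_p
    root_amp = (np.sqrt(np.abs(x)).mean()) ** 2        # x_r
    rms = np.sqrt((x ** 2).mean())                     # x_RMS
    mean = x.mean()
    m2 = ((x - mean) ** 2).mean()                      # second central moment
    return {
        "peak_peak": x.max() - x.min(),                # (13)
        "waveform_index": rms / abs_mean,              # (14)
        "impulsion_index": np.abs(x).max() / abs_mean, # (15)
        "peak_index": np.abs(x).max() / rms,           # (16)
        "tolerance_index": np.abs(x).max() / root_amp, # (17)
        "skewness_index": ((x - mean) ** 3).mean() / m2 ** 1.5,    # (18)
        "kurtosis_index": ((x - mean) ** 4).mean() / m2 ** 2 - 3,  # (19)
    }
```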
Features in the frequency domain: The frequency spectrum includes the amplitude spectrum and the power
spectrum; both have been utilised in fault diagnosis. Here, we take the features from
the amplitude spectrum and, by considering the symptom table of Sohre [27], we select the 11 features shown
in Table 3.
The process of computing every amplitude feature includes three steps. First, perform a Fast Fourier
Transform (FFT) of the signal x(t) to obtain the amplitude spectrum. Then, sum all the amplitudes in the
frequency band ranging from 0 to 10 times the rotating frequency to obtain the total amplitude. At last, sum the
amplitudes in each frequency band and divide by the total amplitude; the result is the feature of that
frequency band. All the considerations above are based on a certain frequency resolution determined by the sample
frequency and the sample data length.
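A sketch of these band-amplitude features follows. The band edges follow Table 4; the rotating frequency f0 and sampling rate fs must be supplied, and the half-width `tol` used to pick single-frequency bins (and the set of odd multiples used for feature 17) are our assumptions, not stated in the paper.

```python
import numpy as np

# Frequency bands of Table 4, expressed in multiples of the rotating frequency f0.
BANDS = [(0.0, 0.39), (0.40, 0.49), (0.50, 0.50), (0.51, 0.99), (1.0, 1.0),
         (1.5, 1.5), (2.0, 2.0), (3.0, 3.0), (3.0, 5.0), None, (5.0, 10.0)]
# `None` marks feature 17 (sum over odd multiples of f0), handled separately.

def frequency_features(x, fs, f0, tol=0.1):
    """Eleven amplitude-spectrum features: band amplitude sums divided by the
    total amplitude in [0, 10*f0]."""
    amp = np.abs(np.fft.rfft(x)) / len(x)
    freq = np.fft.rfftfreq(len(x), d=1.0 / fs) / f0    # frequency in multiples of f0
    total = amp[freq <= 10.0].sum()
    feats = []
    for band in BANDS:
        if band is None:                               # odd multiples 1f, 3f, 5f, 7f, 9f
            mask = np.zeros_like(freq, dtype=bool)
            for k in (1, 3, 5, 7, 9):
                mask |= np.abs(freq - k) <= tol
        elif band[0] == band[1]:                       # a single spectral line
            mask = np.abs(freq - band[0]) <= tol
        else:                                          # a frequency band
            mask = (freq >= band[0]) & (freq <= band[1])
        feats.append(amp[mask].sum() / total)
    return feats
```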
To sum up, we get 18 features (including seven features in time domain and 11 features in frequency
domain), and they are listed in Table 4.
4.4. Feature reduction
In order to decrease the correlation between features and to decrease the dimension of the features, PCA is employed
for feature reduction. According to formulas (6)-(8), the eigenvalues Λ of the covariance matrix of the 300 samples
(X_{18×300}) with 18 features and the corresponding eigenvector matrix V_{18×18} can be calculated:

Λ = diag(0.491, 0.255, 0.049, 0.018, 0.012, 0.007, 0.004, 0.002, 0.002, 0.001,
         0.001, 0.001, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000).
For the sake of keeping most of the information of the original samples, we select θ = 0.98. According to formula (9),
the number of reduced features k can be worked out:

k = 6.

Then, the first six eigenvectors are used to construct a new matrix V_6.
At last, according to formula (10), P = V_6^T X, we get the new sample data P_{6×300} with six equivalent
features.
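With the pca_reduce sketch from Section 3, this step would amount to something like the snippet below; the random matrix is only a stand-in for the real 18 × 300 feature matrix, for which θ = 0.98 yielded k = 6.

```python
import numpy as np

# Stand-in for the real 18 x 300 feature matrix (18 features, 300 samples).
X = np.random.default_rng(0).normal(size=(18, 300))
P, V = pca_reduce(X, theta=0.98)   # for the experimental data this gave k = 6
print(P.shape)                      # (k, 300)
```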
Table 4
The selected 18 features

Feature ID   Feature description
1            Peak–peak value in time domain
2            Waveform index in time domain
3            Impulsion index in time domain
4            Peak index in time domain
5            Tolerance index in time domain
6            Skewness index in time domain
7            Kurtosis index in time domain
8            Amplitude sum in frequency band [0, 0.39f] divided by total amplitude
9            Amplitude sum in frequency band [0.40f, 0.49f] divided by total amplitude
10           Amplitude at frequency 0.50f divided by total amplitude
11           Amplitude sum in frequency band [0.51f, 0.99f] divided by total amplitude
12           Amplitude at frequency 1f divided by total amplitude
13           Amplitude at frequency 1.5f divided by total amplitude
14           Amplitude at frequency 2f divided by total amplitude
15           Amplitude at frequency 3f divided by total amplitude
16           Amplitude sum in frequency band [3f, 5f] divided by total amplitude
17           Amplitude sum of odd multiples of f divided by total amplitude
18           Amplitude sum in frequency band [5f, 10f] divided by total amplitude
Table 3
The selected 11 amplitude features

No.                 1        2           3     4           5    6     7    8    9     10     11
Amplitude feature   0-0.39f  0.40-0.49f  0.5f  0.51-0.99f  1f   1.5f  2f   3f   3-5f  Odd f  5-10f

f: rotating frequency of a rotor.
4.5. Decision tree classifier training
The samples are divided into two parts: a training set and a testing set. The training set is used to train the classifier
and the testing set is used to test the validity of the classifier. About 60% of the samples are randomly selected as the
training set, and the remaining 40% of the samples are used as the testing set. Five-fold cross-validation is employed to
evaluate the classification accuracy.
The training process of C4.5 using the samples with continuous-valued attributes is as follows.
(1) The tree starts as a single node representing the training samples.
(2) If the samples are all of the same class, then the node becomes a leaf and is labelled with the class.
(3) Otherwise, the algorithm discretises every attribute (discussed in Section 2.3) to select the optimal
threshold and uses the entropy-based measure called information gain (discussed in Section 2.1) as the
heuristic for selecting the attribute that will best separate the samples into individual classes.
(4) A branch is created for each best discrete interval of the test attribute, and the samples are partitioned
accordingly.
(5) The algorithm uses the same process recursively to form a decision tree for the samples at each partition.
(6) The recursive partitioning stops only when one of the following conditions is true:
(a) All the samples for a given node belong to the same class or
(b) There are no remaining attributes on which the samples may be further partitioned.
(c) There are no samples for the branch test attribute. In this case, a leaf is created with the majority class
in samples.
(7) A pessimistic error pruning method (discussed in Section 2.2) is used to prune the grown tree to improve its
robustness and accuracy.
The parameters of the C4.5 algorithm are set to their default values, i.e. BATCH = 1, PROBTHRESH = 0,
VERBOSITY = 0, TRIALS = 10, WINDOW = 0, INCREMENT = 0, GAINRATIO = 1, SUBSET = 0,
UNSEENS = 1, MINOBJS = 2, CF = 0.25. Readers can refer to Ref. [9] for details of what the
parameters and their values mean.
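For readers who want a readily reproducible approximation of this training step, an entropy-based tree can be fitted to the reduced samples with scikit-learn, as sketched below; note that scikit-learn implements CART-style binary splits with cost-complexity pruning rather than C4.5's multiway splits and error-based pruning, so this is only an analogous sketch, and the matrix P and label vector y are assumed to come from the preceding steps.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# P: (6, 300) reduced feature matrix; y: 300 class labels 1..6 (NORM ... CRAK)
X_samples = P.T                                    # one row per sample
X_train, X_test, y_train, y_test = train_test_split(
    X_samples, y, train_size=0.6, stratify=y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print("5-fold CV accuracy:", cross_val_score(tree, X_samples, y, cv=5).mean())
```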
The trained C4.5 decision tree models before and after PCA processing are shown in Fig. 8a and b.
In Fig. 8, F is the abbreviation of Feature, F′ denotes the equivalent feature (or PC) after feature
reduction and C is the abbreviation of Class. For example, F2 means the 2nd feature (the waveform index),
F′2 means the 2nd equivalent feature (or PC) after feature reduction, which includes information from all 18
features but does not represent any single real feature, and C1 means the 1st class (Normal).
Comparing the two decision tree models, it can be found that the structure of the decision tree changes a
lot. The decision tree model after feature reduction is more compact than the one before feature reduction. The
former has six leaf nodes, uses three test attributes (F′1, F′2, F′3) and its depth is 3, whereas the
latter has 11 leaf nodes, uses seven test attributes (F1, F2, F4, F10, F14, F15, F17) and its depth is 5.
Therefore, feature reduction based on PCA decreases the complexity of the decision tree model and improves its
execution efficiency.
In a decision tree, a path from the root to a leaf can be viewed as a classification rule. From this point of view, a
decision tree represents a set of rules. From Fig. 8a, 11 rules can be obtained and from Fig. 8b, 6 rules can be
acquired. The induced rules can be used as diagnosis knowledge to diagnose rotor faults or predict the running
states of a rotor.
4.6. Back-propagation neural network
In order to prove the efficiency of the C4.5 algorithm, it is compared with a back-propagation neural network (BPNN),
which is widely applied in practice. The details of neural networks and BPNN can be found in Refs. [29,30]. When the
BPNN is employed, some parameters need to be configured.
Number of net layers: Following Hornik's conclusion [31], a three-layer neural network is selected. That means
the BPNN includes one input layer, one hidden layer and one output layer.
Number of net nodes: The number of input nodes is equal to the number of features, and the number of
output nodes is equal to the number of simulated running states. Deciding the number of hidden nodes is more
involved; it may depend on the number of input nodes and output nodes, the transfer function, the properties of the
samples, etc. Ref. [32] gives three optional methods, expressed as follows.
Suppose the number of input nodes is u, the number of output nodes is o, and the number of hidden nodes
is h.
Method 1: h = sqrt(u + o) + a, where a is a constant and a ∈ [1, 10].
Method 2: h = log_2(u).
Method 3: h = 2u + 1.
Because a neural network is a complex non-linear system, it is very hard to give a general formula that
determines the number of hidden nodes directly. Therefore, according to the three methods, we first calculate
the minimum value of h using method 1 and the maximum value of h using method 3. After that we
increase h one by one from the minimum value to the maximum value to find the smallest optimal value. For the
data before feature reduction, the number of input nodes is 18, the number of output nodes is 6 and the
selected number of hidden nodes is 23. For the data after feature reduction, the three numbers are 6, 6 and 12,
respectively.
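A sketch of this hidden-node search using a generic MLP in place of the authors' own Visual Basic BPNN; the cross-validated scoring and the "keep the smallest best size" rule are our assumptions about how the minimum optimum could be picked.

```python
import math
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def choose_hidden_nodes(X, y, n_in, n_out, alpha=1):
    """Scan hidden-layer sizes from method 1's lower bound to method 3's upper
    bound and return the smallest size with the best cross-validated accuracy."""
    h_min = int(math.sqrt(n_in + n_out)) + alpha        # method 1 (a = 1)
    h_max = 2 * n_in + 1                                 # method 3
    best_h, best_acc = h_min, 0.0
    for h in range(h_min, h_max + 1):
        net = MLPClassifier(hidden_layer_sizes=(h,), activation="logistic",
                            max_iter=2000, random_state=0)
        acc = cross_val_score(net, X, y, cv=5).mean()
        if acc > best_acc:                               # keeps the smallest best size
            best_h, best_acc = h, acc
    return best_h, best_acc
```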
Fig. 8. The trained decision tree models: (a) before PCA processing, (b) after PCA processing.
Transfer function: Sigmoid function is selected as the transfer function.
Learning rate: A small value is often selected to guarantee the stability of the training process, but the smaller
the learning rate, the longer the training process. Therefore, the learning rate is adaptively modified according
to the gradient, both to guarantee stability and to improve the training speed.
In order to avoid local minima, a momentum term is introduced to the BPNN. The coefficient of momentum is
set to 0.8.
The samples are divided into a training set and a testing set. The training set is made up of 60% of all the
samples, 4/5 of which is used for training and 1/5 of which is used for validation. The testing set comprises the
other 40% of the samples. Five-fold cross-validation is used to evaluate the classification accuracy.
The samples are normalised before training and testing BPNN and the execution performance of BPNN is
discussed in the next section.
4.7. Diagnosis result
The main code of the C4.5 algorithm comes from the Internet (http://rulequest.com/personal/c4.5r8.tar.gz),
whose original author is J.R. Quinlan. After being edited, the C4.5 algorithm is encapsulated as a COM component to be
called from Visual Basic. The BPNN algorithm is coded by us in Visual Basic. The programs of both
algorithms are run on a Dell Precision 530 (Pentium IV/1.5 GHz/512 MB). The analysis results of the two
classifiers (C4.5 decision tree and BPNN) are compared with each other in Table 5.
As shown in Table 5: (1) The C4.5 decision tree is a good classifier for machinery fault diagnosis. It has a high
diagnosis accuracy of 98.3%, about 2-3 percentage points higher than BPNN. (2) The training time of both C4.5
and BPNN decreases when using the samples after feature reduction. The training time of C4.5 without feature
reduction is about double that after feature reduction, and the training time of BPNN without feature
reduction is over 3 times longer than that after feature reduction. However, the average diagnosis accuracy of
BPNN increases a little when using the samples after feature reduction, and the average diagnosis accuracy of C4.5
remains the same. For the RRUB state, the diagnosis accuracy of C4.5 after feature reduction is higher than
before feature reduction. Perhaps this is because the information from the time-domain features and the frequency-
domain features is fused after PCA processing, and this information is strengthened for the RRUB state;
therefore, higher diagnosis accuracy is achieved after feature reduction. But for the UBAL state, the diagnosis
accuracy of C4.5 decreases after feature reduction compared with that before feature reduction. Perhaps this is
because, after feature reduction based on PCA, a little information is lost by discarding the non-PCs. (3) The
simultaneous state of unbalance and radial rub is a complex condition and is somewhat difficult to diagnose;
therefore, the testing accuracy of the simultaneous state is relatively low compared with those
of NORM, UBAL and RRUB. (4) The training time of C4.5 is far less than that of BPNN; the latter is over
10^3 times longer than the former. Therefore, we can say the execution efficiency of C4.5 is higher than that of
BPNN.
Table 5
Comparison of four diagnosis results

Classifier                    Accuracy of testing set (%)          Average testing  Average validation  Average training  Average training
                              NORM  UBAL  OILW  RRUB  UBRR  CRAK   accuracy (%)     accuracy (%)        accuracy (%)      time (s)
C4.5 (feature reduction)      100   95    100   100   95    100    98.3             97.8                99.1              0.033
C4.5 (no feature reduction)   100   100   100   95    95    100    98.3             97.8                98.3              0.071
BPNN (feature reduction)      100   100   95    100   95    85     95.8             98.3                98.9              97.732
BPNN (no feature reduction)   100   100   85    100   95    90     95.0             97.8                98.3              304.437
5. Conclusions
Data mining technology has been introduced to the rotating machinery fault diagnosis field, and the C4.5 decision
tree has been proposed for the diagnosis of rotor faults. It does not require additional information (e.g.
domain knowledge or prior knowledge of the sample distributions) besides that already contained in the
training data.
Six classical rotor running states, including normal, unbalance, rotor radial rub, oil whirl, shaft crack and a
simultaneous state of radial rub with unbalance, were simulated on the Bently Rotor Kit, and the sample data were
used for fault diagnosis tests. The experiment proves that C4.5 is a good classifier and can
diagnose rotor faults accurately.
PCA is an unsupervised feature reduction method. After PCA processing, redundant features can
be removed effectively. In this work, 18 features are reduced to six efficient features, a reduction of about
66.7%. Although most of the features are removed, the average diagnosis accuracy does not decrease; for
some states, the diagnosis accuracy even rises a little because of the information fusion performed by PCA.
Compared with BPNN, C4.5 extracts knowledge quickly from the training samples, while training BPNN
takes a large quantity of time and thousands of iterations. Also, the accuracy of C4.5 is comparable or even
superior to that of the neural network.
Although the C4.5 algorithm is a good classifier and is widely applied in many fields, it has some limitations.
(1) It is hard to predict the value of a continuous attribute. Since the C4.5 algorithm is designed to classify or predict
categorical decision attributes, it is less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
(2) Empty branches and insignificant branches. For some classification problems with many classes and a
relatively small number of training examples, there may be some empty branches (the number of samples
in the node is zero or close to zero) in the trees constructed by C4.5. For discrete attributes, the
number of branches equals the number of possible values of the selected attribute, but some of these
branches contribute nothing to classification. These insignificant branches not only reduce the usability of
C4.5 but also bring on the problem of over-fitting.
(3) Poor scalability. Since the C4.5 algorithm requires the training data to be held in memory, it does not scale
well to learning from large amounts of sample data.
All the disadvantages of the C4.5 algorithm above need to be studied further.
Acknowledgements
Support for this work from the National 10.5 Science and Technology Key Project (grant number:
2001BA204B05-KHKZ0009) and National Natural Science Foundation of China (no. 50335030) is gratefully
acknowledged. Thanks go to Wu Xing for directing the experiment.
References
[1] S.A. Patel, A.K. Kamrani, Intelligent decision support system for diagnosis and maintenance of automated systems, Computers & Industrial Engineering 30 (2) (1996) 297-319.
[2] B.S. Yang, D.S. Lim, C.C. Tan, VIBEX: an expert system for vibration fault diagnosis of rotating machinery using decision tree and decision table, Expert Systems with Applications 28 (2005) 735-742.
[3] C.Z. Chen, C.T. Mo, A method for intelligent fault diagnosis of rotating machinery, Digital Signal Processing 14 (2004) 203-217.
[4] D. Gayme, S. Menon, C. Ball, et al., Fault detection and diagnosis in turbine engines using fuzzy logic, in: 22nd International Conference of the North American Fuzzy Information Processing Society, 2003, pp. 341-346.
[5] Y.G. Wang, B. Liu, Z.B. Guo, et al., Application of rough set neural network in fault diagnosing of test-launching control system of missiles, in: Proceedings of the Fifth World Congress on Intelligent Control and Automation, Hangzhou, PR China, 2004, pp. 1790-1792.
[6] X.J. Xiang, J.F. Shan, F.W. Fuchs, The study of transformer failure diagnose expert system based on rough set theory, in: Power Electronics and Motion Control Conference, vol. 2, IPEMC 2004, The Fourth International, pp. 533-536.
[7] E.B. Hunt, J. Marin, P.J. Stone, Experiments in Induction, Academic Press, New York, 1966.
[8] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81-106.
[9] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.
[10] L. Breiman, J. Friedman, R. Olshen, et al., Classification and Regression Trees, Wadsworth International Group, Monterey, CA, 1984.
[11] M. Mehta, R. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, EDBT '96, Avignon, France, 1996.
[12] J. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in: Proceedings of the 22nd International Conference on Very Large Data Bases, Mumbai, India, 1996.
[13] J. Gehrke, V. Ganti, R. Ramakrishnan, et al., BOAT: optimistic decision tree construction, in: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, Philadelphia, Pennsylvania, 1999.
[14] X.B. Li, Multivariate decision trees for data mining, Ph.D. Thesis, Darla Moore School of Business, University of South Carolina, 1999.
[15] D. Jiang, S.T. Huang, W.P. Lei, et al., Study of data mining based machinery fault diagnosis, in: Proceedings of the First International Conference on Machine Learning and Cybernetics, Beijing, PR China, 2002, pp. 536-539.
[16] P. Yang, S.S. Liu, Fault diagnosis for boilers in thermal power plant by data mining, in: The Eighth International Conference on Control, Automation, Robotics and Vision, Kunming, China, 2004, pp. 2176-2180.
[17] A.P. Chen, C.C. Lin, Fuzzy approaches for fault diagnosis of transformers, Fuzzy Sets and Systems 118 (2001) 139-151.
[18] X. Xu, X.N. Wang, An adaptive network intrusion detection method based on PCA and support vector machines, Lecture Notes in Artificial Intelligence 3584 (2005) 696-703.
[19] T. Niblett, Constructing decision trees in noisy domains, in: Proceedings of the Second European Working Session on Learning, Sigma Press, Bled, 1987, pp. 67-78.
[20] L. Breiman, J.H. Friedman, R.A. Olshen, et al., Classification and Regression Trees, Wadsworth International Group, Belmont, CA, 1984.
[21] J. Mingers, An empirical comparison of pruning methods for decision tree induction, Machine Learning 4 (1989) 227-243.
[22] J.R. Quinlan, Simplifying decision trees, International Journal of Man-Machine Studies 27 (1987) 221-234.
[23] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[24] K.H. Nam, S.N. Lee, Diagnosis of rotating machines by utilizing a backpropagation neural net, IEEE 0-7803-0582-5/92 (1992) 1064-1067.
[25] B. Samanta, K.R. Al-Balushi, Artificial neural network based fault diagnostics of rolling element bearings using time-domain features, Mechanical Systems and Signal Processing 17 (2) (2003) 317-328.
[26] R.Q. Li, J. Chen, X. Wu, et al., Fault diagnosis of rotating machinery based on SVD, FCM and RST, International Journal of Advanced Manufacturing Technology (2005).
[27] J. Sawyer, K. Hallenberg, Sawyer's Turbomachinery Maintenance Handbooks, Turbomachinery International Publications, Norwalk, USA, 1980.
[28] J.W. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Los Altos, CA, 2001.
[29] Z.Z. Shi, Knowledge Discovery, Tsinghua University Press, 2002.
[30] M.T. Hagan, H.B. Demuth, M.H. Beale, Neural Network Design, PWS Publishing Company, 1996.
[31] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks 4 (2) (1991) 251-257.
[32] L.M. Zhang, Models and Applications of Artificial Neural Networks, Fudan University Press, 1992.
[33] Operation Manual of Bently Rotor Kit Model RK4, Bently Nevada, 2002.