Department of Mathematics
Student: J. Derikx (s4052552)
Supervisors: dr. E. Cator, drs. J. Heres
16 January 2017
Abstract
Alliander N.V. operates the largest electricity distribution network in the Netherlands. The grid is mostly underground and therefore not easily accessible. The goal of this thesis is to approximate failure rates of the underground assets. If successful, the obtained information can be used to further improve the policy of Alliander.
This thesis starts with a detailed introduction of the setting in which the general problem is situated. Afterwards the reader is provided with the necessary survival analysis theory, the overarching framework that makes the problem mathematically precise.
The main goal is tackled by two approaches, originating from two different fields of mathematics. The first method is the Cox model, a proportional hazards model that comes directly from survival analysis and is widely accepted and respected in this context. In this thesis the baseline hazard function corresponding to the Cox model is derived in an alternative way, tailored to the situation at hand. A proof is included which guarantees that this addition gives the sought-after result.
The second approach comes from the growing field of machine learning. Random forest is a versatile algorithm, used here to predict the failure rates based on the details that are known about the assets.
The thesis concludes with the results of both methods. For Alliander, these results are immediately valuable. Beyond this application, more general findings are discussed in a comparison between the two approaches.
Acknowledgements
Writing a thesis to conclude the master's programme is a struggle for many students. Unfortunately I was no exception. Although I will never argue that my project has great potential, I do think it is fair to say it is a personal achievement. Most of the obstacles I encountered had nothing to do with the subject of my thesis and are therefore not mentioned in any of the chapters. These pages only tell the reader something about electricity distribution networks, but the thesis as a whole stands as a symbol for much more than that. Knowing this, I realize that writing this thesis by myself would simply have been impossible, and I am very thankful for all the help I received.
First of all, I want to thank Jacco Heres. As a supervisor he was closely involved with every little step of this thesis and he never tired of helping me out. It didn't matter why I got stuck; he always looked for a way to make things work. Eric Cator helped me in a whole different way. Besides the statistical and mathematical knowledge he shared, it was his never-ending positivity that made the difference. I truly believe he was the only one who could pull me through this.
Lastly, I would like to express my appreciation towards all
friends and relatives. I am very grateful for all the support,
from the good conversations to correcting my embarrassing
spelling mistakes.
Contents

1 Introduction
  1.1 Alliander
  1.2 Reliability
  1.3 Content of this thesis

2 Survival analysis
  2.1 Introduction
  2.2 Censoring
  2.3 Survival models
  2.4 Framework of data
  2.5 Objective test

5 Implementation
  5.1 Description of the data
  5.2 Application of the Cox model
    5.2.1 Glmnet
    5.2.2 Baseline hazard
  5.3 Application of Random forest
  5.4 Objective test

7 Further research

8 Bibliography
Chapter 1
Introduction
The main goal of this thesis is general and widely applicable. However, the reason that this problem is considered in the first place comes from Alliander. To understand the general problem better, this thesis starts with a full description of the origin and environment of the problem.
1.1 Alliander
Alliander is a network company responsible for the distribution of energy in
the Netherlands. Alliander’s role is to operate and maintain the network for
electricity and gas to ensure the connection between energy suppliers and
consumers.
Alliander is in fact a group of companies, including Liander, Liandon and Allego. Each company has a different focus, but all have energy network operation as their underlying activity. Distribution System Operator (DSO) Liander keeps the energy distribution infrastructure in good condition to ensure reliable distribution of gas and electricity to millions of consumers and businesses every day. Liandon develops sustainable technologies and intelligent energy infrastructures, concentrating primarily on high voltage components. Allego develops customised charging solutions and infrastructure for municipalities, businesses and transport companies. The latter is only one example of the multiple companies which explore and develop new business activities that fit Alliander's strategy and role in the transition towards a more renewable energy system.
The vast majority of the energy Alliander distributes flows in the traditional way through the network.

Figure 1.1: Service area of Alliander since January 2016, source: [1].

Power is generated at, for instance, energy plants
and wind farms and arrives at Alliander through the international and na-
tional energy networks of the Transmission System Operators (TSO) like
TenneT and Gasunie. Alliander’s network is the last step in the distribu-
tion towards the clients. However, times are changing and due to techno-
logical developments the network is not a straightforward one-way system
anymore.
Customers enjoy a lot of freedom and are able to make their own energy
choices. They can choose their own supplier and service providers from a
wide range of choices. In addition, a growing number of consumers and
businesses are feeding their self-generated energy into the network, resulting
in reversed flows. This distributed generation relies quite often on natural
sources such as wind and solar energy, which require more flexibility from
the network. On top of that there is a higher demand, caused by energy needs previously met not by electricity but by fossil fuels: for instance, the increasing number of electric vehicle charging points and the transition from the original gas installations towards electrical heating. All these changes make the energy supply chain more dynamic [2]. It is going to be a big challenge to ensure that all energy is distributed safely, reliably and efficiently despite all these changes in an ageing network [3].

Figure 1.2: A modern energy distribution network.
The Netherlands is divided into different regions and each distribution system operator is responsible for its own area. The fee operators can charge their customers is based on their performance and reliability. According to the annual report [1], Alliander has 7,240 employees to take care of 5.7 million customer connections. Figure 1.3 shows that all these connections together cover a substantial part of the Netherlands; in fact Alliander has a share of roughly 35%. A complete and detailed description of Dutch distribution networks can be found in [4].
1.2 Reliability
Nowadays life without electricity is unimaginable. Due to their growing dependence on the electricity supply, customers have high expectations regarding the reliability of the electricity network. Simply put: customers demand safe and continuous access to electricity, 24 hours a day, seven days a week.
Figure 1.3: Overview of Distribution System Operators in the Netherlands for electricity [5].

Alliander has managed to keep the number of outage minutes to a minimum. On average a customer has no access to electricity for only 22 minutes each year.
This makes Alliander’s network one of the best worldwide. But as explained
before it is of great interest for Alliander to reduce this number even further
in order to become more reliable. Besides, customers want to pay as little as
possible for their reliable energy supply. That is why Alliander works daily
to continue improving their operational effectiveness and efficiency.
Maintenance and replacement are Alliander's main tools to influence both reliability and affordability. Alliander's assets are worth 7.7 billion euros in total, and Alliander spends 700 million euros per year to maintain them. Quite a big share of Alliander's capital is underground (see figure 1.4). Maintaining overground assets like transformers is relatively straightforward because they can be checked visually on a regular basis. However, most of the assets are underground and therefore less accessible. Although placing assets underground is the main reason why Alliander has one of the most reliable networks in the world, these assets are responsible for more than 80 percent of the total outage time (see figure 1.6). A substantial part of these underground assets has exceeded its life expectancy according to the manufacturer (typically fifty years, as shown in figure 1.5).

Figure 1.4: Book value per component.

The expiration date is only an estimate, and from practice it is known that assets can live much longer, mainly because the workload of many assets is much lower than their capacity. Replacing all assets which have exceeded their expiration date is expensive and can even increase the hazard rate, since brand new assets suffer from fabrication errors.
Figure 1.5: The distribution network is ageing. In the near future a lot of medium voltage cables will exceed their expected lifetime.
The few facts mentioned above illustrate the complexity of maintaining one
of the largest electricity distribution grids in the Netherlands. A smart main-
tenance policy is therefore of great interest for Alliander.
There are a few reasonable actions. Replacing underground assets before they fail will of course prevent outages. This is rather expensive, but can be practical when combined with other underground construction works. More recent possibilities to prevent (long) outages in the underground network are Smart Cable Guards and Remote Switches. In all these cases it is essential to know more about which assets are likely to fail. This project is an attempt to assist Alliander in maintaining their grid by providing failure rate estimations.
Figure 1.6: The total number of outage minutes per year, split per component, that customers of Alliander experienced on average over the period 2008-2015.
The main goal of this thesis is to make predictions of future failures from
the obtained data regarding failures in underground assets and to draw con-
clusions on the most important drivers for failures. This enables Alliander
to transition from a reactive to a proactive maintenance policy. Recently
Alliander celebrated its 100th anniversary and from all these years data has
been stored. The challenge will be to effectively use as much data as possible
in order to optimally estimate the probabilities of failure. Some examples of
covariates that might be taken into account are: construction, soil, length (of cables) and several aggregates of the yearly load profile. For older assets less data is available, which may make the calculations less accurate. This thesis is a continuation of the research performed by Alliander from 2014 onwards [6]. Most of the data gathering and preprocessing was done in this project.
Multiple methods will be used for the survival analysis of the assets. One can think of proportional hazard models and machine learning algorithms. From both families one instance will be chosen. These algorithms will be adapted to this specific case to use their predictive power optimally. In both cases this results in estimates of the failure rates of the assets.
To compare and review the used algorithms objectively, a test will be developed. A theoretical test, calculating the estimation error, might seem the obvious choice. However, this project will examine the effect of the estimations on feasible policies: the models will be ranked based on the effect their predictions have within those policies. Hopefully this will be a more realistic measure.
To give a more concrete description of the project, the preceding story has
been translated into a single question.
Main question: What are the current and future failure rates of underground
cables and joints, given the available data?
This question can be split into several tasks.
• Adapt and optimize various algorithms to calculate the current and
future failure rates.
• Develop a test to objectively compare the algorithms.
• Realize code in R to compute the failure rates.
After having introduced the setting of the project in the current chapter, this thesis continues with the necessary theory of survival analysis in chapter 2, where some notation will also be fixed. Next, two predictive algorithms of a different nature will be described in chapters 3 and 4. These methods will be shaped for this specific case in chapter 5, and the results will be examined in chapter 6. This thesis ends with recommendations for further research in chapter 7.
The R language is widely used among statisticians and data miners for devel-
oping statistical software and data analysis and is therefore also the language
of choice for this thesis. Suitable packages for both proportional hazard mod-
els [7] and survival tree methods [8] are available in R.
This research project has been performed as a final examination for the M.Sc. Mathematics at the Radboud University in Nijmegen, in cooperation with Alliander.
Chapter 2
Survival analysis
The branch of statistics that is suitable for the main question is called survival
analysis. To understand the following short introduction to this subject, some basic notions of probability theory and statistics are helpful. This necessary background information can be obtained from [9].
2.1 Introduction
Survival analysis studies the expected duration of time until an event happens; in other words, the random variable of most interest is the time-to-event. In biomedical studies this event is often death. Because of this common application the field is termed survival analysis. Traditionally only a single
event occurs for each subject, after which the organism or object is dead or
broken. Survival analysis methods can be applied to a wide range of data
other than biomedical survival data. Other time-to-event data can include:
length of stay in a hospital, duration of a strike, time-to-employment and of
course time-to-failure for electrical components. As always in statistics the
goal is to draw conclusions from obtained data.
Usually a random variable X is characterized by its cumulative distribution
function F (x) = P(X ≤ x). It can be represented by either the probability
density function (continuous case) or the probability mass function (discrete
case). In both cases the function is denoted by f . In the context of time-to-
event data, there are other functions which describe the situation in a more
natural and intuitive way. Still, they are in one-to-one correspondence with
the distribution function and therefore characterize the distribution of the
random variable. They are especially important in the analysis of time-to-
event data.
Definition 1. The survival function S of X is defined as

S(t) = P(X > t) = 1 - F(t), (2.1)

where t is some time instance and X is a random variable denoting the time of death. That is, the survival function is the probability that the time of death is later than some specified time t. The survival function must be non-increasing: S(u) ≤ S(t) if u ≥ t. This reflects the notion that survival to a later age is only possible if all younger ages are attained. From now on time is represented by the non-negative real numbers. In other words t ∈ [0, ∞) and hence F(0) = 0 and S(0) = 1.
Definition 2. The hazard function λ of a finite discrete random variable X is defined as

λ(t) = P(X = t | X ≥ t). (2.2)
The hazard function at t reflects the danger (or hazard) of getting killed at
time t, having survived till (just before) time t. In the context of engineering
the hazard rate is commonly called the failure rate, denoting the probability
density of failure at time t given the device functioned well until (just before)
time t. Extensions to infinite discrete and continuous random variables exist,
but for this project a finite discrete random variable suffices. To rule out eter-
nal life, there must be at least one t for which λ(t) = 1 holds. Furthermore,
the hazard function is non-negative by definition. Note that λ is not other-
wise constrained; it may be increasing, decreasing or non-monotonic.
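To make Definition 2 concrete, the discrete hazard can be computed directly from the probability mass function. The following sketch is in Python (the computations in this thesis itself are done in R) and uses a small hypothetical distribution:

```python
# Discrete hazard sketch: a hypothetical lifetime X with
# P(X = 1) = 0.2, P(X = 2) = 0.3, P(X = 3) = 0.5.
pmf = {1: 0.2, 2: 0.3, 3: 0.5}

def survival(t):
    # S(t) = P(X > t)
    return sum(p for x, p in pmf.items() if x > t)

def hazard(t):
    # lambda(t) = P(X = t | X >= t) = P(X = t) / P(X >= t)
    return pmf.get(t, 0.0) / (pmf.get(t, 0.0) + survival(t))

print([round(hazard(t), 3) for t in sorted(pmf)])  # [0.2, 0.375, 1.0]
```

Note that the hazard equals 1 at the last support point, which is exactly the condition that rules out eternal life.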
For now, as in the definition of the hazard function, only the discrete case is treated; the continuous case will follow soon. The next connection between the survival and hazard function only holds for the finite discrete case. Now denote by t_1 < t_2 < ... < t_n the support of the discrete distribution of X.
Then S can be recovered from λ via

S(t_i) = \prod_{j=1}^{i} (1 - λ(t_j)). (2.3)
Note that (2.2) only applies to the discrete case, and therefore relation (2.3) between the survival and hazard function only holds in that case. The relation arises from the fact that to reach time t_i one has to survive all preceding points of time. In the continuous case, the following definition is used:
Definition 3. The hazard function λ of a continuous random variable X is defined as

λ(t) = -\frac{d}{dt} \log S(t). (2.4)
Although in this case λ is not defined as a conditional probability, it still has a similar interpretation as the discrete version. Because S(t) = P(X > t), one could argue that

λ(t) = \lim_{\Delta t \to 0} \frac{P(t \le X < t + \Delta t \mid X \ge t)}{\Delta t}
     = \lim_{\Delta t \to 0} \frac{P(t \le X < t + \Delta t)}{\Delta t \cdot P(X \ge t)}
     = \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t \cdot S(t)}
     = \frac{f(t)}{S(t)}
     = -\frac{d}{dt} \log S(t). (2.6)
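As a quick numerical check of Definition 3 (a Python sketch; the exponential lifetime with rate 2 is a made-up example): for S(t) = e^{-2t} the hazard should be the constant 2.

```python
import math

rate = 2.0                      # hypothetical constant failure rate

def S(t):
    # survival function of an exponential lifetime
    return math.exp(-rate * t)

def hazard(t, dt=1e-6):
    # numerical version of lambda(t) = -d/dt log S(t)
    return -(math.log(S(t + dt)) - math.log(S(t))) / dt

print(round(hazard(0.3), 6))    # 2.0: the exponential has a constant hazard
```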
In the continuous case the connection between the survival and hazard function is even more clear. Using the cumulative hazard function

\Lambda(t) = \int_0^t λ(s)\,ds, (2.7)

gives

S(t) = \exp(-\Lambda(t)). (2.8)
A well-known example of a non-monotonic hazard rate is the bathtub curve: the failure rate first decreases, then stays roughly constant, and finally increases again. A sketch of this curve is shown in figure 2.1. It models the properties of some mechanical systems: most failures occur either soon after the start of operation, or much later, as the system ages.
Figure 2.1: The bathtub curve (failure rate against time): an early phase of technical failures, a middle phase of random failures and a late phase of wear out failures.
2.2 Censoring
In reliability engineering there is generally a relatively short observation window for any study, during which the event is observed to occur or not. For instance, for this thesis, failures have been registered during a period of 10 years, although components may live for more than 100 years. Adjustments have to be made to account for potential biases. If a subject does not have
an event during the observation time, they are called censored. The subject
is censored in the sense that nothing is observed or known about that subject
before or after the time of censoring. A censored subject may or may not
have an event outside the observation time. Censoring is an example of a missing data problem, which is common in survival analysis. There are different kinds of censoring, several of which will now be described.
Ideally, both the birth and death dates of a subject are known, in which case
the lifetime is known.
[Diagram: a lifetime whose birth and death both fall inside the observation window, so the lifetime is fully observed.]
It can very well happen that researchers lose track of certain subjects or that
some subjects are still alive when the study ends. The exact time of death is
then unknown for these particular subjects. This is called right censoring and
occurs quite frequently in reliability engineering, as knowing future hazard
rates is most interesting when the bulk of a population of subjects is still in
operation.
[Diagrams: two right-censored lifetimes: one subject is lost during the observation window, another is still alive when the window ends.]
It may also happen that subjects are not observed or registered at all: this is called truncation. Left censoring is different from truncation, since for a left censored subject the existence is known. Left truncation is also common in reliability engineering.
Luckily the amount of censored data is limited because the placement year
of the majority of Alliander’s current assets is known. Right censoring and
left truncation still have to be taken into account.
2.3 Survival models
Besides information about the birth and death of an object further mea-
surements might be available that may be associated with the lifetime. For
instance in Alliander’s case, this can be geographical data or information
about the workload of an asset. These measures are called covariates (also
referred to as explanatory variables and predictors). Besides using lifetime
data, survival models can incorporate these explanatory variables in their
predictions. This gives a more accurate prediction of the time that passes
before some event occurs. In this section some survival models are briefly
introduced.
Proportional hazard models

A first type of survival model assumes that the hazard of an individual i at time t has the form

λ(t, i) = λ_0(t) · c(βX_i). (2.9)

Here c : R → R_{>0} can be any function, as long as the integral of λ(t, i) over [0, ∞) is infinite. This model is called the multiplicative hazard model or the proportional hazard model, as the hazard for an individual is the product of a baseline hazard λ_0, with only the time as input, and a function c, with the background information as input. So the unique effect of a unit increase in a covariate is multiplicative with respect to the hazard rate. Separating the covariates from the time-dependent part enables a separate analysis.
This type of model is similar to linear regression and logistic regression.
Specifically, these methods assume that a single line, curve, plane, or surface
is sufficient to separate groups (alive, dead) or to estimate a quantitative
response (survival time).
Predictive Trees
Each branch in a survival tree indicates a split on the value of a variable. Walking through the tree from the root narrows down the group of subjects. Be aware that overfitting comes into play quickly when narrowing down too much. The general idea is that predictions can be made based on the change of ratios within the group due to a split, before the danger of overfitting is present. Instead of drawing conclusions from a single survival tree, a good alternative is to build many survival trees, each constructed using a sample of the data, and average the trees to predict survival. This is the method underlying survival random forest models.
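The idea of averaging many trees fitted on resampled data can be sketched generically in a few lines (Python rather than the R used in this thesis). The "tree" below is deliberately a trivial stub that predicts the mean of its bootstrap sample; a real survival tree would split on covariate values. All numbers are made up.

```python
import random

random.seed(0)

def fit_stub_tree(sample):
    # stand-in for a fitted tree: a constant predictor
    mean = sum(sample) / len(sample)
    return lambda: mean

def bagged_prediction(data, n_trees=200):
    # fit one "tree" per bootstrap sample and average the predictions
    preds = []
    for _ in range(n_trees):
        boot = [random.choice(data) for _ in data]  # sample with replacement
        preds.append(fit_stub_tree(boot)())
    return sum(preds) / len(preds)

lifetimes = [6, 7, 10, 13, 16, 22, 23]              # hypothetical failure ages
print(round(bagged_prediction(lifetimes), 1))       # averages out near the sample mean
```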
2.4 Framework of data

For each individual i, a tuple of survival data

(T_i, S_i, ∆_i, X_i) (2.10)

is available. Here T_i is the age at which i enters the study, which might be greater than zero. S_i is the age at which i leaves the study. Leaving the study could happen because the individual died, in which case ∆_i = 1. If, for any other reason, the subject drops out, ∆_i = 0 and the data is censored. X_i = (X_{i,1}, ..., X_{i,d}) ∈ R^d is a vector of potential predictors referred to as the covariates. This data is often visualised in the schematic way shown in figure 2.5.
[Figure 2.5: schematic visualisation of the lifelines of the subjects within the observation window.]
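A minimal way to represent the tuples of (2.10) in code (a Python sketch with made-up values; the thesis itself works in R):

```python
from collections import namedtuple

# One record per subject: entry age T, exit age S, event indicator delta
# (1 = death/failure observed, 0 = censored) and covariate vector X.
Record = namedtuple("Record", ["T", "S", "delta", "X"])

data = [
    Record(T=0.0,  S=41.5, delta=1, X=(1.0, 0.35)),  # failure observed at age 41.5
    Record(T=12.0, S=55.0, delta=0, X=(0.0, 1.20)),  # enters late, censored at 55
]

def at_risk(record, t):
    # a subject is "at risk" at age t if it is under observation then
    return record.T <= t <= record.S

print([at_risk(r, 30.0) for r in data])  # [True, True]
```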
2.5 Objective test

As mentioned above, during a period of almost ten years failures have been registered and coupled to specific assets. This data will be used to train and test the different models. In order to test the models in an honest and realistic way, imagine how the final result would be used. Given a good model, Alliander would run it and predict the hazard rates of the current assets. A ranked list of assets would then be used while maintaining the network. With this in mind the following concept has been developed (see also figure 2.6).
Suppose survival data has been generated during period [T, S]. Imagine the
situation for Alliander at time t ∈ [T, S]. Alliander would use period [T, t]
to train its model, and then predict the hazard of the assets which are in
use at time t. This is exactly what the test will do, with the advantage that
the failures in period (t, S] can be used to validate the different models. So
the models will be trained with data that was already available at time t.
This means that only failures registered in the period [T, t] will be taken into
account. The model will then be used to predict the hazard rates of each
asset that is in use at time t.
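The split described above can be sketched as follows (a Python sketch with made-up records; `start` and `end` play the roles of T_i and S_i):

```python
# Objective-test split at evaluation time t: failures in [T, t] are
# training material, assets still in use at t are scored, and failures
# in (t, S] validate the predictions. The records are made up.
records = [
    {"id": "a", "start": 0, "end": 4,  "failed": True},
    {"id": "b", "start": 0, "end": 9,  "failed": True},
    {"id": "c", "start": 2, "end": 10, "failed": False},
]
T, t, S = 0, 5, 10

train      = [r["id"] for r in records if r["failed"] and r["end"] <= t]
in_use     = [r["id"] for r in records if r["start"] <= t < r["end"]]
validation = [r["id"] for r in records if r["failed"] and t < r["end"] <= S]

print(train, in_use, validation)  # ['a'] ['b', 'c'] ['b']
```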
The dashed line goes up every time it passes a failed asset. After the last failed asset the dashed line reaches 100 percent. The final measure is the area under the dashed line, normalised to compensate for different sizes of the test set.
Figure 2.6: Visualization of how the data is split for the objective test: the interval [T, t] is used for training and (t, S] for testing.
[Figure: example of the resulting curve; the vertical axis shows the cumulative percentage of detected failures (0-100) and the horizontal axis the ranked assets i1, ..., i10.]
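The measure just described can be computed in a few lines (a Python sketch; the ranking and failure flags are made up, and normalising by the list length is one possible choice):

```python
# Ranking measure sketch: assets are ordered by predicted hazard, most
# dangerous first; a flag is 1 if the asset actually failed in the test
# period. The curve records the fraction of failures found so far, and
# the measure is the normalised area under it.
def ranking_area(failed_flags):
    total = sum(failed_flags)
    found, area = 0, 0.0
    for flag in failed_flags:
        found += flag
        area += found / total        # curve height after this asset
    return area / len(failed_flags)  # normalise by list length

perfect = [1, 1, 0, 0, 0]            # both failures ranked on top
mediocre = [0, 1, 0, 0, 1]
print(ranking_area(perfect), ranking_area(mediocre))  # 0.9 0.5
```

A better ranking pushes the failures towards the front of the list and therefore yields a larger area.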
Chapter 3

The Cox model
In statistics, survival models relate the time that passes before some event occurs to one or more relevant covariates. In proportional hazards models, a class of survival models, the main assumption is that the covariates and the hazard rate are multiplicatively related. To justify this assumption, one can think of the effect that the choice of conductor material has on the failure rate of an electricity cable: for instance, cables made of aluminium may have a doubled hazard rate compared to copper cables. A typical medical example would be the effect of a treatment on the duration of an illness; in this context age and gender are often also considered.
Like other survival models, proportional hazards models consist of two parts: the underlying hazard function and a part that depends on the effect parameters or covariates. The baseline hazard function, often denoted by λ_0, describes how the hazard rate changes over time. As mentioned earlier, the covariates are assumed to be multiplicatively related to the hazard, which gives us the following equation:

λ(t, i) = λ_0(t) · c(βX_i). (3.1)
The second part is responsible for the involvement of the explanatory variables. The function c : R → R_{>0} can be any function, as long as the integral over t ∈ [0, ∞) of λ(t, i) stays infinite for each asset i; in other words, λ(t, i) has to satisfy the requirements of a hazard function. Most of the time c is restricted to some parametric form; for the Cox model, c is the exponential function.
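The proportional form can be illustrated with the aluminium/copper example from above. In this Python sketch c = exp, β = log 2 encodes the (hypothetical) doubled hazard of aluminium, and the baseline hazard λ_0 is made up:

```python
import math

beta = math.log(2)               # hypothetical effect: doubled hazard

def baseline(t):
    # made-up increasing baseline hazard lambda0(t)
    return 0.01 + 0.001 * t

def hazard(t, x):
    # lambda(t, i) = lambda0(t) * exp(beta * x_i)
    return baseline(t) * math.exp(beta * x)

# x = 1 for aluminium, x = 0 for copper: the ratio is constant in t,
# which is exactly what "proportional hazards" means.
print(round(hazard(30, 1) / hazard(30, 0), 6))  # 2.0
```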
Fitting the Cox model to given data essentially means determining the opti-
mal λ0 (t) and β. Once these parameters are obtained, predictions could be
made to support future decision making. So, obviously, it is desirable that
the estimated values of the parameters λ0 (t) and β give the most realistic
predictions. Numerous influences affect the accuracy of the final values of the
parameters. The real mechanisms behind the failure of assets are much more
complex than the proposed model, not all relevant data is available and the
data quality is far from ideal. The commonly used technique in situations
like this is to optimize the likelihood of the parameters based on the data.
To write down the likelihood of the data explicitly, the assumptions of the
chosen model are used and therefore the result will be a function of λ0 (t)
and β. The most likely λ0 (t) and β are those for which this probability is
maximal.
Sir David Cox observed that if the proportional hazards assumption holds
(or, is assumed to hold), then it is possible to estimate the effect parame-
ter(s) without any consideration of the hazard function [10]. This approach
to survival data is called the application of the Cox proportional hazards
model. Following this observation, this thesis starts with the optimization of β. During this process λ_0(t) is allowed to be an arbitrary function; in other words, the method has to work for each baseline hazard function. This is a severe requirement, and arguably unnecessarily strong, in the sense that an assumption of some smoothness would be reasonable.
The data consists of tuples

(T_i, S_i, ∆_i, X_i), (3.3)

for each i in the finite set I of observed objects. Object i is observed during the interval [T_i, S_i], ∆_i denotes whether the event has occurred or not and X_i is the vector of covariates. If ∆_i = 1 the event has occurred and S_i is the event time; otherwise ∆_i = 0 and S_i is the censoring time. For the time being, ties will be ignored to prevent a too complicated expression (see section 3.3 for an inclusion of ties in the model), so each S_i is unique for every i with ∆_i = 1. Nevertheless, in general more objects are alive at time S_i. The set of these objects under observation at time S_i will be referred to as the risk set; in mathematical notation this is written as

R_i = \{ j ∈ I \mid T_j ≤ S_i ≤ S_j \}. (3.4)

The probability that it is precisely object i that fails at time S_i, given that exactly one of the objects in the risk set fails at that time, is

L_i(β) = \frac{λ(S_i, i)}{\sum_{j ∈ R_i} λ(S_i, j)} = \frac{λ_0(S_i) e^{βX_i}}{\sum_{j ∈ R_i} λ_0(S_i) e^{βX_j}} = \frac{e^{βX_i}}{\sum_{j ∈ R_i} e^{βX_j}}. (3.5)

Observe that the factors of λ_0(t) cancel out. This immediately makes the
method independent of the baseline hazard function. Assuming that the
subjects are statistically independent, the joint probability of all realized
events conditioned upon the existence of events at those times is the partial
likelihood
likelihood

L(β) = \prod_{i \mid ∆_i = 1} L_i(β) = \prod_{i \mid ∆_i = 1} \frac{e^{βX_i}}{\sum_{j ∈ R_i} e^{βX_j}}. (3.6)
This function can be maximized over β by calculating the first and second
derivative and using the Newton-Raphson algorithm. This algorithm is de-
scribed by Kendall Atkinson [11].
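The maximisation of (3.6) can be sketched for a single covariate (Python rather than R; the five (time, event, covariate) records are made up, everyone enters at age 0 and there are no ties):

```python
import math

# Newton-Raphson maximisation of the partial likelihood (3.6),
# one covariate, no ties, made-up data: (event time, event flag, x).
data = [(2, 1, 1.0), (3, 1, 0.0), (5, 0, 1.0), (7, 1, 0.0), (9, 1, 1.0)]

def score_and_hessian(beta):
    d1 = d2 = 0.0
    for t_i, event, x_i in data:
        if not event:
            continue
        risk = [x for t, _, x in data if t >= t_i]   # covariates in the risk set
        w = [math.exp(beta * x) for x in risk]
        m1 = sum(x * wx for x, wx in zip(risk, w)) / sum(w)      # weighted mean
        m2 = sum(x * x * wx for x, wx in zip(risk, w)) / sum(w)  # weighted 2nd moment
        d1 += x_i - m1          # first derivative of the log-likelihood
        d2 -= m2 - m1 * m1      # second derivative (minus a variance)
    return d1, d2

beta = 0.0
for _ in range(25):             # Newton-Raphson iterations
    d1, d2 = score_and_hessian(beta)
    beta -= d1 / d2
print(round(beta, 3))           # -0.834
```

Since the log partial likelihood is concave in β, the iteration converges quickly from the starting point β = 0.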
3.3 Ties
It is very likely that events will happen at the same time, especially when time is considered as a discrete variable. When studying a flock of geese, the researcher possibly checks in once a day, for lack of time and resources. The event of death is then accurate to the day, and two or more geese might have died on the same day.
To be precise, there are two different i, j ∈ I with the event (∆_i = ∆_j = 1) happening at the same time (S_i = S_j). Clearly expression (3.5) is now incorrect, but fortunately there are several ways to handle situations in which there are ties in the time data. These methods are approximations, since calculating the exact likelihood might be time-consuming. Breslow published a method in 1975 that is really close to the one described above; however, the approach of Efron [12] is more accurate, especially when the number of tied events at a time point is relatively large [13].
As usual the set of objects is finite, so suppose t_1, ..., t_n are the different times at which events occur. For a time point t_i the risk set R_i = {r_1, ..., r_k} is slightly different, and this time also an event set H_i = {h_1, ..., h_l} is used. Their definitions are

R_i = \{ j ∈ I \mid T_j ≤ t_i ≤ S_j \}, (3.7)
H_i = \{ j ∈ I \mid S_j = t_i, ∆_j = 1 \}. (3.8)

For a given order σ in which the tied events occur, the exact likelihood contribution at t_i is

L_σ(β) = \frac{θ_{h_1}}{\sum_{j ∈ R_i} θ_j} \times \frac{θ_{h_2}}{\sum_{j ∈ R_i} θ_j - θ_{h_1}} \times \cdots \times \frac{θ_{h_l}}{\sum_{j ∈ R_i} θ_j - θ_{h_1} - θ_{h_2} - \cdots - θ_{h_{l-1}}}, (3.9)
where θ_i = e^{βX_i}. All these likelihoods are different, because the order in which the θ_{h_1}, ..., θ_{h_l} are subtracted in the denominators affects the product. This can be avoided by subtracting the average θ̄ = (θ_{h_1} + ... + θ_{h_l})/l instead of the individual θ_{h_m}, which gives an approximation instead of the exact result.
Now the likelihood is the same for every order, namely

L(β) = \frac{θ_{h_1}}{\sum_{j ∈ R_i} θ_j} \times \frac{θ_{h_2}}{\sum_{j ∈ R_i} θ_j - θ̄} \times \cdots \times \frac{θ_{h_l}}{\sum_{j ∈ R_i} θ_j - (l-1)θ̄} (3.10)

     = \frac{\prod_{j ∈ H_i} θ_j}{\prod_{m=0}^{l-1} \left( \sum_{j ∈ R_i} θ_j - m θ̄ \right)}. (3.11)
Taking the product of all these factors over all the event times t_i gives the final likelihood. Now the same steps as above can be followed: apply the logarithm to obtain an expression for the log-likelihood, which can subsequently be optimized with the Newton-Raphson algorithm. Notice that when there are no ties at a certain time point, (3.10) reduces to the expression of the previous paragraph.
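Efron's factor (3.11) is easy to compute; the θ values in this Python sketch are made up. When there is only a single event, the factor reduces to the untied expression (3.5):

```python
import math

def efron_factor(theta_risk, theta_events):
    # likelihood factor (3.11) at one event time: theta_risk holds the
    # theta_j for the risk set, theta_events those of the tied events
    l = len(theta_events)
    theta_bar = sum(theta_events) / l
    numerator = math.prod(theta_events)
    denominator = math.prod(sum(theta_risk) - m * theta_bar for m in range(l))
    return numerator / denominator

theta_risk = [1.0, 2.0, 0.5, 1.5]        # made-up e^{beta X_j} values

# two tied events: denominator is 5.0 * (5.0 - 1.25) = 18.75
print(round(efron_factor(theta_risk, [2.0, 0.5]), 4))  # 0.0533
# a single event reduces to theta_i / (sum over the risk set)
print(efron_factor(theta_risk, [2.0]) == 2.0 / sum(theta_risk))  # True
```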
Example: Gehan data

Table 3.1: The Gehan data: survival times in weeks for 21 treated patients and 21 control patients (event = 1 if the event was observed, 0 if censored).

          Treated                  Control
      weeks   event            weeks   event
 1.     6       1        22.     1       1
 2.     6       1        23.     1       1
 3.     6       1        24.     2       1
 4.     6       0        25.     2       1
 5.     7       1        26.     3       1
 6.     9       0        27.     4       1
 7.    10       1        28.     4       1
 8.    10       0        29.     5       1
 9.    11       0        30.     5       1
10.    13       1        31.     8       1
11.    16       1        32.     8       1
12.    17       0        33.     8       1
13.    19       0        34.     8       1
14.    20       0        35.    11       1
15.    22       1        36.    11       1
16.    23       1        37.    12       1
17.    25       0        38.    12       1
18.    32       0        39.    15       1
19.    32       0        40.    17       1
20.    34       0        41.    22       1
21.    35       0        42.    23       1
The data set clarifies several of the concepts discussed earlier. Patient 9 is
an example of (right) censored data, as the exact time of death is unknown,
but the fact that the patient survived for at least 11 weeks will be used
nonetheless. Actually, the exact time of death of all the patients is unknown.
For some reason the survival time is measured in weeks. Hence tied survival
times are no exception, as can easily be concluded from table 3.1. The set is
rather small and has only a single covariate, but is nevertheless suitable for
the first real Cox regression of this thesis.
The data set will be analysed to obtain the coefficient of the covariate
“treated”. This uses the algorithm which has been described in this sec-
tion. Because the data is already in a usable format the R code for fitting
the Cox model is pretty compact.
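As an illustrative cross-check of that fit (presumably done with `coxph` from the R survival package), the same estimate can be reproduced in pure Python by maximizing the Efron partial log-likelihood over β directly. The helper names are this sketch's own, and a simple ternary search stands in for Newton-Raphson:

```python
import math

# Gehan data from table 3.1: (weeks, event); x = 1 for treated, 0 for control.
treated = [(6, 1), (6, 1), (6, 1), (6, 0), (7, 1), (9, 0), (10, 1), (10, 0),
           (11, 0), (13, 1), (16, 1), (17, 0), (19, 0), (20, 0), (22, 1),
           (23, 1), (25, 0), (32, 0), (32, 0), (34, 0), (35, 0)]
control = [(1, 1), (1, 1), (2, 1), (2, 1), (3, 1), (4, 1), (4, 1), (5, 1),
           (5, 1), (8, 1), (8, 1), (8, 1), (8, 1), (11, 1), (11, 1), (12, 1),
           (12, 1), (15, 1), (17, 1), (22, 1), (23, 1)]
data = [(t, d, 1.0) for t, d in treated] + [(t, d, 0.0) for t, d in control]

def efron_loglik(beta):
    """Efron partial log-likelihood with ties, cf. (3.10)/(3.11)."""
    ll = 0.0
    for t in sorted({s for s, d, x in data if d == 1}):
        events = [x for s, d, x in data if s == t and d == 1]
        risk_sum = sum(math.exp(beta * x) for s, d, x in data if s >= t)
        theta_bar = sum(math.exp(beta * x) for x in events) / len(events)
        ll += sum(beta * x for x in events)
        ll -= sum(math.log(risk_sum - m * theta_bar)
                  for m in range(len(events)))
    return ll

def argmax_scalar(f, lo, hi, iters=100):
    """Ternary search; the partial log-likelihood is concave in beta."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

beta_hat = argmax_scalar(efron_loglik, -5.0, 5.0)
```

On this data the search lands near β ≈ −1.57, in line with the coefficient discussed below.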
Given the amount of data, this is a reasonable result. The value for β (since
there is only one covariate, β contains only one value) is −1.572, which
means the hazard function is multiplied by e^{−1.572} ≈ 0.208. Recall
that this does not mean that treatment gives an 80% higher chance to
overcome leukemia. It means that at any given time the probability of dying
is only about 20 percent of what it would have been without treatment.
This shows that the treatment is quite effective. The likelihood-ratio test
at the bottom of the output is a test of the null hypothesis that β = 0. In
this instance the null hypothesis is soundly rejected.
Even more cogent than the coefficient in the output is a display of how the
estimated survival depends upon treatment, because the principal purpose
of the study was to assess the impact of treatment on survival.
Figure 3.1: Estimated survival functions for those receiving treatment and
the control group in the Gehan data. Each estimate is accompanied by a
point-wise 95-percent confidence envelope.
Now the first steps have been made, it is time to investigate the baseline
hazard function λ_0(t). Once again the maximum likelihood method will be
exploited. This time though, the work that has already been done is used.
The likelihood expression will be based on the value of β estimated just now,
the data and the still unknown λ_0(t).
As in the previous example, and in all following cases in this thesis, time will
be treated as a discrete variable. To simplify, 100 = {1, 2, . . . , 100} is chosen
as the domain of λ0 . When t = 3, λ0 (t) says something about the probability
that an asset survives its 3rd year of life. There is not a very strong reason
to bound the domain at 100 other than some field experience, since there
are not yet assets older than 100 years. Making the domain explicit will also
make the description easier to understand and it will be straightforward to
adapt the procedure to cases where a different domain is desired.
So the domain of λ0 consists of the natural numbers from 1 to 100. As a result
the baseline hazard function will be piecewise constant. Even though this is
far from realistic, this practical form still gives satisfactory predictions.
Figure 3.2: Piecewise constant baseline hazard function with the still unknown values λ_1, λ_2, ..., λ_100.
The main goal of this section is to estimate the theoretical values of the
baseline hazard function λ_0^{(0)}. Therefore a likelihood function will be
optimized over the values of λ_0. The maximum value of this likelihood
function is attained in λ̂_0. For each j ∈ 100, λ_0(j) is abbreviated to λ_j.
For λ_0^{(0)} and λ̂_0 similar notation is used.
The survival data D = (Ti , Si , ∆i , Xi )i∈I has the usual format as described
in section 2.4. The collection of assets I is finite and its cardinality is N .
Using Ti and Si , the year of birth Bi and (a bound for) the age of death Di
can be calculated. Since Di is the age at which asset i fails, if ∆i = 1 then
Di = Si and Di > Si otherwise. So sometimes Di is known exactly, but in
all cases the bound suffices. The distribution of this data is left unspecified.
At the end of this section it will be clear that the results hold independently
of the distribution of the data.
The hazard function of each asset i is based on its covariates, therefore this
also holds for its survival function Si and the cumulative hazard function Λi .
Using these, the probability can be rewritten in the following way.
P[D_i > j | D_i ≥ j] = e^{−(Λ_i(j+1) − Λ_i(j))} = e^{−∫_j^{j+1} λ_i(t) dt} = e^{−λ_j e^{βX_i}}.   (3.13)
The last step uses that the baseline hazard function is piecewise constant on
each interval [j, j + 1). The entire likelihood is built from expressions of this
form, going over all different assets and the years they have survived in the
observation window. Besides this, the following three indicator functions are
used,
1^S_j(i) = 1 if D_i > j (i.e. asset i survives age j), and 0 otherwise;
1^D_j(i) = 1 if D_i = j (i.e. asset i dies at age j), and 0 otherwise;   (3.14)
1^R_j(i) = 1 if j ∈ [T_i, S_i] (i.e. asset i reaches age j), and 0 otherwise.
The actual bounds of the observation window are left unspecified because the
results of this are independent of these bounds. These indicators are easily
calculated when the data is available. The expression for the likelihood of
the data in the observation window is then

L(λ, D) = ∏_j ∏_{i∈I} ( e^{−λ_j e^{βX_i}} )^{1^S_j(i)·1^R_j(i)} · ( 1 − e^{−λ_j e^{βX_i}} )^{1^D_j(i)·1^R_j(i)}.

Now the factor of each j, denoted by L_j(λ_j, D), only depends on λ_j. This
enables us to optimize over each λ_j separately because the likelihood is a
product of strictly positive factors. For the actual optimization the power
and speed of a computer is used.
Although the maximum likelihood is a respected approach, this is not yet a
completely satisfactory reason to use this specific method. The next theorem
actually states that this procedure is consistent. This means that for an
increasing number of observations the result will converge to the desired
value λ_j^{(0)}.

Theorem 4 (Consistency). Let D = (T_i, S_i, Δ_i, X_i)_{i∈I} be Cox survival data
with N = |I|. The value of λ_j for which L_j(λ_j, D) is maximal converges to
λ_j^{(0)} as N goes to infinity.
Proof. As stated before, the likelihood of each j ∈ 100 can be considered
separately. This likelihood is a function of the data D and more importantly
λ_j. By increasing the number of observations the likelihood will converge
towards the corresponding expected value. At the end of this proof it is
shown that the optimum of this expected value is attained in λ_j^{(0)}. Note that

log(L_j(λ_j, D)) = Σ_{i∈I} [ 1^S_j(i) · 1^R_j(i) · (−λ_j e^{βX_i}) + 1^D_j(i) · 1^R_j(i) · log(1 − e^{−λ_j e^{βX_i}}) ].   (3.17)
For the same reason it is allowed to divide the function by N, which will turn
the expression into an average over the set of observations I. This leads to

lim_{N→∞} (1/N) Σ_{n=1}^{N} y_n = E[Y].   (3.19)

To shorten notation, the indicators are written without i as variable, for
example 1^S_j instead of 1^S_j(i). The result looks like
E[ 1^S_j · 1^R_j · (−λ_j e^{βX}) + 1^D_j · 1^R_j · log(1 − e^{−λ_j e^{βX}}) ].   (3.20)
The goal is to optimize this expected value over the variable λ_j. First, some
rewriting is necessary. By the law of total expectation

E[Y_1] = E[ E[Y_1 | Y_2] ],   (3.21)

so conditioning on X and B turns (3.20) into

E[ 1^R_j · (−λ_j e^{βX}) · E[1^S_j | X, B] + 1^R_j · log(1 − e^{−λ_j e^{βX}}) · E[1^D_j | X, B] ].   (3.23)
The conditioned expected values that remain can be calculated explicitly. 1^S_j
means that an asset has survived for at least j years. The probability of
this can be written as the product of j terms which have the form of (3.13).
Similarly, 1^D_j implies that an asset survived for j − 1 years and then died in
year j. The resulting product is similar except for the last term. This gives

E[1^S_j | X, B] = ∏_{k=1}^{j} e^{−λ_k^{(0)} e^{βX}},
E[1^D_j | X, B] = ( ∏_{k=1}^{j−1} e^{−λ_k^{(0)} e^{βX}} ) · (1 − e^{−λ_j^{(0)} e^{βX}}),   (3.24)

and substituting these into (3.23) gives

E[ 1^R_j · (−λ_j e^{βX}) · ∏_{k=1}^{j} e^{−λ_k^{(0)} e^{βX}} + 1^R_j · log(1 − e^{−λ_j e^{βX}}) · ( ∏_{k=1}^{j−1} e^{−λ_k^{(0)} e^{βX}} ) · (1 − e^{−λ_j^{(0)} e^{βX}}) ].   (3.25)
In this expression, 1^R_j is left untouched because otherwise it would only make
the expression more complicated. Some rewriting gives the following clearer
product

E[ 1^R_j · ( ∏_{k=1}^{j−1} e^{−λ_k^{(0)} e^{βX}} ) · ( −λ_j e^{βX} · e^{−λ_j^{(0)} e^{βX}} + log(1 − e^{−λ_j e^{βX}}) · (1 − e^{−λ_j^{(0)} e^{βX}}) ) ].   (3.26)

The first factors do not depend on λ_j. Still it is not immediately clear that
a constraint on λ_j can be derived from the last factor since it also depends
on X. Nevertheless this factor, denoted by l(λ_j, X), is examined next:

l(λ_j, X) = −λ_j e^{βX} · e^{−λ_j^{(0)} e^{βX}} + log(1 − e^{−λ_j e^{βX}}) · (1 − e^{−λ_j^{(0)} e^{βX}}),   (3.27)

dl(λ_j, X)/dλ_j = −e^{βX} · e^{−λ_j^{(0)} e^{βX}} + ( e^{−λ_j e^{βX}} · e^{βX} ) / ( 1 − e^{−λ_j e^{βX}} ) · (1 − e^{−λ_j^{(0)} e^{βX}}).   (3.28)

The derivative is equated to zero. This will give the λ_j for which the original
likelihood is optimal. This value is denoted by λ̂_j to differentiate it from the
underlying variable.

0 = −e^{βX} · e^{−λ_j^{(0)} e^{βX}} + ( e^{−λ̂_j e^{βX}} · e^{βX} ) / ( 1 − e^{−λ̂_j e^{βX}} ) · (1 − e^{−λ_j^{(0)} e^{βX}})   (3.29)

e^{−λ_j^{(0)} e^{βX}} · (1 − e^{−λ̂_j e^{βX}}) = e^{−λ̂_j e^{βX}} · (1 − e^{−λ_j^{(0)} e^{βX}})

e^{−λ_j^{(0)} e^{βX}} − e^{−(λ_j^{(0)} + λ̂_j) e^{βX}} = e^{−λ̂_j e^{βX}} − e^{−(λ̂_j + λ_j^{(0)}) e^{βX}}   (3.30)

e^{−λ_j^{(0)} e^{βX}} = e^{−λ̂_j e^{βX}}

λ_j^{(0)} = λ̂_j

This shows that l(λ_j, X) ≤ l(λ_j^{(0)}, X) for every possible value of λ_j. This
result holds for all values of X. To summarize, the adjusted likelihood will,
by the law of large numbers, converge to its expected value and this expected
value admits its maximal value at λ̂_j = λ_j^{(0)}.
3.5 Simulation
After all the theoretical work, it would be good to see the algorithm at work.
Instead of discussing another example, this time a simulation is executed.
The data in an example is obtained from the real world, while data in a
simulation is artificial and therefore does not suffer from low data quality. The data
will be generated by precisely following the model with parameters that are
chosen beforehand. One can see a simulation as an example from the perfect
world. In the perfect world, the method of obtaining β and λ0 (t) should work
extremely well. Unexpected results from a simulation can help to point out
certain biases.
Imagine walking in a large zoo, wherein monkeys are living. Assume that
there is no zoo keeper who can remember another zoo keeper who in turn
can remember a time when there were no monkeys in the zoo. For some
reason each monkey gets the same amount of bananas every single day. Some
monkeys get 2 bananas a day, others eat 5 bananas a day and there are even
monkeys who don’t get any food at all. These apes are observed for 10 days
to determine the effect of the amount of bananas they get fed. Besides, the
survival curve in general is of interest. To create this imaginary troop of
monkeys, set β = 0.09 (there is just one covariate) and use the baseline
hazard function from figure 3.3.
Monkeys are generated under the assumption that the Cox model describes
their hazard rates perfectly. First, for each monkey the number of bananas it
gets is randomly chosen and then it is calculated how many days it survives;
this is a random process. Finally it has to be taken into account that not all
monkeys live during the 10 days of the observation interval. If a monkey
is lucky enough to be alive during the observation time, it is registered. This
process runs until n ∈ ℕ monkeys are generated.
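A minimal version of this generation-and-estimation loop can be sketched in Python. The true baseline of figure 3.3 is not reproduced here, so a flat λ_j = 0.1 is assumed, and the registration step (monkeys must be alive during the observation window) is ignored; following (3.13), a monkey that gets x bananas dies on day j with probability 1 − e^{−λ_j e^{βx}}. Apart from β = 0.09, all names and numbers are this sketch's own:

```python
import math
import random

BETA, DAYS = 0.09, 10
TRUE_LAMBDA = [0.1] * DAYS   # assumed flat baseline, standing in for fig. 3.3
rng = random.Random(1)

def simulate(n):
    """Generate n monkeys: bananas per day x, day of death, and a death flag."""
    monkeys = []
    for _ in range(n):
        x = rng.randint(0, 5)
        day = 0
        while day < DAYS:
            p_death = 1 - math.exp(-TRUE_LAMBDA[day] * math.exp(BETA * x))
            if rng.random() < p_death:
                break                              # dies on this day
            day += 1
        monkeys.append((x, day, day < DAYS))       # day == DAYS means censored
    return monkeys

def loglik_day(lam, monkeys, j):
    """Log of the day-j likelihood factor L_j as a function of lambda_j."""
    ll = 0.0
    for x, day, died in monkeys:
        if day > j:                                # survived day j
            ll -= lam * math.exp(BETA * x)
        elif day == j and died:                    # died on day j
            ll += math.log(1 - math.exp(-lam * math.exp(BETA * x)))
    return ll

def estimate(monkeys, j, lo=1e-6, hi=2.0, iters=60):
    """Maximize the concave day-j log-likelihood by ternary search."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if loglik_day(m1, monkeys, j) < loglik_day(m2, monkeys, j):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

monkeys = simulate(5000)
estimates = [estimate(monkeys, j) for j in range(DAYS)]
```

With this assumed baseline the per-day estimates recover λ_j ≈ 0.1 up to sampling noise, which is exactly the behaviour the consistency theorem predicts.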
Once the data is generated, the survival analysis is done as it is described in
this chapter. The simulation has run with n = 1000, 10000 and 100000. The
results are shown in table 3.2 and figure 3.4.
It comes as no surprise that the approximations get better when the number
of apes rises. For the same reason the estimates are much more accurate for
the first 40 days compared to the days after that period. The figure also shows
that the algorithm sets the failure rate to 0 when no data is available. This
is far from ideal and therefore some smoothing methods will be considered
later.
The example may look like a high school exercise, but it is a reliable simu-
lation. Further simulations have been performed with realistic hazard rates
and hazard ratios. Also in a situation isomorphic to the case of Alliander the
simulation gives satisfactory results.
3.6 Extensions of the Cox model
For classical problems, where the number of observations is much larger than
the number of predictors, the Cox model performs well. However, every day
more data becomes available; this has been described as the big data
phenomenon. Statistical questions have grown along with it and have
drastically increased in size. Sometimes the number of covariates is bigger
than the number of observations. The Cox model doesn't perform that well
in these situations.
Robert Tibshirani introduced lasso in order to improve the prediction accu-
racy and interpretability of regression models. He did this by altering the
model fitting process to select only a subset of the provided covariates in the
final model rather than using all of them [14]. This provides a solution that is
well-defined, and has fewer nonzero βi as well. Even in the classical situation
this may better estimate β than the original Cox model. Throughout the
years the method has been developed for fast convergence [15].
More recently, Zou and Hastie [16] proposed the elastic net for linear re-
gression. This is a combination of lasso and the earlier available method
called ridge regression. They hoped to get the best of both worlds. Park and
Hastie [17] applied this to the Cox model and proposed an efficient algorithm
to solve this problem. Their algorithm exploits the near-piecewise linearity
of the coefficients to approximate the solution at different constraints, and
then numerically maximizes the likelihood for each constraint via a Newton-
Raphson iteration initialized at the approximation. Goeman [18] developed
a hybrid algorithm, combining gradient descent [19] and Newton-Raphson.
Friedman et al. [20] instead employ cyclical coordinate descent to develop a
fast algorithm to fit the Cox model with elastic net penalties. The last at-
tempt has resulted in glmnet, an R package which will be used for the results
in this thesis. In the rest of this section the notions of these extensions of
the Cox model will be discussed.
Lasso
L_r(β) = L(β) + ‖β‖².   (3.31)
Figure 3.5: Geometric interpretation of minimizing the ridge (L2-norm) and the lasso (L1-norm) penalty.
Elastic net
As mentioned in the introduction of section 3.6, lasso and ridge regression can
be combined into what is called an elastic net, for which the corresponding
expression for the likelihood is

L_en(β) = L(β) + γ ( α‖β‖₁ + (1 − α)‖β‖₂² ).
The elastic-net penalty is controlled by α, and bridges the gap between lasso
(α = 1) and ridge (α = 0). The tuning parameter γ controls the overall
strength of the penalty. It is known that the ridge penalty shrinks the
coefficients of correlated predictors towards each other, while the lasso tends
to pick one of them and discard the others. The elastic-net penalty mixes
these two: if predictors are correlated in groups, an elastic net with α = 0.5
tends to select the groups in or out together. The elastic-net penalty also
achieves numerical stability; for example, if α = 1 − ε for some small ε > 0,
the elastic net performs much like the lasso, but removes the degeneracies
and wild behaviour caused by extreme correlations, as ridge regression would
do.
The result is that highly correlated covariates will tend to have similar re-
gression coefficients which is very different from lasso. This phenomenon is
referred to as the grouping effect and is generally considered desirable. In
many applications one would like to find all the associated covariates, rather
than selecting only one from each set of strongly correlated covariates.
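The coordinate-descent idea of Friedman et al. is easiest to see in the squared-error case: each coefficient is updated in turn by soft-thresholding, which is where the lasso part sets coefficients to exactly zero. A pure-Python toy sketch follows; it is not the glmnet implementation, and the data and parameter values are made up:

```python
def soft_threshold(z, t):
    """S(z, t): the soft-thresholding operator behind the lasso update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, gamma, alpha, iters=500):
    """Cyclical coordinate descent for the penalized least-squares objective
    (1/2n)*||y - X*beta||^2 + gamma*(alpha*||beta||_1 + (1-alpha)/2*||beta||_2^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # partial residuals with feature j left out
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            z = sum(X[i][j] * r[i] for i in range(n)) / n
            denom = sum(X[i][j] ** 2 for i in range(n)) / n + gamma * (1 - alpha)
            beta[j] = soft_threshold(z, gamma * alpha) / denom
    return beta

# Toy data: y = x1 + 0.5*x2 exactly; the third feature is a near-copy of x1.
X = [[1, 0, 0.9], [0, 1, 0.1], [1, 1, 1.0],
     [-1, 0, -0.8], [0, -1, -0.2], [-1, -1, -1.0]]
y = [1.0, 0.5, 1.5, -1.0, -0.5, -1.5]
b = elastic_net(X, y, gamma=0.01, alpha=0.5)
```

With a small penalty the fit stays close to least squares, while a large penalty with α = 1 drives every coefficient to exactly zero, which is the selection behaviour described above.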
Glmnet
Chapter 4
A decision tree is a structure that repetitively splits data based on the values
of covariates into smaller subsets. The process stops when eventually the
subsets become too small, resulting in a conclusion based on each final subset.
The structure can be visualized as a tree, such as the example in figure 4.1.
This decision tree can be used for predicting whether a first year math student
will complete their study. Given the covariates of an individual, the tree
narrows down the actual probability of survival. Each node corresponds to
a certain covariate and each of its children to the values that this covariate
can take. A leaf represents the conclusion, the predicted value of the target
variable. It is straightforward to understand and interpret a decision tree.
This transparency allows us not only to use the predictions but also to observe
the decision rules that are created.
Figure 4.1: Percentage of first year math students that don't drop out. The tree consists of a single split on "age > 20", with value 0.12 for the "yes" branch and 0.26 for the "no" branch.
Trees in which all the covariates, and in particular the target variable, can take
values from a finite set are called classification trees. But decision trees can
predict both numerical and categorical data; the method is not specialised
in data sets that contain only one type of variable. For a problem such as
the one in this thesis, this flexibility is useful. Trees in which the target
variable can take continuous values (typically real numbers) are called
regression trees.
Decision trees are a popular method for machine learning tasks because they
require little data preparation. They serve as an off-the-shelf procedure even
for incomplete or unnormalized data. On top of that, decision trees perform
well with large data sets: large amounts of data can be analysed using
standard computing resources in reasonable time.
The beauty of decision trees lies not in the interpretation but in the construc-
tion. As ‘machine learning’ suggests, a computer generates the tree based on
available data. A tree starts by splitting the source set into subsets based
on the values of a covariate. This process is recursively repeated on each
derived subset. The recursion is completed when splitting no longer changes
the distribution over the target variable, generally because the subset has
a uniform value for the target variable. The splitting also stops when the
subset has become too small.
The covariate and the conditions on the value at each split are carefully
selected in order to get the best split for that moment. By doing so the
algorithm keeps its focus on the most relevant variables during the construc-
tion and is thereby robust to inclusion of irrelevant features. Here “best”
is objectively measured with some metric; this is explained in the next
section. Even though this strategy greedily selects the best split at each step
in the construction, this does not mean the result is the optimal decision
tree. It turns out that the problem of learning an optimal decision tree is
NP-complete. Nevertheless, the local optimum that the greedy algorithm
finds is a practical solution.
Selecting splits
1. Gini impurity
Suppose a split f divides the data into the subsets I_1, ..., I_n and let
f_c(I_k) be the fraction of items in I_k with class c ∈ C. The Gini metric
of the split is

M_G(f) = Σ_{k=1}^{n} Σ_{c∈C} f_c(I_k)(1 − f_c(I_k)).

The purer each subset is, the closer this number gets to zero. So
minimizing it results in a split that best separates the values of a
categorical target variable.
2. Information gain
Another way of classifying a split can be to look at the difference of
information gain. Information theory studies, among other things, the
quantification of information. The entropy or information gain H of a
random variable X is defined as minus the expectation of the logarithm
of the probability distribution, i.e.
H(X) = E[−log₂(F(X))].

For a split f this gives the metric

M_E(f) = −Σ_{k=1}^{n} Σ_{c∈C} f_c(I_k) log₂(f_c(I_k)).

3. Variance reduction
For regression trees, where the target variable is numeric, a split into
the subsets I_T and I_F can be scored by

M_V(f) = −(1/|I_T|²) Σ_{a,b∈I_T} ½(y_a − y_b)² − (1/|I_F|²) Σ_{a,b∈I_F} ½(y_a − y_b)².
4.2 Random forest
The largest problem with decision tree learning is overfitting [22]. Trees tend
to get too complex and therefore do not generalize well, or they are grown
too deep (the subset at each leaf is too small) and learn highly irregular
patterns. So although they are easy to obtain, they are seldom accurate.
Besides, decision trees have problems learning concepts like XOR, parity and
the multiplexer. Moreover, as pointed out before, splits based on
information gain are biased in favor of attributes with more levels. Fortunately,
creating an ensemble of trees by adding some randomness can solve these
problems and greatly boost the performance of the model. This leads to the
random forest algorithm.
The word forest suggests the construction of multiple decision trees. The
general idea is that the final prediction is the mode of the classes (classi-
fication) or the mean prediction (regression) of the individual trees. This
method leads to better model performance. While the predictions of a single
tree are highly sensitive to noise in its training set, the average of many trees
is not, as long as the trees are not correlated.
To correct for the decision trees’ habit of overfitting it is important that the
correlation between the trees is reduced. In a first attempt, by Tin Kam Ho
[23], trees were trained on random samples of variables instead of the entire
set of covariates. Informally, this causes individual learners to not over-focus
on features that appear highly descriptive in the training set, but fail to be
as predictive for points outside that set. Several variations on Tin Kam Ho’s
work had the same surprising result. Until then it was commonly believed
that the complexity of a classifier can only grow to a certain level, exceeding
that level will hurt the accuracy with overfitting. The new results shattered
this assumption.
The idea of Ho was extended by Leo Breiman and Adele Cutler [24]. They
named their algorithm Random forests and it became their trademark. The
algorithm was also influenced by an article of Amit and Geman [25] which
introduced the idea of injecting randomness while splitting nodes. These
notions were combined with Breiman's earlier invention of tree bagging,
where each tree is fitted to a random sample of the training set instead
of the entire training set. This creates even more variation in the trees in
order to create a healthy forest. The next section will describe the most
important building blocks of Random forest.
4.3 Random subspace method
Ho named his first attempt to solve the overfitting problem of decision trees
the random subspace method. Before constructing a tree, a random subset
of the covariates is selected. The algorithm that would otherwise select the
same descriptive covariate again and again is now forced to consider different
options. So instead of always selecting the optimal split, a forest of trees with
less optimal splits is created.
The quality of a random forest mainly depends on two things: the correlation
between the trees and the strength of each individual tree (see section
4.3). Ideally, the correlation is low while the individual strength is large.
Both values depend on the number m of covariates that is randomly
selected for each tree. Increasing m results in a higher correlation and a
larger individual strength, and reducing m has the opposite effect. So
somewhere in the middle lies the optimal value for m, which is selected by the
algorithm itself. Typically, for a classification problem with M covariates,
m = ⌊√M⌋ features are used in each split. For regression problems the
inventors recommend m = ⌊M/3⌋ with a minimum of 5 as the default.
Tree bagging
Out-of-bag error
This is useful because now the previously discussed extensions can be
optimized. For the random subspace method the optimal number of selected
features can be determined. In tree bagging the size of the bag is a free
parameter which can also be optimized. Finally, the out-of-bag error is also
used to show the importance of each variable.
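The bagging and out-of-bag mechanics can be sketched in a few lines of Python. To keep the sketch short the "tree" fitted on each bag is replaced by a trivial mean predictor, so only the bookkeeping is shown; on average about 36.8% (≈ e⁻¹) of the points fall outside a given bag:

```python
import random

def bootstrap_indices(n, rng):
    """One bag: n draws with replacement; OOB = the indices never drawn."""
    bag = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(bag)
    return bag, oob

def oob_error(ys, n_trees=200, seed=0):
    """Out-of-bag squared error of a bagged ensemble.

    A real random forest fits a tree per bag; here each 'tree' is just the
    mean of its bag, so only the bagging/OOB mechanics are demonstrated.
    """
    rng = random.Random(seed)
    n = len(ys)
    oob_preds = [[] for _ in range(n)]
    for _ in range(n_trees):
        bag, oob = bootstrap_indices(n, rng)
        fit = sum(ys[i] for i in bag) / n          # the 'model' for this bag
        for i in oob:
            oob_preds[i].append(fit)               # predict only out-of-bag
    errors = [(sum(p) / len(p) - ys[i]) ** 2
              for i, p in enumerate(oob_preds) if p]
    return sum(errors) / len(errors)

error = oob_error([float(v) for v in range(10)])
```

Because every point is predicted only by the bags that did not see it, this error estimate comes for free, without a separate test set.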
Another way of characterizing the accuracy of random forest is by calculating
how far the percentage of trees with the right prediction exceeds the percent-
age of trees for any other (wrong) prediction. Breiman used this to formulate
a probability that the prediction of a random forest is wrong. In his article
he proves that this probability does not converge to zero for an increasing
number of trees. This is the mathematical support for the counterintuitive
fact that a more complex forest is not prone to overfitting.
There is an upper bound for the probability that the prediction of a random
forest is wrong. The two ingredients involved in this bound are the strength
(s) of the individual trees and the correlation (c) between them. This upper
bound has the order of magnitude of c/s2 . This helps to understand the
functioning of random forest.
Variable importance
gender  school type  av intake  intake  VR  LR      exam
boy     mixed        0.166      1       2   0.619   0.261
boy     mixed        0.166      2       3   0.205   0.134
girl    mixed        0.166      3       2   −1.364  −1.723
boy     mixed        0.166      1       2   0.205   0.967
girl    girls        0.103      2       1   0.371   0.544
girl    girls        0.103      1       2   2.189   1.734
…       …            …          …       …   …       …

Table 4.1: Data set with school results of children from London.
portance of the variables is calculated. The data set consists of 4059 children
from 65 different schools in London. The head of the data is shown in table
4.1. From every student several details are recorded. The gender, the school
type and the average intake score of the school are known. Furthermore the
results of four tests are included: the intake, verbal reasoning, logical reasoning
and the final exam.
One might think the gender of a student has a lot of influence on the final
exam score; now it is possible to actually test whether this holds true. The
following code gives the variable importance of the fitted random forest
model.
The output shows that the logical reasoning and the intake test are the most
important covariates. The gender of the student has only a marginal effect
on the final exam in comparison to the other variables.
Missing value replacement
Empty entries in the training data can be problematic for fitting random
forest. Missing data can be replaced with the most frequent value or the
average of the non-missing values for that variable. This might seem to be
a crude solution, but it actually gives quite a satisfying result in most cases,
as will be shown in the next example.

The result can be boosted even more. After running random forest one can
calculate the error of the target values for each individual in the training
set; let's call these proximities. Now each missing entry is again filled in by
the most frequent value or the average of the values, but now weighted by
the proximities. So accurate predictions are considered more reliable. From
experience it is known that four to six iterations are enough.
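The weighted refill described above can be sketched for a single numeric variable (pure Python; the weights are made up and stand in for the proximities):

```python
def weighted_impute(values, weights):
    """Fill missing entries (None) with a weight-averaged value of the rest."""
    known = [(v, w) for v, w in zip(values, weights) if v is not None]
    total = sum(w for _, w in known)
    fill = sum(v * w for v, w in known) / total
    return [fill if v is None else v for v in values]

# Made-up column with two gaps; the weights play the role of the proximities.
vals = [1.0, None, 3.0, None, 5.0]
wts = [1.0, 0.0, 4.0, 0.0, 1.0]
filled = weighted_impute(vals, wts)
```

Observations with a high weight pull the fill value towards themselves, which is how more reliable predictions get more influence over each iteration.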
The next example compares the two methods of value replacement. This
time data from baseball players is used to predict their annual income. A
small snapshot of the data is given in table 4.2.
This data set is entirely filled out except for some incomes of first year players.
Since random forest cannot handle missing target values, these players are
removed. A total of 263 players remains from the original 322.
Next, some percentage p of the records is randomly removed and filled back
in with both methods. The first method uses the average or most common
value. The second method also uses the iterations described above. The data
is split into a training set and a test set. The random forest model is fitted
on the training set and used for predictions on the test set. The mean error
of the predictions is plotted.
library(randomForest)

baseball <- read.delim("~/data/baseball.txt")
baseball <- baseball[!is.na(baseball$Y), ]   # drop players with missing income

RemoveData <- function(percent, data) {
  # Randomly blank out `percent` percent of every column except the target Y.
  if (percent != 0) {
    for (i in colnames(data)) {
      if (i != "Y") {
        data[sample(nrow(data), floor(nrow(data) / 100 * percent)), i] <- NA
      }
    }
  }
  return(data)
}

MeanPredError <- function(percent, data, impute = FALSE) {
  set.seed(37)

  data <- RemoveData(percent, data)
  if (impute) {
    data <- rfImpute(Y ~ ., data = data)   # iterative, proximity-weighted refill
  } else {
    data <- na.roughfix(data)              # mean / most frequent value
  }

  samp <- sample(nrow(data), floor(0.7 * nrow(data)))
  trainset <- data[samp, ]
  testset <- data[-samp, ]

  model <- randomForest(Y ~ ., data = trainset)
  testset$PRED <- predict(model, testset)
  testset$DIFF <- abs(testset$PRED - testset$Y)
  return(mean(testset$DIFF))
}

error_roughfix <- rep(0, 100)
error_impute <- rep(0, 100)
for (i in 0:99) {
  error_roughfix[i + 1] <- MeanPredError(i, baseball)
  error_impute[i + 1] <- MeanPredError(i, baseball, TRUE)
}
plot(error_roughfix, col = "red")
points(error_impute, col = "blue")
Figure 4.2: Mean prediction error after removing a percentage of the data.
Figure 4.3: Mean prediction error after removing a percentage of the data in
the training set. The test set is complete.
In both cases the extended method is slightly more successful. More surpris-
ing is that not much prediction power is lost after a substantial percentage
of the data is removed.
Chapter 5
Implementation
The previous chapters gave a precise and formal description of two methods
for survival analysis. As always, the practical implementation is not without
any obstacles and this chapter is an attempt to cover the application of the
methods. Moreover, this chapter reveals some additional technical details
that are worth mentioning.
Figure 5.1: A cross section of a joint connecting two red medium voltage
cables.
coordinates, but for cables both the coordinates of the start and end point are
known. The grid is divided into rayons, which is of course only a projection
onto a smaller set given the coordinates of each asset. Nevertheless, knowing
the rayon can be valuable because rayons might hold historical information.
Examples of rayons are: “Amsterdam”, “Rijnland” and “Gelderland-Oost”.
Figure 5.2: A plot of the coordinates of the medium voltage joints, a red dot
indicates the occurrence of an event.
Besides the exact location, the covariates also provide a detailed descrip-
tion of the surroundings, for example the type of soil and the subsidence
of the ground. Next to that weather-related information like the number of
dry days and the temperature is known. Even the distance to the nearest
weather station is added such that the accuracy of the weather information
can be estimated. All the just mentioned covariates are known to be influ-
ential. For instance, underground assets tend to decay faster in hot and dry
circumstances.
It is common sense that the distance to the nearest weather station should
not have a direct effect on the hazard rate. This covariate gave the inspiration
to investigate interactions between covariates. Weather based covariates are
probably less accurate if the distance to a weather station increases. Even-
tually these interactions are not taken into account. Still this covariate is
added to the analysis because the algorithms should be robust enough to
handle covariates with no direct influence.
The environment is furthermore described by the distances to the nearest
tree, nearest building and nearest railroad. Assets close to a tree are more
vulnerable for failure, since a storm could take down a tree which in turn
can damage the grid. Moreover, some numbers that indicate the population
density are available.
Other meaningful data concerns the construction of the asset. For joints only
the type and sort are written down. The type indicates for what purpose
the joint is placed and the sort tells how the joint itself is made. Especially
the latter is believed to be of great importance. For instance a joint of sort
“nekaldiet” has a bad reputation. It is the task of this thesis to give an
objective final judgement in this case. The construction of the cables has
more parameters such as length, diameter, conductor and isolator.
Even though it can probably be derived from the construction, there are
some covariates that store the usage profile. The voltage level and for cables
also the maximum and mean load fraction are known.
Finally, because defects in the grid can damage surrounding assets, for each
asset the number of close failures is known.
First, the data has to be prepared for the analysis. The first rigorous step
is to delete all entries whose year of placement is missing. As explained in
the previous section, the year in which an asset was placed is not always
registered, but this information is crucial for Cox regression.
The dates on which an asset enters and leaves the observation window are
both registered. For the regression, however, only the years of these events
are used, because only the year (and not the exact date) of placement is
known. This is in line with the earlier explanation that the baselines treat
time as a discrete variable.
Unfortunately, there are still missing records and the method cannot be
applied to these. Therefore they are imputed in a straightforward manner:
numeric values are imputed by the mean and missing factors are replaced
by a new level named “unknown”. An example of a factor is the covariate
“sort”, which has “nekaldietmof” or “oliemof” as possible values. In the final
calculation of the hazard rate, this factor itself cannot be multiplied with a
corresponding β; instead, each value of the factor has its own separate β. To
exploit this property, some of the numeric covariates are artificially
transformed into factors. It turned out that this approach led to better results.
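The imputation and factor handling just described can be sketched as follows. This is a simplified illustration, assuming records as plain dictionaries; the column names and values follow the example in the text, not Alliander's actual schema.

```python
from statistics import mean

def impute(records, numeric_cols, factor_cols):
    """Impute missing numeric values by the column mean and
    missing factor values by a new level named "unknown"."""
    for col in numeric_cols:
        col_mean = mean(r[col] for r in records if r[col] is not None)
        for r in records:
            if r[col] is None:
                r[col] = col_mean
    for col in factor_cols:
        for r in records:
            if r[col] is None:
                r[col] = "unknown"
    return records

def expand_factor(records, col):
    """Give every level of a factor its own indicator column, so a
    regression can assign a separate beta to each level."""
    levels = sorted({r[col] for r in records})
    for r in records:
        for lev in levels:
            r[f"{col}={lev}"] = 1 if r[col] == lev else 0
    return levels
```

Transforming a numeric covariate into a factor, as described above, then amounts to binning its values and applying `expand_factor` to the result.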
5.2.1 Glmnet
The Glmnet package is mainly used for making a selection of the covariates.
The package also returns the corresponding values for β, but these are not
used: fitting the Cox model again, now on the selected subset and with the
official Cox package, gives better predictions.
In chapter 3 it is explained that Glmnet uses a penalty to select covariates.
Figure 5.3 shows that the likelihood is optimal for a penalty of roughly e−10.5
(in which case 107 of the 137 initial covariates are still left). However, Glmnet
selects the penalty with the smallest number of selected covariates whose
likelihood is still close enough to the optimal value. The selected penalty
lies around e−8 and leaves 33 covariates with a coefficient other than zero.
Figure 5.3: The penalty factor plotted against the likelihood and the number
of selected covariates.
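This selection rule can be sketched as follows. The `tolerance` argument stands in for Glmnet's actual "close enough" criterion (in the package this is the one-standard-error rule), and the path values below are illustrative:

```python
def select_penalty(path, tolerance):
    """path: list of (penalty, log_likelihood, n_nonzero) entries along
    the regularisation path. Return the entry with the fewest nonzero
    coefficients whose likelihood lies within `tolerance` of the best."""
    best_ll = max(ll for _, ll, _ in path)
    admissible = [entry for entry in path if entry[1] >= best_ll - tolerance]
    return min(admissible, key=lambda entry: entry[2])

# An illustrative path: the optimum keeps 107 covariates, but a nearby
# penalty keeps only 33 and is selected instead.
path = [(0.001, -100.0, 107), (0.01, -101.0, 60),
        (0.1, -102.0, 33), (1.0, -150.0, 5)]
penalty, loglik, n_selected = select_penalty(path, tolerance=5.0)
```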
Glmnet can handle large numbers of covariates and is still able to select the
most important ones. This makes it possible to experiment a little. One
of the experiments was to feed Glmnet the covariates after first applying a
function to them; the idea was that this would improve fits to quadratic or
logarithmic relationships. It is not used in the final model because it was
not particularly successful.
In another experiment, meaningful interactions between covariates were
studied. In theory this would make it unnecessary to run the method
separately on each asset type: if a specific covariate were more important
for a certain type, the method would select that combination. First, the
normal selection procedure of Glmnet was executed. The selected covariates
were then pairwise multiplied, which led to a quadratic number of covariates.
Glmnet was then asked to perform another selection. Note that it could still
choose the normal covariates, which are the ones multiplied by themselves.
Glmnet did in fact select combinations of covariates but, surprisingly, the
predictive power got worse.
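The construction of the candidate set in this experiment can be sketched as follows; the column names are hypothetical.

```python
from itertools import combinations_with_replacement

def pairwise_products(columns, data):
    """data maps a column name to its list of values. Every pair of
    selected columns is multiplied; pairing a column with itself keeps
    a (squared) version of the original covariate in the candidate set.
    For n columns this yields n*(n+1)/2 candidates."""
    products = {}
    for a, b in combinations_with_replacement(columns, 2):
        products[f"{a}*{b}"] = [x * y for x, y in zip(data[a], data[b])]
    return products
```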
The Akaike information criterion (AIC) deals with the trade-off between the
goodness of fit and the complexity of the model. Let L be the maximum
value of the likelihood function for the model and let k be the number of
estimated parameters. Then the AIC value of the model is

AIC = 2k − 2 ln L.

From the formula it follows that a smaller AIC value indicates a better
model. The AIC is calculated for the last fifty values of j, because this is
the part where the models differ.
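A minimal numeric sketch of such a comparison; the log-likelihoods and parameter counts below are illustrative, not the actual fitted values:

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller is better. The 2k term
    penalises model complexity."""
    return 2 * n_params - 2 * log_likelihood

# A more flexible smoothing fits better but uses many parameters;
# the segmented one can still win on AIC.
flexible = aic(log_likelihood=-120.0, n_params=50)   # 340.0
segmented = aic(log_likelihood=-125.0, n_params=10)  # 270.0
```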
Table 5.1
According to Akaike, the segmented smoothing is the better model, and this
will be shown later in this chapter. The criterion penalises an increasing
number of parameters, which discourages overfitting: although the likelihood
of the unsegmented version is higher, its predictive power might be lower. In
the next chapter only the most successful smoothing technique is applied.
Implementing Random forest is less work, because the method can handle
big unprepared data sets. Like the Cox model, Random forest cannot handle
missing entries, so these are all imputed with the extended imputation from
chapter 4. Note that now the year of placement is imputed as well, instead
of deleting the affected records. Another difference is that the full dates of
the observation window are used, again because Random forest places fewer
restrictions on the data set.
After the whole data table is filled, Random forest is invited to build a model.
Stratified sampling is used to over-represent failed assets when a tree is built;
this makes the model slightly better. The stratification ratio is 10 : 1, so on
average the set used for building a tree contains one failed asset for every
ten censored assets.
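The 10:1 idea can be sketched as follows; the actual sampling scheme of the randomForest package differs in detail (it resamples per tree), so this is only an illustration.

```python
import random

def stratified_sample(failed, censored, ratio=10, seed=0):
    """Build a training set containing, on average, one failed asset
    for every `ratio` censored assets: keep all failed assets and
    subsample the (far more numerous) censored ones."""
    rng = random.Random(seed)
    n_censored = min(len(censored), ratio * len(failed))
    return failed + rng.sample(censored, n_censored)
```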
5.4 Objective test
Before applying the two methods to the whole data set, they are objectively
tested. The concept of this test was already mentioned in chapter 2. Failures
have been registered from 2007 to 2017. The failures up to 2015 and the
state of the assets at that point are used to train the models. Afterwards
the models predict the failure rates of the assets that were alive in the period
from 2015 to 2017. The test has also been performed with other split points,
but only one split is shown here; it gave the best results for both methods
and all data sets.
After the predictions are made, the test set is ordered by predicted failure
rate. The curve then shows the percentage of assets that failed between 2015
and 2017, plotted against the ordered data set. This is called an ROC-curve
and is often used to evaluate binary classifiers. A larger area under the
curve means that, on average, failed assets were assigned a higher failure
rate than censored assets, so one aims for the largest possible area under
the curve.
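The area under this curve equals the probability that a randomly chosen failed asset was assigned a higher failure rate than a randomly chosen censored one, which gives a direct (if naive, quadratic-time) way to compute it:

```python
def auc(scores_failed, scores_censored):
    """Probability that a randomly chosen failed asset received a
    higher predicted failure rate than a randomly chosen censored
    one; ties count half. Equivalent to the area under the curve."""
    wins = 0.0
    for f in scores_failed:
        for c in scores_censored:
            if f > c:
                wins += 1.0
            elif f == c:
                wins += 0.5
    return wins / (len(scores_failed) * len(scores_censored))
```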
It might be difficult to judge the outcome of this test on its own; one might
have expected a higher number. But keep in mind that the models are fitted
to only part of the real failures and have to predict which assets are going
to fail, although less than 1 percent actually failed.
Table 5.2: Area under curve for the two methods and the four data sets split
at 2015.
The result of the test is not discussed in this chapter; it is taken into
account in the final judgement at the end of the next chapter. This test
is designed to compare the two methods, but it does not fully exploit the
data in order to get the best fitting model. The values of the most accurate
models are interesting for Alliander, so both methods are also applied to the
full data set. This is of course done with cross-validation: the data set is
split into five parts, and to determine the failure rates of one part the model
is trained on the four other parts.
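The cross-validation scheme can be sketched as follows; `fit` and `predict` are placeholders for fitting and evaluating either of the two models.

```python
def cross_validated_predictions(assets, fit, predict, k=5):
    """Split the data into k parts; the failure rates of each part are
    predicted by a model trained on the other k-1 parts."""
    folds = [assets[i::k] for i in range(k)]
    predictions = []
    for i in range(k):
        train = [a for j, fold in enumerate(folds) if j != i for a in fold]
        model = fit(train)
        predictions.extend(predict(model, folds[i]))
    return predictions
```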
Figure 5.4: ROC-curve of the Cox model without smoothing method on
medium voltage joints. The area under the curve is 0.6538.
Chapter 6
Following the description in the previous chapter leads to a program. For
this thesis it was possible to run this program on a server of Alliander,
which had a faster processor and more memory available. This chapter
contains all the results of both the Cox model and Random forest generated
by the code. Some work has been done to help the reader interpret the
information in the right way. At the end of this chapter a final discussion
concludes which of the two models is the better one.
Two coefficients for medium voltage joints

Covariate            Value  Factor  Lower  Upper  Variance
dry days             1110   1.345   1.042  1.735  0.015
                     1150   2.642   1.791  3.898
                     1200   0.993   0.822  1.200
main cable failures         1.063   0.984  1.148  0.429

Table 6.1
Table 6.1 compares the covariate “dry days” (DD) to the covariate “main
cable failures” (MCF). In the data set DD takes on 13 different values, but
Glmnet has selected only 3 of them; the Cox regression has given each of
these its own factor. Apparently, if DD is 1150 the hazard rate is multiplied
by 2.642, resulting in a higher risk, while the risk is lower if the value is 1200.
The MCF of the assets is much more diverse, but the mean impact is 1.063,
which is relatively close to 1. However, the variance of MCF is significantly
high, so the hazard rates are multiplied by a wide variety of factors. This
shows that MCF has a big impact on the predictions.
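How such factors enter a prediction can be sketched as follows. The baseline value below is made up; the factor 2.642 is the one reported for DD = 1150.

```python
def hazard(baseline, factors):
    """A Cox prediction: the baseline hazard multiplied by the factor
    of every covariate. Covariate values not selected by Glmnet
    contribute a factor of 1 and can simply be omitted."""
    h = baseline
    for f in factors:
        h *= f
    return h

# A joint with DD = 1150 has its baseline hazard multiplied by 2.642:
risk = hazard(0.01, [2.642])
```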
At the start of the analyses there were roughly 150 covariates. Table 6.2
displays the full list of coefficients that is left after the selection by Glmnet.
The most important covariate is “construction”: the “nekaldietmof” is
roughly four times as bad as the others in the table. Keep in mind that not
all values are mentioned, so for instance the “gietharsmof” is only twice as
good as the “nekaldietmof”, since values that are not mentioned in the table
effectively have a factor equal to 1.
Joints with an unknown voltage have an incredibly high risk. But there are
negligibly few assets with an unknown voltage; this can be deduced from the
wide range between the lower and upper 95%-bounds. The same holds for
one value of the covariate “number of dry days”. These situations are
therefore not significant for predictions of the future. Nevertheless, it might be
worth investigating these specific situations, because the underlying cause
of such extreme values could be a solvable flaw in the system. In all tables
the irrelevant extreme values are coloured red.
The covariates “rayon” and “voltage” are important because they have ex-
treme factors.
The last covariate worth mentioning is “main cable failures”. Despite the
fact that the factor is close to 1, the variance is what makes this covariate
important: because each main cable failure increases the failure rate by
6.3 percent, the wide range of the covariate translates into a large variance
of the factor. In all tables the most relevant factors are coloured blue.
The tables with coefficients for the three other data sets are given but not
discussed any further.
Table 6.2
Coefficients for medium voltage cables
Covariate Value Factor Lower Upper Variance
wires 3 2.563 1.788 3.674 0.225
distance to railroad [10, 100] 1.419 1.161 1.734 0.013
y coordinate 3.232 3.232 3.232 0.128
galvanic failures 0.908 0.891 0.925 0.009
copper 0.827 0.708 0.967
conductor 0.023
unknown 0.332 0.175 0.629
light clay 1.237 1.026 1.491
fine sand 0.805 0.640 1.011
soil 0.015
mire 1.302 1.053 1.610
heavy clay 1.276 1.039 1.568
main cable failures 1.079 1.015 1.147 0.773
digging activity near main cable 0.780 0.779 0.781 0.035
12.5 0.711 0.467 1.081
height isoline 0.007
-2.5 1.149 0.980 1.347
isolation XLPE 2.088 1.650 2.642 0.239
2 0.818 0.565 1.184
subsidence 4 0.829 0.710 0.967 0.023
8 1.475 1.205 1.805
Friesland 1.497 1.166 1.923
GD-oost 0.872 0.680 1.120
rayon 0.079
Rijnland 1.457 1.156 1.836
Veluwe 0.531 0.401 0.702
overflow 1.156 0.994 1.345 0.002
10.000V 1.183 0.806 1.737
voltage 0.028
unknown 0.122 0.028 0.524
Table 6.3
Coefficients for low voltage joints
Covariate Value Factor Lower Upper Variance
dry days 0.004 0.004 0.004 0.000
inhabitants 1.198 1.198 1.198 0.075
address density 1.013 1.013 1.013 0.000
krimpmof 1.460 0.355 6.001
kunstharsmof 0.391 0.222 0.688
massamof 0.501 0.274 0.914
nekaldietmof 1.301 0.554 3.057
construction 0.009
unknown 0.535 0.406 0.704
cellpack 0.706 0.522 0.956
filoform 0.921 0.677 1.254
other 0.720 0.539 0.961
low mineral 0.523 0.288 0.949
mire 0.705 0.626 0.794
soil 0.011
heavy clay 0.702 0.592 0.833
fine sand 0.711 0.617 0.820
2.5 1.245 1.155 1.343
52.5 3.650 1.735 7.681
height isoline 0.016
70 6.585 2.949 14.710
80 4.511 1.125 18.080
2 0.526 0.441 0.627
3 0.690 0.579 0.823
subsidence 0.051
4 0.671 0.608 0.742
8 1.409 1.297 1.530
digging activity 1.023 1.009 1.037 0.006
area neighbourhood 1.033 1.033 1.033 0.007
cars in area 0.773 0.773 0.773 0.023
south NHN 0.634 0.566 0.711
Friesland 0.568 0.465 0.694
GD-Oost 1.203 1.028 1.407
rayon 0.049
Gooi en Flevoland 0.706 0.627 0.796
NHN 0.555 0.498 0.620
Veluwe 0.630 0.540 0.734
sort connection 1.981 1.859 2.112 0.165
voltage unknown 11.270 8.918 14.240 0.206
Table 6.4
Coefficients for low voltage cables
Covariate Value Factor Lower Upper Variance
1 0.431 0.203 0.915
wires 7 0.113 0.047 0.272 0.018
unknown 0.356 0.277 0.459
diameter 0.996 0.994 0.997 0.000
loam 0.175 0.044 0.702
light clay 0.853 0.769 0.946
light sand 0.816 0.734 0.906
low mineral 0.966 0.746 1.250
soil 0.014
mire 0.767 0.697 0.843
water 1.227 0.996 1.511
sand 0.766 0.713 0.823
subsidence 1.453 1.333 1.582 0.041
digging activity near main cable 0.594 0.593 0.595 0.058
87.5 15.200 8.126 28.424
height isoline 0.022
90 4.312 1.074 17.309
plastic 0.827 0.772 0.886
isolation PVC 0.851 0.776 0.934 0.026
XLPE 0.632 0.587 0.680
5 1.308 1.211 1.412
subsidence 0.020
8 1.311 1.211 1.419
length 1.010 1.010 1.011 0.374
max load 1.600 1.598 1.601 0.493
area neighbourhood 1.012 1.012 1.012 0.001
GD-Oost 0.717 0.640 0.803
Gooi en Flevoland 0.788 0.721 0.862
rayon NHN 0.721 0.654 0.796 0.033
Rijnland 0.911 0.827 1.003
Veluwe 0.536 0.491 0.585
400V 2.287 1.810 2.890
voltage 0.066
unknown 0.381 0.139 1.043
Table 6.5
Besides the coefficients, the Cox model provides the valuable baseline hazard.
Figure 6.1 shows the baseline where the values for each year were approxi-
mated separately.
One can definitely recognise a bathtub curve as in figure 2.1, but there
is an additional peak around the age of 40. Some research showed that it
is caused by a specific sort: for joints with nekaldiet as construction, only
survival data from an interval around 40 years is available. It would
therefore be better to compensate the baseline hazard for predictions on other
joints, but this is beyond the scope of this project.
The main benefit of the baseline hazard is the ability to make predictions
for the coming years. Figure 6.2 shows the expected number of failures for
the next 40 years. With the right replacement policy, Alliander can try to
prevent the increase in the number of failures.
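Such a forecast follows from the baseline hazard and the proportional factors roughly as sketched below. This is a deliberately simplified version: every asset ages in place, replacements are ignored, and ages beyond the observed baseline are dropped, which the actual computation presumably handles more carefully.

```python
def expected_failures(baseline, assets, years):
    """Expected number of failures per future year. baseline[a] is the
    baseline hazard at age a; assets is a list of (age, factor) pairs.
    Summing each asset's hazard gives the expected failure count."""
    forecast = []
    for t in range(years):
        total = 0.0
        for age, factor in assets:
            a = age + t
            if a < len(baseline):
                total += baseline[a] * factor
        forecast.append(total)
    return forecast
```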
The baseline hazard functions for the three other data sets are given but not
discussed any further.
Figure 6.2: Expected number of failed medium voltage joints for the next 40
years according to the Cox model.
Figure 6.4: Baseline hazard functions of the low voltage joints.
6.2 Random forest
After fitting Random forest, the resulting model is not immediately
interpretable: the collection of trees is unstructured and anything but compact.
However, chapter 4 mentioned a way to show the importance of variables;
the results are shown in figure 6.6.
Figure 6.7: Variable importance of Random forest on medium voltage cables.
Figure 6.9: Variable importance of Random forest on low voltage cables.
Random forest itself does not model the lifetime of an asset. It can predict
whether an asset has failed or not at this very moment, but it has no built-in
function to describe the coming years. The prediction is the average over
all the trees, which gives a number between 0 and 1 that is used as the
probability that the asset has failed. However, it is doubtful that this
number is a realistic hazard rate. Alliander has done some work to scale
the outcome of Random forest by compensating for the stratification; this
scaling process will not be discussed in this thesis.
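The averaging step can be sketched as follows; each inner list holds one tree's 0/1 votes for all assets.

```python
def forest_predict(tree_votes):
    """Average the 0/1 vote of every tree per asset. The result is a
    number between 0 and 1, used as the probability that the asset
    has failed; without rescaling it is not a calibrated hazard rate."""
    return [sum(votes) / len(votes) for votes in zip(*tree_votes)]
```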
6.3 Conclusion
Both methods have been applied to the four entire data sets (with cross-
validation). Ordering the assets by hazard rate and calculating the ROC-
curves gives the following results:
Table 6.6: Area under curve for the two methods and the four data sets with
cross validation.
Table 6.7: Area under curve for the two methods and the four data sets split
at 2015.
Numbers don’t lie: Random forest is better at pointing out the assets with
a higher risk in a relatively short period of time. However, as shown in
this chapter, it falls short when it comes to interpretation. Not only are the
absolute hazard rates of the wrong order of magnitude, but pointing out
the most influential covariates also seems difficult.
The Cox model depends on assumptions about the relation between the
covariates and the hazard rate. Presumably these assumptions do not
perfectly correspond to reality. Nevertheless, as a first-order estimate it gives a
valuable result: the hazard rates are of the right order of magnitude and the
model gives even more information. The coefficients together with the
baseline hazard function give insight into the lifetime of the assets and enable
accurate predictions for the future. For all this a price has to be paid:
fitting the Cox model takes roughly ten times as long as fitting Random
forest.
Despite the extra features of the Cox model, its predictions in the objective
test as well as in the cross-validation are worse. This indicates that in
Alliander’s situation the time-dependent features of the Cox model are not
decisive. Random forest focuses mainly on the covariates, and apparently
these are the better predictors. The Cox model might perform better in an
entirely different situation, where the lifetime has more influence on the
hazard rate than the explanatory variables. To illustrate this, the cross-
validation was performed again on the medium voltage joints, this time with
all covariates but soil removed. Indeed, the Cox model is more powerful in
this situation, with an area under the ROC-curve of 0.6484 compared to
0.5352 for Random forest.
Taking everything into consideration, the Cox model is the preferred method.
Chapter 7
Further research
Often, the longer you work on a project, the more you realise how little you
have actually done. This is very true for this thesis. Some thoughts on
further research that came up during the project are listed below.
Let us start with the baseline hazard function as part of the Cox model.
During the implementation there were scattered values at the higher ages,
and the baseline hazard function was zero for ages with missing data. Both
facts are warnings of overfitting, so some kind of smoothing is wise. The
segmented version of the baseline hazard is a first attempt; further research
could look for a better segmentation or even entirely different ways of
smoothing. Of course the likelihood based on the data is highest for the
original method of retrieving the baseline hazard function, but reducing
overfitting could actually improve the predictions.
This thesis focuses on the differences between the Cox model and Random
forest, which are compared by applying them to the same problem. Instead,
one could investigate whether combining the two methods is worthwhile: a
construction that uses the strongest components of each method could get
the best of both worlds. Some first ideas have been tested, but this is beyond
the scope of this thesis.
The final suggestion concerns the time-dependence of the Cox model. The
Cox model separates the time-dependent baseline hazard function from the
proportional factor. Classically, this factor depends only on the covariates
and not on time. The book [29] introduces time-varying covariates and even
time-varying coefficients. This could be interesting, because then the
different life cycles of, for instance, the joints would not influence the general
baseline hazard.
Chapter 8
Bibliography
[11] K. E. Atkinson, An Introduction to Numerical Analysis, John Wiley &
Sons, Inc, ISBN 0-471-62489-6, 1989.
[13] J. Borucka, Methods for handling tied events in the Cox proportional
hazard model, Studia Oeconomica Posnaniensia, vol. 2, no. 2 (263), 2014.
[14] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal
of the Royal Statistical Society, Series B (Methodological) 58, 1996.
[16] H. Zou, T. Hastie, Regularization and Variable Selection via the Elastic
Net, Journal of the Royal Statistical Society B, 67(2), 301–320, 2005.
[24] L. Breiman, Random Forests, Springer Publishing, Machine Learning
45: 5, 2001.
[25] Y. Amit, D. Geman, Shape Quantization and Recognition with Random-
ized Trees, Neural Computation. 9, 1545-1588, 1996.
[26] L. Breiman, A. Cutler, A. Liaw, M. Wiener, randomForest: Breiman and
Cutler’s Random Forests for Classification and Regression, R package,
July 10, 2015.
[27] L. Toloşi, T. Lengauer, Classification with correlated features: unrelia-
bility of feature ranking and solutions, Bioinformatics 27 (14): 1986-1994,
2011.
[28] H. Akaike, A new look at the statistical model identification, IEEE
Transactions on Automatic Control, 19 (6): 716–723, 1974.
[29] T. Martinussen, T. H. Scheike, Dynamic Regression Models for Survival
Data, Springer, 2006.