
Faculty of Science

Department of Mathematics

Estimation of Failure Rates

Student: J. Derikx (s4052552)
Supervisors: dr. E. Cator, drs. J. Heres

16 January 2017
Abstract
Alliander N.V. operates the largest electricity distribution
network in the Netherlands. The grid is mostly underground
and therefore not easily accessible. The goal of this thesis is to
approximate failure rates of the underground assets. If the
attempt is successful, the obtained information can be used
to further improve the policy of Alliander.
This booklet starts with a detailed introduction of the set-
ting in which the general problem is situated. Afterwards the
reader is prepared with the necessary survival analysis theory,
the overarching framework to make the problem mathemati-
cally precise.
The main goal is tackled by two approaches, originating from
two different fields of mathematics. The first method is the
Cox model, a proportional hazard model that comes directly
from survival analysis. It is widely approved and respected
in this context. In this thesis the baseline hazard function
corresponding to the Cox model is retrieved in a different way,
specialized to the situation at hand. A proof is included which
guarantees that this addition gives the sought-after result.
The second approach comes from the rising field of machine
learning. Random forest is a versatile algorithm that is used
in this case to predict the failure rates based on the details
that are known from the assets.
The thesis concludes with the results of both methods. For
Alliander, these results are immediately valuable. Besides the
application, more general findings are discussed in a comparison
of the two approaches.

Acknowledgements
The thesis as conclusion of the master is a struggle for many
students. Unfortunately I was no exception. Although I will
never argue that my project has great potential, I do think it
is fair to say it is a personal achievement. Most of the obsta-
cles I encountered didn’t have anything to do with the subject
of my thesis and are therefore obviously not mentioned in one
of the chapters. These pages only tell the reader something
about the electricity distribution networks, but the thesis as
a whole stands as a symbol for much more than that. Knowing
this, I realize that writing this thesis entirely by myself would
have been simply impossible, and I am very thankful for all the help I got.
First of all, I want to thank Jacco Heres. As a supervisor he
was closely involved with every little step of this thesis and he
was never tired of helping me out. It didn’t matter for what
reason I got stuck, he always looked for a way to make things
work. Eric Cator has helped me in a whole different way. Be-
sides the statistical and mathematical knowledge he shared,
it was his never-ending positivity that made the difference. I
truly believe he was the only one who could pull me through
this.
Lastly, I would like to express my appreciation towards all
friends and relatives. I am very grateful for all the support,
from the good conversations to correcting my embarrassing
spelling mistakes.

Contents

1 Introduction
1.1 Alliander
1.2 Reliability
1.3 Content of this thesis

2 Survival analysis
2.1 Introduction
2.2 Censoring
2.3 Survival models
2.4 Framework of data
2.5 Objective test

3 METHOD I: Cox model
3.1 Proportional hazards models
3.2 The Cox model
3.3 Ties
3.4 Baseline hazard
3.5 Simulation
3.6 Extensions of the Cox model

4 METHOD II: Random forest
4.1 Decision tree learning
4.2 Random forest
4.3 Random subspace method

5 Implementation
5.1 Description of the data
5.2 Application of the Cox model
5.2.1 Glmnet
5.2.2 Baseline hazard
5.3 Application of Random forest
5.4 Objective test

6 Results & Conclusions
6.1 Cox model
6.2 Random forest
6.3 Conclusion

7 Further research

8 Bibliography

Chapter 1

Introduction

The main goal of this thesis is general and widely applicable. However, the
whole reason that this problem is considered in the first place comes from
Alliander. To understand the general problem better this thesis starts with
a full description of the origin and environment of the problem.

1.1 Alliander
Alliander is a network company responsible for the distribution of energy in
the Netherlands. Alliander’s role is to operate and maintain the network for
electricity and gas to ensure the connection between energy suppliers and
consumers.
Alliander actually incorporates a group of companies, including Liander,
Liandon and Allego. Each company has a different focus but all have en-
ergy network operation as underlying activity. Distribution System Operator
(DSO) Liander keeps the energy distribution infrastructure in good condition
to ensure reliable distribution of gas and electricity to millions of consumers
and businesses every day. Liandon focuses on the development of sustainable
technologies and intelligent energy infrastructures. The company focuses pri-
marily on high voltage components. Allego is developing customised charging
solutions and infrastructure for municipalities, businesses and transport com-
panies. The latter is only one example of multiple companies which explore
and develop new business activities that fit Alliander’s strategy and role in
the transition towards a more renewable energy system.
The vast majority of the energy Alliander distributes flows in the traditional way through the network.
Figure 1.1: Service area of Alliander since January 2016, source: [1].

Power is generated at, for instance, energy plants
and wind farms and arrives at Alliander through the international and na-
tional energy networks of the Transmission System Operators (TSO) like
TenneT and Gasunie. Alliander’s network is the last step in the distribu-
tion towards the clients. However, times are changing and due to techno-
logical developments the network is not a straightforward one-way system
anymore.

Customers enjoy a lot of freedom and are able to make their own energy
choices. They can choose their own supplier and service providers from a
wide range of choices. In addition, a growing number of consumers and
businesses are feeding their self-generated energy into the network, resulting
in reversed flows. This distributed generation relies quite often on natural
sources such as wind and solar energy, which require more flexibility from
the network. On top of that, demand is rising because energy needs previously
met by fossil fuels are now met by electricity, for instance the increasing
number of electric vehicle charging points and the transition from gas
installations towards electrical heating. All these changes make the energy
supply chain more dynamic [2].
Figure 1.2: A modern energy distribution network.

It is going to be a big challenge to ensure that all energy is distributed safely,
reliably and efficiently despite all these changes in an ageing network [3].
The Netherlands is divided into different regions and all distribution system
operators are responsible for their own area. The fee operators can charge
their customers is based on their performance and reliability. According
to the annual report [1] Alliander has 7240 employees to take care of 5.7
million customer connections. Figure 1.3 shows that all these connections
together cover quite a substantial part of the Netherlands. In fact Alliander
has a share of roughly 35%. A complete and detailed description of Dutch
distribution networks can be found in [4].

1.2 Reliability

Right now life without electricity is unimaginable. Due to their growing de-
pendence on the electricity supply, customers have high expectations regard-
ing the reliability of the electricity network. Simply put: customers demand
safe and continuous access to electricity, 24 hours a day, seven days a week.
Alliander has managed to keep the number of outage minutes to a minimum.
On average a customer has no access to electricity for only 22 minutes each year.
Figure 1.3: Overview of Distribution System Operators in the Netherlands
for electricity [5].

This makes Alliander’s network one of the best worldwide. But as explained
before it is of great interest for Alliander to reduce this number even further
in order to become more reliable. Besides, customers want to pay as little as
possible for their reliable energy supply. That is why Alliander works daily
to continue improving their operational effectiveness and efficiency.
Maintenance and replacement are Alliander’s main tools to influence both
reliability and affordability. All the assets of Alliander are worth 7.7 billion
euro and yearly Alliander spends 700 million euro to maintain them. Quite
a big share of Alliander’s capital is underground (see figure 1.4). Maintain-
ing overground assets like transformers is relatively straightforward because
they can be checked visually on a regular basis. However, most of the assets
are underground and are therefore less accessible. Although placing assets
underground is the main reason why Alliander has one of the most reliable
networks in the world, these assets are responsible for more than 80 percent
of the total outage time (see figure 1.6). A substantial part of these under-
ground assets has exceeded their life expectancy according to the manufac-
turer (which is typically fifty years, as shown in figure 1.5).

Figure 1.4: Book value per component.

The expiration date is only an estimate, and from practice it is known that
assets can live much longer, mainly because the workload of many assets is
much lower than their capacity. Replacing all assets which have exceeded their expiration date
is expensive and can even increase the hazard rate since brand new assets
suffer from fabrication errors.

Figure 1.5: The distribution network is ageing. In the near future a lot of
medium voltage cables will exceed their expected lifetime.

The few facts mentioned above illustrate the complexity of maintaining one
of the largest electricity distribution grids in the Netherlands. A smart main-
tenance policy is therefore of great interest for Alliander.

There are a few reasonable actions. Replacing underground assets before
they fail will of course prevent outages. This is rather expensive but can be
practical when combined with other underground construction works. More
recent possibilities to prevent (long) outages in the underground network are
Smart Cable Guards and Remote Switches. In all these cases it is essential
to have more knowledge of which assets are likely to fail. This project will be an
attempt to assist Alliander in maintaining their grid by providing failure
rate estimations.

Figure 1.6: The average yearly number of outage minutes per customer of
Alliander, split per component, over the period 2008-2015.

1.3 Content of this thesis

The main goal of this thesis is to make predictions of future failures from
the obtained data regarding failures in underground assets and to draw con-
clusions on the most important drivers for failures. This enables Alliander
to transition from a reactive to a proactive maintenance policy. Recently
Alliander celebrated its 100th anniversary and from all these years data has
been stored. The challenge will be to effectively use as much data as possible
in order to optimally estimate the probabilities of failure. Some examples of
covariates that might be taken into account are: construction, soil, length (of
cables) and several aggregates of the yearly load profile. For older assets less
data is available, so the calculations might be less accurate. This thesis is

a continuation of the research performed by Alliander from 2014 onwards [6].
Most of the data gathering and preprocessing was done in this project.
Multiple methods will be used for the survival analysis of the assets. One can
think of proportional hazard models and machine learning algorithms. From
both formats an instance will be chosen. These algorithms will be adapted to
this specific case to optimally use their predictive power. This will generally
result in estimating the failure rates of the assets.
To compare and review the used algorithms objectively, a test will be devel-
oped. A theoretical test by calculating the error might be the obvious choice.
However, this project will examine the effect of the estimations in feasible
policies. The models will be ranked based on the effect their predictions have
in the policies. Hopefully this will be a more realistic measure.
To give a more concrete description of the project, the preceding story has
been translated into a single question.
Main question: What are the current and future failure rates of underground
cables and joints, given the available data?
This question can be split into several tasks.
• Adapt and optimize various algorithms to calculate the current and
future failure rates.
• Develop a test to objectively compare the algorithms.
• Realize code in R to compute the failure rates.
After having introduced the setting of the project in the current chapter, this
thesis will continue with the necessary theory of survival analysis in chapter
2. There, some notation will also be fixed. Next, two predictive algorithms of a
different nature will be described in chapters 3 and 4. These methods will be
shaped for this specific case in chapter 5 of which the results will be examined
in chapter 6. This thesis will end with recommendations for further research
in chapter 7.
The R language is widely used among statisticians and data miners for devel-
oping statistical software and data analysis and is therefore also the language
of choice for this thesis. Suitable packages for both proportional hazard mod-
els [7] and survival tree methods [8] are available in R.
This research project has been performed as a final examination for the
M.Sc. Mathematics at the Radboud University in Nijmegen, in cooperation
with Alliander.

Chapter 2

Survival analysis

The branch of statistics that is suitable for the main question is called survival
analysis. To understand the following brief introduction to this subject,
some basic notions of probability theory and statistics might be helpful. This
necessary background information can be obtained from [9].

2.1 Introduction

Survival analysis analyses the expected duration of time until an event happens;
in other words, the random variable of most interest is the time-to-event.
In biomedical studies this event is often death. Because of this common
application the field is termed survival analysis. Traditionally only a single
event occurs for each subject, after which the organism or object is dead or
broken. Survival analysis methods can be applied to a wide range of data
other than biomedical survival data. Other time-to-event data can include:
length of stay in a hospital, duration of a strike, time-to-employment and of
course time-to-failure for electrical components. As always in statistics the
goal is to draw conclusions from obtained data.
Usually a random variable X is characterized by its cumulative distribution
function F (x) = P(X ≤ x). It can be represented by either the probability
density function (continuous case) or the probability mass function (discrete
case). In both cases the function is denoted by f . In the context of time-to-
event data, there are other functions which describe the situation in a more
natural and intuitive way. Still, they are in one-to-one correspondence with
the distribution function and therefore characterize the distribution of the

random variable. They are especially important in the analysis of time-to-
event data.
Definition 1. The survival function S of X is defined as

S(t) = P(X > t) = 1 − F (t), (2.1)

where t is some time instance and X is a random variable denoting the time
of death. That is, the survival function is the probability that the time of
death is later than some specified time t. The survival function must be non-
increasing: S(u) ≤ S(t) if u ≥ t. This reflects the notion that survival to a
later age is only possible if all younger ages are attained. From now on time
is represented by the non-negative real numbers. In other words t ∈ [0, ∞)
and hence F (0) = 0 and S(0) = 1.
Definition 2. The hazard function λ of a finite discrete random variable X
is defined as

λ(t) = P(X = t | X ≥ t). (2.2)

The hazard function at t reflects the danger (or hazard) of getting killed at
time t, having survived till (just before) time t. In the context of engineering
the hazard rate is commonly called the failure rate, denoting the probability
density of failure at time t given the device functioned well until (just before)
time t. Extensions to infinite discrete and continuous random variables exist,
but for this project a finite discrete random variable suffices. To rule out eter-
nal life, there must be at least one t for which λ(t) = 1 holds. Furthermore,
the hazard function is non-negative by definition. Note that λ is not other-
wise constrained; it may be increasing, decreasing or non-monotonic.
For now, as in the definition of the hazard function, only the discrete case is
treated; the continuous case will follow soon. The next connection between
the survival and hazard function only holds for the finite discrete case. Now
denote by t1 < t2 < . . . < tn the support of the discrete distribution of X.
Then S can be recovered from λ via

S(ti) = ∏_{j=1}^{i} (1 − λ(tj)).    (2.3)

Note that (2.2) only applies to the discrete case. Therefore relation (2.3) between
the survival and hazard function only holds in that case. The relation arises
from the fact that to reach time ti one has to survive all preceding points in
time. In the continuous case, the following definition is used:
Definition 3. The hazard function λ of a continuous random variable X is
defined as
λ(t) = −(d/dt) log S(t).    (2.4)
Although in this case λ is not defined as a conditional probability, it still has
a similar interpretation as the discrete version. Because S(t) = P(X > t),
one could argue that

P(X = t) = lim_{∆t→0} [S(t) − S(t + ∆t)] / ∆t.    (2.5)
Hence

P(X = t | X ≥ t) = P(X = t and X ≥ t) / P(X ≥ t)
                 = P(X = t) / P(X ≥ t)
                 ≈ P(X = t) / S(t)
                 = lim_{∆t→0} [S(t) − S(t + ∆t)] / (∆t · S(t))        (2.6)
                 = −(d/dt) log S(t)
                 = λ(t).

In the continuous case the connection between the survival and hazard func-
tion is even more clear. Using the cumulative hazard function

Λ(t) = ∫_0^t λ(s) ds,    (2.7)

gives

S(t) = exp(−Λ(t)). (2.8)
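As a small numerical illustration of relations (2.3), (2.7) and (2.8), the following R sketch computes the survival function once from a discrete hazard via the product formula and once from a piecewise constant continuous hazard via the cumulative hazard; the hazard values are arbitrary and only serve as an example.

# Hypothetical hazard values on the time points 1, ..., 5
lambda <- c(0.05, 0.10, 0.20, 0.50, 1.00)   # the last value 1 rules out eternal life

# Discrete case: relation (2.3)
S_discrete <- cumprod(1 - lambda)

# Piecewise constant continuous case: cumulative hazard (2.7) and relation (2.8)
Lambda <- cumsum(lambda)
S_continuous <- exp(-Lambda)

round(S_discrete, 3)
round(S_continuous, 3)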

An example of a continuous hazard function is the bathtub curve, which is
large for small values of t, then decreases to some minimum, and thereafter
increases again. A sketch of this curve is shown in figure 2.1. It models the
properties of some mechanical systems: most failures occur either soon after
the start of operation, or much later, as the system ages.

Figure 2.1: Bathtub curve: the failure rate plotted against time, with an early
(technical) failure phase, a random failure phase and a wear-out failure phase.

2.2 Censoring
In reliability engineering there is generally a relatively short window for the
duration of any study where the event is observed to occur or not occur. For
instance, for this thesis, during a period of 10 years failures have been regis-
tered, although components may live for more than 100 years. Adjustments
have to be made to account for potential biases. If a subject does not have
an event during the observation time, they are called censored. The subject
is censored in the sense that nothing is observed or known about that subject
before or after the time of censoring. A censored subject may or may not
have an event outside the observation time. Censoring is an example of a
missing-data problem, which is common in survival analysis. There are different
kinds of censoring, of which several will now be described.
Ideally, both the birth and death dates of a subject are known, in which case
the lifetime is known.

Figure 2.2: Uncensored data.

It can very well happen that researchers lose track of certain subjects or that
some subjects are still alive when the study ends. The exact time of death is
then unknown for these particular subjects. This is called right censoring and
occurs quite frequently in reliability engineering, as knowing future hazard
rates is most interesting when the bulk of a population of subjects is still in
operation.

Figure 2.3: Right censored data.

Besides uncertainty about the time of death, it might be unclear at which
time a subject is born. Imagine you want to study a certain disease. You
would like to keep track of the time a patient is infected. In this case the
time of birth is when the patient gets infected and the time of death is when
the illness is cured. A sick patient could enter the study, not knowing exactly
when he got infected. Another instance of this type of censoring occurs when
you want to study the time it takes for a student to master a particular skill.
The time-to-event data is the time at which the learning process starts until
the student is able to use the new skill. Some students may already know
the skill before the observation time of the study. In both cases the exact
time of birth is unknown, and therefore this is called left-censoring.

Figure 2.4: Left censored data.

It may also happen that subjects are not observed or registered at all:
this is called truncation. Left censoring is different from truncation, since
for a left censored subject the existence is known. Left truncation is also
common in reliability engineering.
Luckily the amount of censored data is limited because the placement year
of the majority of Alliander’s current assets is known. Right censoring and
left truncation still have to be taken into account.
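To make this bookkeeping concrete, the sketch below shows how right censored and left truncated observations are typically encoded with the survival package in R; the ages are made up and only serve as an illustration.

library(survival)

# Hypothetical assets: age at entering the observation window, age at leaving it,
# and whether leaving was due to a failure (1) or to censoring (0).
entry <- c(0, 12, 30)   # the third asset was already 30 years old at entry (left truncation)
exit  <- c(8, 45, 41)
event <- c(1,  0,  1)   # the second asset is right censored

# The counting-process notation (start, stop, event) handles both phenomena at once.
Surv(entry, exit, event)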

2.3 Survival models
Besides information about the birth and death of an object, further measurements
might be available that may be associated with the lifetime. For
instance in Alliander’s case, this can be geographical data or information
about the workload of an asset. These measures are called covariates (also
referred to as explanatory variables and predictors). Besides using lifetime
data, survival models can incorporate these explanatory variables in their
predictions. This gives a more accurate prediction of the time that passes
before some event occurs. In this section some survival models are briefly
introduced.

Proportional hazards models

Let λ0 be the hazard function of some random variable, let β ∈ Rd be a fixed
row vector and suppose Xi ∈ Rd is a vector of covariates belonging to some
subject i ∈ I. Consider the following hazard function:

λ(t, i) = λ0 (t)c(βXi ). (2.9)

Here c : R → R>0 can be any function as long as the integral of λ(t, i) over
[0, ∞) is infinite. This model is called the multiplicative hazard model or the
proportional hazard model, as the hazard for an individual is the product of
a baseline hazard λ0 with only the time as input and a function c with the
background information as input. So the unique effect of a unit increase in
a covariate is multiplicative with respect to the hazard rate. Separating the
covariates from the time-dependent part enables a separate analysis.
This type of model is similar to linear regression and logistic regression.
Specifically, these methods assume that a single line, curve, plane, or surface
is sufficient to separate groups (alive, dead) or to estimate a quantitative
response (survival time).

Predictive Trees

In some cases alternative partitions give more accurate classification or quantitative
estimates. One set of alternative methods are tree-structured survival
models, including random forests. Examining both types of models for a
given data set is a reasonable strategy and is also the core of this thesis.

Each branch in a survival tree indicates a split on the value of a variable.
Walking through the tree from the root will narrow down the group of sub-
jects. Be aware that overfitting quickly becomes a problem when the groups
are narrowed down too much. The general idea is that predictions can be made based on
the change of ratios within a group due to a split, even before the danger of
overfitting is present. Instead of drawing conclusions from a single survival
tree a good alternative is to build many survival trees, where each tree is con-
structed using a sample of the data, and average the trees to predict survival.
This is the method underlying the survival random forest models.

2.4 Framework of data


Taking the previous sections into account, a framework will now be specified
in which survival data is usually given. It fixes the notation and makes the
rest of this thesis easier to read and understand.
There are always finitely many independent subjects under observation,
which can be cables, joints or other components. The set of these individuals
is denoted by I. Then for each individual i ∈ I, data of the form

(Ti , Si , ∆i , Xi ) (2.10)

is available. Here Ti is the age at which i enters the study, which might be
greater than zero. Si is the age at which i leaves the study. Leaving the
study could happen because the individual died, in which case ∆i = 1. If,
for any other reason, the subject drops out, ∆i = 0 and the data is censored.
Xi = (Xi,1 , . . . , Xi,d ) ∈ Rd is a vector of potential predictors referred to as
the covariates. This data is often visualised in the schematic way shown in
figure 2.5.
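In practice this framework corresponds to one row per subject. A minimal sketch of such a table in R, with made-up values for illustration only, could look as follows.

# Hypothetical survival data in the format (T_i, S_i, Delta_i, X_i) of (2.10)
survival_data <- data.frame(
  T     = c(0, 0, 15),                 # age at which the subject enters the study
  S     = c(34, 61, 48),               # age at which the subject leaves the study
  Delta = c(1, 0, 1),                  # 1 = failure observed, 0 = censored
  X1    = c(120, 85, 300),             # covariate, e.g. cable length in metres
  X2    = c("sand", "clay", "sand")    # covariate, e.g. soil type
)
survival_data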

2.5 Objective test


Proportional hazards models as well as predictive trees estimate hazard rates.
As mere numbers and estimates, these are not immediately of interest for
Alliander. However, one can argue that a higher hazard rate implies that it
is better to replace that asset rather than another asset with a lower hazard rate.
Following this philosophy, the set of assets will be ordered on hazard rate
and in that form presented to Alliander. A model which describes the reality
more accurately will be classified as a better model.

Figure 2.5: Schematic visualization of survival data.

As mentioned above, during a period of almost ten years failures have been
registered and coupled to specific assets. This data will be used to train
and test the different models. In order to test the models in an honest and
realistic way, imagine how the final result would be used. Given a good
model, Alliander would run it and predict the hazard rates of the current
assets. A ranked list of assets will be used while maintaining the network.
With this in mind the following concept has been developed (see also figure
2.6).

Suppose survival data has been generated during period [T, S]. Imagine the
situation for Alliander at time t ∈ [T, S]. Alliander would use period [T, t]
to train its model, and then predict the hazard of the assets which are in
use at time t. This is exactly what the test will do, with the advantage that
the failures in period (t, S] can be used to validate the different models. So
the models will be trained with data that was already available at time t.
This means that only failures registered in the period [T, t] will be taken into
account. The model will then be used to predict the hazard rates of each
asset that is in use at time t.

Suppose the test set consists of 10 assets, i1, . . . , i10, in order of predicted
hazard rate. Assets i1, i3 and i4 have actually failed in the test period (t, S].
This is visualized in the plot of figure 2.7.

The dashed line goes up every time it passes a failed asset. After the last
failed asset the dashed line reaches 100 percent. The final measure is obtained
as the area under the dashed line. To compensate for different test-set sizes,
the horizontal axis is also given in percentages.
Figure 2.6: Visualization of how the data is split for the objective test: the
observation window [T, S] is split at time t into a training period [T, t] and a
test period (t, S].

Figure 2.7: Calculation of the final measure: the cumulative percentage of
failed assets plotted against the assets i1, . . . , i10, ordered by predicted hazard rate.



Another model could assign other predictions to each asset, resulting in a
different order. This model performs better, according to the test, if the failed
assets are given a higher failure rate than the censored assets.
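A minimal sketch of this measure in R, assuming the assets have already been sorted by predicted hazard rate and the vector failed records which of them failed in the test period, could be:

# Hypothetical test outcome for 10 assets, ordered from highest to lowest predicted hazard.
failed <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Fraction of all failures found after inspecting the first k assets, k = 1, ..., 10.
cum_frac_failed <- cumsum(failed) / sum(failed)

# One simple discretisation of the area under the dashed line, with the horizontal
# axis expressed as a fraction of the test set; values closer to 1 mean the failed
# assets were ranked higher.
score <- mean(cum_frac_failed)
score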
The concept of the test will be repeated briefly when used in chapter 5.

Chapter 3

METHOD I: Cox model

This chapter is a theoretical description of the first proposed method to tackle
the problem formulated in section 1.3. The Cox model is a proportional hazards
model that was first described in 1972 and is still being developed today.
This thesis will use the more advanced techniques related to the Cox model.
To fully understand these techniques, first the traditional Cox model will be
introduced and presented as in the original paper from 1972. This chapter
will not show how the Cox model is eventually used on the data but rather
describe the mathematical equations that form the core of this method. The
application of the model on the data is shown in chapter 5 and the results
in chapter 6.

3.1 Proportional hazards models

In statistics, survival models relate the time that passes before some event
occurs to one or more relevant covariates. In proportional hazards models, a
class of survival models, the main assumption is that the covariates and the
hazard rate are multiplicatively related. To justify this assumption, one can
think of the effect that the choice of conductor material has on the failure
rate of an electricity cable. For instance, cables made of aluminium may
have a doubled hazard rate compared to copper cables. A typical medical
example would be the effect of a treatment on the duration of the illness, in
this context also age and gender are often considered.
Like other survival models, proportional hazards models consist of two parts:
the underlying hazard function and a part that depends on the effect pa-

rameters or covariates. The baseline hazard function, often denoted by λ0 ,
describes how the hazard rate changes over time. As mentioned earlier the
covariates are assumed to be multiplicatively related to the hazard which
gives us the following equation:

λ(t, i) = λ0 (t)c(βXi ). (3.1)

The second part is responsible for the involvement of the explanatory vari-
ables. The function c : R → R>0 can be any function as long as the integral
over t ∈ [0, ∞) of λ(t, i) stays infinite for each asset i. In other words, λ(t, i)
has to satisfy the requirements of a hazard function. Most of the time, c is
restricted to some form.

3.2 The Cox model


In the Cox model the function c is the exponential function, so the hazard
function becomes

λ(t, i) = λ0 (t)eβXi . (3.2)

Fitting the Cox model to given data essentially means determining the opti-
mal λ0 (t) and β. Once these parameters are obtained, predictions could be
made to support future decision making. So, obviously, it is desirable that
the estimated values of the parameters λ0 (t) and β give the most realistic
predictions. Numerous influences affect the accuracy of the final values of the
parameters. The real mechanisms behind the failure of assets are much more
complex than the proposed model, not all relevant data is available and the
data quality is far from ideal. The commonly used technique in situations
like this is to optimize the likelihood of the parameters based on the data.
To write down the likelihood of the data explicitly, the assumptions of the
chosen model are used and therefore the result will be a function of λ0 (t)
and β. The most likely λ0 (t) and β are those for which this probability is
maximal.
Sir David Cox observed that if the proportional hazards assumption holds
(or, is assumed to hold), then it is possible to estimate the effect parame-
ter(s) without any consideration of the hazard function [10]. This approach
to survival data is called the application of the Cox proportional hazards
model. Following this observation this thesis starts with the optimization of

β. During this process λ0 (t) is allowed to be an arbitrary function. In other
words: the method works for each baseline hazard function. Now this is a se-
vere requirement and arguably unnecessary in the sense that an assumption
of some smoothness would be reasonable.

Sir David Cox (born 15 July 1924) is a very prominent British statistician.
Queen Elizabeth II knighted him in 1985 because of his impressive scientific
contributions.

Recall from section 2.4 that survival data is of the form

(Ti , Si , ∆i , Xi ), (3.3)

for each i in the finite set I of observed objects. Object i is observed during
the interval [Ti , Si ], ∆i denotes whether the event has occurred or not and
Xi is the vector of covariates. If ∆i = 1 the event has occurred and Si is the
event time, otherwise ∆i = 0 and Si is the censoring time. For the time being
ties will be ignored to prevent a too complicated expression (see section 3.3
for an inclusion of ties in the model). So each Si is unique for every i with
∆i = 1. Nevertheless, in general more objects are alive during time Si . The
set of these objects under observation during Si will be referred to as the risk
set, in mathematical notation this is written as

Ri = {j ∈ I | Si ∈ [Tj , Sj ]}. (3.4)

Then, conditioning upon the existence of a unique event at some particular
time t, the probability that the event occurs in the subject i for which ∆i = 1
and Si = t is

Li(β) = P(asset i fails) / P(an asset in Ri fails)
      = λ0(Si) e^{βXi} / ∑_{j∈Ri} λ0(Si) e^{βXj}
      = e^{βXi} / ∑_{j∈Ri} e^{βXj}.    (3.5)

Observe that the factors of λ0 (t) cancel out. This immediately makes the
method independent of the baseline hazard function. Assuming that the
subjects are statistically independent, the joint probability of all realized
events conditioned upon the existence of events at those times is the partial
likelihood

L(β) = ∏_{i | ∆i=1} Li(β) = ∏_{i | ∆i=1} e^{βXi} / ∑_{j∈Ri} e^{βXj}.    (3.6)

Instead of maximizing this function directly, first the logarithmic function
is applied. This keeps the maxima in the same place, since the logarithm
is a strictly increasing function. This trick is widely used in statistics because the
product becomes a sum, which is easier to handle. The expression for the
log-likelihood is
l(β) = log(L(β)) = ∑_{i | ∆i=1} ( βXi − log( ∑_{j∈Ri} e^{βXj} ) ).    (3.7)

This function can be maximized over β by calculating the first and second
derivative and using the Newton-Raphson algorithm. This algorithm is de-
scribed by Kendall Atkinson [11].
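The following R sketch spells out the log partial likelihood (3.7) for a single covariate and maximizes it numerically. The data are toy values, left truncation and ties are ignored, and a general-purpose quasi-Newton optimizer is used instead of a hand-written Newton-Raphson iteration, so the sketch only illustrates the structure of the computation.

# Toy data: event/censoring times S, event indicators Delta and one covariate X
S     <- c(5, 8, 12, 14, 20, 23)
Delta <- c(1, 0,  1,  1,  0,  1)
X     <- c(2, 0,  3,  1,  0,  4)

log_partial_lik <- function(beta) {
  sum(sapply(which(Delta == 1), function(i) {
    risk_set <- which(S >= S[i])            # subjects still at risk at time S[i]
    beta * X[i] - log(sum(exp(beta * X[risk_set])))
  }))
}

# Maximize over beta (optim minimizes by default, hence fnscale = -1)
fit <- optim(par = 0, fn = log_partial_lik, method = "BFGS",
             control = list(fnscale = -1))
fit$par   # estimate of beta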

3.3 Ties
It is very likely that events will happen at the same time, especially when
time is considered as a discrete variable. When studying a flock of geese, the
researcher possibly checks in once a day, for lack of time and resources. The
event of death is then only accurate to the day, and two or more geese might have
died on the same day.
To be precise, there are two different i, j ∈ I with the event (∆i = ∆j = 1)
happening at the same time (Si = Sj). Clearly expression (3.5) is now
incorrect, but fortunately there are several ways to handle situations in which
there are ties in the time data. These methods are approximations, since
calculating the exact likelihood might be time-consuming. Breslow published
his method in 1975, which is really close to the one described above; however,
the approach of Efron [12] is more successful, especially when the number of
tied events at a time point is relatively large [13].

As usual the set of objects is finite, so suppose t1, . . . , tn are the different times
at which events occur. For a time point ti the risk set Ri = {r1, . . . , rk} is
defined slightly differently than before, and this time also an event set
Hi = {h1, . . . , hl} of the subjects with an event at time ti is used. Their
definitions are

Ri = {j ∈ I | ti ∈ [Tj, Sj]}  and  Hi = {j ∈ Ri | Sj = ti and ∆j = 1}.    (3.8)

To calculate the exact likelihood, every ordering of the tied events h1, . . . , hl
should be taken into account, because it is unclear in which order these events
actually occurred. For instance, the likelihood of the order σ = (h1, . . . , hl) is

Lσ(β) = θ_{h1} / ∑_{j∈Ri} θj × θ_{h2} / ( ∑_{j∈Ri} θj − θ_{h1} ) × . . . × θ_{hl} / ( ∑_{j∈Ri} θj − θ_{h1} − . . . − θ_{h_{l−1}} ),    (3.9)

where θi = e^{βXi}. All these likelihoods are different, because the order in which
θ_{h1}, . . . , θ_{hl} are subtracted in the denominators affects the product. This
can be changed by subtracting the average θ̄ = (θ_{h1} + . . . + θ_{hl})/l instead of
the individual θ_{hm}, which gives an approximation instead of the exact result.
Now the likelihood is the same for every order, namely

L(β) = θ_{h1} / ∑_{j∈Ri} θj × θ_{h2} / ( ∑_{j∈Ri} θj − θ̄ ) × . . . × θ_{hl} / ( ∑_{j∈Ri} θj − (l − 1)θ̄ )    (3.10)

     = ∏_{j∈Hi} θj / ∏_{m=0}^{l−1} ( ∑_{j∈Ri} θj − m θ̄ ).    (3.11)

Taking the product of all these factors over all the event times ti gives the
final likelihood. Now the same steps as above can be followed: apply the
logarithmic function to obtain an expression for the log-likelihood l(β), which
can subsequently be optimized with the Newton-Raphson algorithm. Notice
that when there are no ties at a certain time point (l = 1), (3.10) reduces to
the corresponding factor of the previous section.
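In practice the approximation does not have to be implemented by hand: the coxph function of the survival package lets the user choose the tie-handling method, with Efron's approximation as the default. A small sketch, assuming the Gehan data frame from the example below has been loaded:

library(survival)

# Efron's approximation (the default) and Breslow's approximation for tied event times
coxph(Surv(weeks, event) ~ treated, data = gehan, ties = "efron")
coxph(Surv(weeks, event) ~ treated, data = gehan, ties = "breslow")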

Example: Gehan data

To make the above easier to comprehend, a classic example is discussed in
this section. Table 3.1 shows the data set that was used as an example in Cox’s
original paper [10]. The data gives the length of remission in weeks for two
groups of leukemia patients. The first group was treated with an actual drug
and the control group was treated with a placebo. Thus patient 2 was treated
but died after 6 weeks, and patient 9 survived for 11 weeks and then left the
study for some other reason than death. The last column shows that the
death of all patients in the control group was registered and none of them
survived for longer than 23 weeks.

Treated Control
weeks event weeks event
1. 6 1 22. 1 1
2. 6 1 23. 1 1
3. 6 1 24. 2 1
4. 6 0 25. 2 1
5. 7 1 26. 3 1
6. 9 0 27. 4 1
7. 10 1 28. 4 1
8. 10 0 29. 5 1
9. 11 0 30. 5 1
10. 13 1 31. 8 1
11. 16 1 32. 8 1
12. 17 0 33. 8 1
13. 19 0 34. 8 1
14. 20 0 35. 11 1
15. 22 1 36. 11 1
16. 23 1 37. 12 1
17. 25 0 38. 12 1
18. 32 0 39. 15 1
19. 32 0 40. 17 1
20. 34 0 41. 22 1
21. 35 0 42. 23 1

Table 3.1: Gehan data

The data set clarifies several of the concepts discussed earlier. Patient 9 is
an example of (right) censored data, as the exact time of death is unknown,
but the fact that the patient survived for at least 11 weeks will be used

nonetheless. Actually, the exact time of death of all the patients is unknown.
For some reason the survival time is measured in weeks. Hence tied survival
times are no exception, as can easily be concluded from table 3.1. The set is
rather small and has only a single covariate, but is nevertheless suitable for
the first real Cox regression of this thesis.

The data set will be analysed to obtain the coefficient of the covariate
“treated”. This uses the algorithm which has been described in this sec-
tion. Because the data is already in a usable format the R code for fitting
the Cox model is pretty compact.

> library(survival)
>
> gehan <- read.table(
+     "http://data.princeton.edu/wws509/datasets/gehan.dat")
> names(gehan) <- c("treated", "weeks", "event")
>
> coxph(Surv(weeks, event) ~ treated, data = gehan)

          coef exp(coef) se(coef)     z       p
treated -1.572     0.208    0.412 -3.81 0.00014

Likelihood ratio test=16.4  on 1 df, p=5.26e-05
n= 42, number of events= 30

The column marked z in the output is a Wald statistic, which is an indicator
of the accuracy of the estimator. Under the null hypothesis this statistic has
a standard normal distribution, and the probability of observing a value at
least as extreme as z = −3.81 (the p-value) is 0.00014. This is enough
evidence to reject the null hypothesis.
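For completeness, this p-value can be reproduced from the standard normal distribution with a one-line, two-sided calculation:

# Two-sided p-value of the Wald statistic z = -3.81
2 * pnorm(-abs(-3.81))   # approximately 0.00014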

Given the amount of data, this is a reasonable result. The value for β (since
there is only one covariate, β contains only one value) is −1.572, which
means the hazard function is multiplied by e^{−1.572} ≈ 0.208. Recall
that this doesn’t mean that treatment gives you an 80% higher chance
of overcoming leukemia. It means that at any given time the probability that
you will die is only 20 percent of what it would have been without
treatment. This shows that the treatment is pretty effective. The likelihood-
ratio test at the bottom of the output is a test of the null hypothesis that
β = 0. In this instance the null hypothesis is soundly rejected.

Even more cogent than the coefficient in the output is a display of how the
estimated survival depends upon treatment, because the principal purpose
of the study was to assess the impact of treatment on survival.

Figure 3.1: Estimated survival functions for those receiving treatment and
the control group in the Gehan data. Each estimate is accompanied by a
point-wise 95-percent confidence envelope.
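A figure like 3.1 can be produced directly from the fitted model; the sketch below uses survfit on the coxph fit, although the exact plotting options behind the original figure are not documented here.

fit <- coxph(Surv(weeks, event) ~ treated, data = gehan)

# Estimated survival curves for the two treatment groups, with point-wise
# 95 percent confidence intervals.
grp <- sort(unique(gehan$treated))
newdata <- data.frame(treated = grp)
plot(survfit(fit, newdata = newdata), conf.int = TRUE, lty = seq_along(grp),
     xlab = "weeks", ylab = "estimated survival probability")
legend("topright", legend = paste("treated =", grp), lty = seq_along(grp))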

3.4 Baseline hazard

Now that the first steps have been made, it is time to investigate the baseline
hazard function λ0(t). Once again the maximum likelihood method will be
exploited. This time, though, the work that has already been done is used:
the likelihood expression will be based on the value of β estimated just now,
the data and the still unknown λ0(t).
As in the previous example, and in all following cases in this thesis, time will
be treated as a discrete variable. To simplify, 100 = {1, 2, . . . , 100} is chosen
as the domain of λ0 . When t = 3, λ0 (t) says something about the probability
that an asset survives its 3rd year of life. There is not a very strong reason
to bound the domain at 100 other than some field experience, since there
are not yet assets older than 100 years. Making the domain explicit will also
make the description easier to understand and it will be straightforward to
adapt the procedure to cases where a different domain is desired.
So the domain of λ0 consists of the natural numbers from 1 to 100. As a result
the baseline hazard function will be piecewise constant. Even though this is
far from realistic, this practical form still gives satisfactory predictions.

Figure 3.2: Piecewise constant baseline hazard function with the still unknown
values λ1, λ2, . . . , λ100.

The main goal of this section is to estimate the theoretical values of the
baseline hazard function, denoted by λ0^(0). Therefore a likelihood function will
be optimized over the values of λ0; the maximum of this likelihood function is
attained in λ̂0. For each j ∈ 100, λ0(j) is abbreviated to λj, and for λ0^(0)
and λ̂0 the similar notations λj^(0) and λ̂j are used.

The survival data D = (Ti , Si , ∆i , Xi )i∈I has the usual format as described
in section 2.4. The collection of assets I is finite and its cardinality is N .
Using Ti and Si , the year of birth Bi and (a bound for) the age of death Di
can be calculated. Since Di is the age at which asset i fails, if ∆i = 1 then
Di = Si and Di > Si otherwise. So sometimes Di is known exactly, but in
all cases the bound suffices. The distribution of this data is left unspecified.
At the end of this section it will be clear that the results hold independently
of the distribution of the data.

The likelihood considered below is a product of factors, each of which expresses
whether an asset i ∈ I did or did not survive its jth year of
life given that it reached that year. The corresponding probability equals

P[Di > j | Di ≥ j]. (3.12)

The hazard function of each asset i is based on its covariates, therefore this
also holds for its survival function Si and the cumulative hazard function Λi .
Using these, the probability can be rewritten in the following way.

P[Di > j | Di ≥ j] = e^{−(Λi(j+1) − Λi(j))}
                   = e^{−∫_j^{j+1} λi(t) dt}        (3.13)
                   = e^{−λj e^{βXi}}.

The last step uses that the baseline hazard function is piecewise constant on
each interval [j, j + 1). The entire likelihood is built from expressions of this
form, going over all different assets and the years they have survived in the
observation window. Besides this, the following three indicator functions are
used,

1^S_j(i) = 1 if Di > j (i.e. asset i survives age j), and 0 otherwise,
1^D_j(i) = 1 if Di = j (i.e. asset i dies at age j), and 0 otherwise,        (3.14)
1^R_j(i) = 1 if j ∈ [Ti, Si] (i.e. asset i is under observation at age j), and 0 otherwise.

The actual bounds of the observation window are left unspecified because the
results of this are independent of these bounds. These indicators are easily
calculated when the data is available. The expression for the likelihood of
the data in the observation window is then

L(λ0, D) = ∏_{j∈100} ∏_{i∈I} (e^{−λj e^{βXi}})^{1^S_j(i)·1^R_j(i)} · (1 − e^{−λj e^{βXi}})^{1^D_j(i)·1^R_j(i)}.    (3.15)

Now the factor of each j only depends on λj . This enables us to optimize over
each λj separately because the likelihood is a product of all strictly positive
factors. For the actual optimization the power and speed of a computer is
used.
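A sketch of this one-dimensional optimization in R, for a single year j and with toy inputs, could look as follows; eta stands for the linear predictors βXi, and the two indicator products appearing in (3.15) are assumed to be precomputed.

# Negative logarithm of the likelihood factor (3.15) for one fixed year j
neg_log_lik_j <- function(lambda_j, eta, surv, died) {
  p_survive <- exp(-lambda_j * exp(eta))          # survival probability, cf. (3.13)
  -sum(surv * log(p_survive) + died * log(1 - p_survive))
}

# Toy inputs: surv = 1^S_j(i) * 1^R_j(i), died = 1^D_j(i) * 1^R_j(i)
eta  <- c(0.2, -0.1, 0.4, 0.0)
surv <- c(1, 1, 0, 1)
died <- c(0, 0, 1, 0)

# Optimize lambda_j over a bounded interval
optimize(neg_log_lik_j, interval = c(1e-8, 10),
         eta = eta, surv = surv, died = died)$minimum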
Although the maximum likelihood method is a respected approach, this is not yet a
completely satisfactory reason to use this specific procedure. The next theorem
states that the procedure is consistent: for an increasing number of observations
the result will converge to the desired value λj^(0).

Theorem 4 (Consistency). Let D = (Ti, Si, ∆i, Xi)_{i∈I} be Cox survival data
with N = |I|. The value of λj for which Lj(λj, D) is maximal converges to
λj^(0) as N goes to infinity.

Proof. As stated before, the likelihood of each j ∈ 100 can be considered
separately. This likelihood is a function of the data D and, more importantly,
of λj. By increasing the number of observations the likelihood will converge
towards the corresponding expected value. At the end of this proof it is
shown that the optimum of this expected value is attained in λj^(0). Note that

Lj(λj, D) = ∏_{i∈I} (e^{−λj e^{βXi}})^{1^S_j(i)·1^R_j(i)} · (1 − e^{−λj e^{βXi}})^{1^D_j(i)·1^R_j(i)}.    (3.16)
i∈I

As always, the logarithm of the expression is easier to examine. This will
not change the outcome of the optimization since the logarithmic function is
strictly monotonic. This gives

log(Lj(λj, D)) = ∑_{i∈I} [ 1^S_j(i)·1^R_j(i)·(−λj e^{βXi}) + 1^D_j(i)·1^R_j(i)·log(1 − e^{−λj e^{βXi}}) ].    (3.17)
For the same reason it is allowed to divide the function by N , which will turn
the expression into an average over the set of observations I. This leads to

log(Lj(λj, D))/N = (1/N) ∑_{i∈I} [ 1^S_j(i)·1^R_j(i)·(−λj e^{βXi}) + 1^D_j(i)·1^R_j(i)·log(1 − e^{−λj e^{βXi}}) ].    (3.18)
So if N tends to infinity the law of large numbers can be applied. According
to that law, the average over a large number of trials should be close to the
expected value, and keeps getting closer as more trials are performed. In
mathematical notation this looks like

lim_{N→∞} ∑_{n=1}^{N} yn/N = E[Y],    (3.19)

where all yn are trials from the random variable Y .


In the average log-likelihood (3.18), all the separate values Xi are now re-
placed by the random variable X, the distribution of the covariates. Similarly
B and D are introduced for Bi and Di which occur in the indicator functions.
This makes the indicator functions independent of i ∈ I and therefore they

are written without i as a variable, for example 1^S_j instead of 1^S_j(i). The result
looks like
E[ 1^S_j · 1^R_j · (−λj e^{βX}) + 1^D_j · 1^R_j · log(1 − e^{−λj e^{βX}}) ].    (3.20)

The goal is to optimize this expected value over the variable λj . First, some
rewriting is necessary. By the law of total expectation

   
E[Y1] = E[ E[Y1 | Y2] ]    (3.21)

another expected value can be introduced. This turns (3.20) into


E[ E[ 1^S_j · 1^R_j · (−λj e^{βX}) + 1^D_j · 1^R_j · log(1 − e^{−λj e^{βX}}) | X, B ] ].    (3.22)

Because the expected value is conditioned on X and B, it distributes over
several factors. For instance, given X and B, 1^R_j is just a constant and can
therefore be taken out of the inner expected value. The result is

E[ 1^R_j · (−λj e^{βX}) · E[1^S_j | X, B] + 1^R_j · log(1 − e^{−λj e^{βX}}) · E[1^D_j | X, B] ].    (3.23)

The conditioned expected values that remain can be calculated explicitly. 1^S_j
means that an asset has survived for at least j years. The probability of
this can be written as the product of j terms which have the form of (3.13).
Similarly, 1^D_j implies that an asset survived for j − 1 years and then died in
year j. The resulting product is similar except for the last term. This gives

E[1^S_j | X, B] = ∏_{k=1}^{j} e^{−λk^(0) e^{βX}},
E[1^D_j | X, B] = ( ∏_{k=1}^{j−1} e^{−λk^(0) e^{βX}} ) · (1 − e^{−λj^(0) e^{βX}}).    (3.24)

Inserting these equations in (3.23) gives the following result:

E[ 1^R_j · (−λj e^{βX}) · ( ∏_{k=1}^{j} e^{−λk^(0) e^{βX}} )
   + 1^R_j · log(1 − e^{−λj e^{βX}}) · ( ∏_{k=1}^{j−1} e^{−λk^(0) e^{βX}} ) · (1 − e^{−λj^(0) e^{βX}}) ].    (3.25)

In this expression, 1^R_j is left untouched because expanding it would only make
the expression more complicated. Some rewriting gives the following clearer
product:

E[ 1^R_j · ( ∏_{k=1}^{j−1} e^{−λk^(0) e^{βX}} ) · ( −λj e^{βX} · e^{−λj^(0) e^{βX}} + log(1 − e^{−λj e^{βX}}) · (1 − e^{−λj^(0) e^{βX}}) ) ].    (3.26)
The first factors do not depend on λj. Still, it is not immediately clear that
a constraint on λj can be derived from the last factor, since it also depends
on X. Nevertheless this factor, denoted by l(λj, X), is examined next:

l(λj, X) = −λj e^{βX} · e^{−λj^(0) e^{βX}} + log(1 − e^{−λj e^{βX}}) · (1 − e^{−λj^(0) e^{βX}}).    (3.27)

Now l(λj, X) goes to −∞ when taking the limit of λj to either 0 or ∞. So the
positive value of λj where l(λj, X) is maximal can be found by taking
the derivative, which is

dl(λj, X)/dλj = −e^{βX} · e^{−λj^(0) e^{βX}} + [ e^{−λj e^{βX}} · e^{βX} / (1 − e^{−λj e^{βX}}) ] · (1 − e^{−λj^(0) e^{βX}}).    (3.28)

The derivative is equated to zero. This will give the λj for which the original
likelihood is optimal. This value is denoted by λ̂j to differentiate it from the
underlying variable.

0 = −e^{βX} · e^{−λj^(0) e^{βX}} + [ e^{−λ̂j e^{βX}} · e^{βX} / (1 − e^{−λ̂j e^{βX}}) ] · (1 − e^{−λj^(0) e^{βX}}).    (3.29)

Straightforward calculus will give the final result.

e^{βX} · e^{−λj^(0) e^{βX}} · (1 − e^{−λ̂j e^{βX}}) = e^{−λ̂j e^{βX}} · e^{βX} · (1 − e^{−λj^(0) e^{βX}})
e^{−λj^(0) e^{βX}} − e^{−(λj^(0) + λ̂j) e^{βX}} = e^{−λ̂j e^{βX}} − e^{−(λ̂j + λj^(0)) e^{βX}}        (3.30)
e^{−λj^(0) e^{βX}} = e^{−λ̂j e^{βX}}
λj^(0) = λ̂j

This shows that l(λj, X) ≤ l(λj^(0), X) for every possible value of λj. This
result holds for all values of X. To summarize, the adjusted likelihood will,
by the law of large numbers, converge to its expected value, and this expected
value attains its maximum at λ̂j = λj^(0).

3.5 Simulation
After all the theoretical work, it would be good to see the algorithm at work.
Instead of discussing another example, this time a simulation is executed.
The data in an example is obtained from the real world, while data in a
simulation is artificial. One does not suffer from low data quality. The data
will be generated by precisely following the model with parameters that are
chosen beforehand. One can see a simulation as an example from the perfect
world. In the perfect world, the method of obtaining β and λ0 (t) should work
extremely well. Unexpected results from a simulation can help to point out
certain biases.
Imagine walking in a large zoo, wherein monkeys are living. Assume that
there is no zoo keeper who can remember another zoo keeper who in turn
can remember a time when there were no monkeys in the zoo. For some
reason each monkey gets the same amount of bananas every single day. Some
monkeys get 2 bananas a day, others eat 5 bananas a day and there are even
monkeys who don’t get any food at all. These apes are observed for 10 days
to determine the effect of the amount of bananas they get fed. Besides, the
survival curve in general is of interest. To create this imaginary troop of
monkeys, set β = 0.09 (there is just one covariate) and use the baseline
hazard function from figure 3.3.

Figure 3.3: Baseline hazard function of the simulation.

Monkeys are generated under the assumption that the Cox model describes
the hazard rates of these monkeys perfectly. First, the number of bananas a
monkey gets is randomly chosen, and then the number of days it survives is
calculated; this is a random process. Finally, it has to be taken into account
that not all monkeys are alive during the 10 days of the observation interval.
If a monkey is lucky enough to be alive during the observation time, it is
registered. This process runs until n ∈ N monkeys are generated.
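A compact version of this simulation in R could look like the sketch below. The baseline hazard values, the placement of the observation window and the distribution of the number of bananas are placeholders, since figure 3.3 is not tabulated here.

set.seed(1)

beta    <- 0.09
lambda0 <- rep(0.05, 100)      # placeholder for the baseline hazard of figure 3.3
window  <- c(200, 210)         # hypothetical 10-day observation window (calendar days)

simulate_monkey <- function() {
  bananas <- sample(0:5, 1)                         # covariate: bananas per day
  birth   <- sample(0:205, 1)                       # birth day, possibly long ago
  p_surv  <- exp(-lambda0 * exp(beta * bananas))    # daily survival probabilities, cf. (3.13)
  age_death <- which(runif(100) > p_surv)[1]
  if (is.na(age_death)) age_death <- 100
  death_day <- birth + age_death
  if (death_day <= window[1]) return(NULL)          # died before the window: not registered
  c(bananas = bananas,
    T     = max(window[1] - birth, 0),              # age at entering the study
    S     = min(death_day, window[2]) - birth,      # age at leaving the study
    Delta = as.numeric(death_day <= window[2]))     # 1 = death observed, 0 = censored
}

monkeys <- replicate(1000, simulate_monkey(), simplify = FALSE)
monkeys <- do.call(rbind, Filter(Negate(is.null), monkeys))
head(monkeys)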
Once the data is generated, the survival analysis is done as described in
this chapter. The simulation has been run with n = 1000, 10000 and 100000. The
results are shown in table 3.2 and figure 3.4.

n    1000         10000       100000
β    0.09578496   0.106472    0.09247071

Table 3.2: Estimates of the coefficient β.

Figure 3.4: Estimates of the baseline hazard function.

It comes as no surprise that the approximations get better when the number
of apes rises. For the same reason the estimates are much more accurate for
the first 40 days compared to the days after that period. The figure also shows
that the algorithm sets the failure rate to 0 when no data is available. This
is far from ideal and therefore some smoothing methods will be considered
later.
The example may look like a high school exercise, but it is a reliable simulation.
Further simulations have been performed with realistic hazard rates
and hazard ratios. Also in a situation analogous to the case of Alliander the
simulation gives satisfactory results.

3.6 Extensions of the Cox model
For classical problems, where the number of observations is much larger than
the number of predictors, the Cox model performs well. However, every
day more and more data becomes available; this has been described as the big
data phenomenon. Statistical problems have grown along with it and have
drastically increased in size. Sometimes the number of covariates is even bigger
than the number of observations. The Cox model doesn’t perform that well
in these situations.
Robert Tibshirani introduced lasso in order to improve the prediction accu-
racy and interpretability of regression models. He did this by altering the
model fitting process to select only a subset of the provided covariates in the
final model rather than using all of them [14]. This provides a solution that is
well-defined, and has fewer nonzero βi as well. Even in the classical situation
this may better estimate β than the original Cox model. Throughout the
years the method has been developed for fast convergence [15].
More recently, Zou and Hastie [16] proposed the elastic net for linear re-
gression. This is a combination of lasso and the earlier available method
called ridge regression. They hoped to get the best of both worlds. Park and
Hastie [17] applied this to the Cox model and proposed an efficient algorithm
to solve this problem. Their algorithm exploits the near-piecewise linearity
of the coefficients to approximate the solution at different constraints, and
then numerically maximizes the likelihood for each constraint via a Newton-
Raphson iteration initialized at the approximation. Goeman [18] developed
a hybrid algorithm, combining gradient descent [19] and Newton-Raphson.
Friedman et al. [20] instead employ cyclical coordinate descent to develop a
fast algorithm to fit the Cox model with elastic net penalties. The last at-
tempt has resulted in glmnet, an R package which will be used for the results
in this thesis. In the rest of this section the notions of these extensions of
the Cox model will be discussed.
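As a preview of the implementation in chapter 5, fitting such a penalized Cox model with glmnet takes roughly the following form. The covariate matrix x and the vectors time and status are placeholders here, and the value alpha = 0.5 is just an example of mixing the two penalties.

library(glmnet)
library(survival)

# x: numeric covariate matrix (one row per asset); time and status describe the outcome.
y <- Surv(time, status)   # recent glmnet versions also accept a two-column matrix
                          # with columns named time and status

# Elastic net penalized Cox regression; alpha = 1 gives lasso, alpha = 0 gives ridge.
fit <- glmnet(x, y, family = "cox", alpha = 0.5)

# Cross-validation to choose the overall penalty strength (gamma in (3.33),
# called lambda in glmnet).
cvfit <- cv.glmnet(x, y, family = "cox", alpha = 0.5)
coef(cvfit, s = "lambda.min")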

Lasso

When solving optimization problems that do not have a unique solution in a
continuous neighbourhood, restricting the solution space can be effective.
The method called ridge regression is commonly used in fitting
problems. It is often used to decrease the (Euclidean) norm of the solution,
but may also be used to enforce smoothness. Adding ridge regression to Cox
regression gives

42
Lr(β) = L(β) + ‖β‖².    (3.31)

Here L(β) is the negative logarithm of the likelihood of β. As before, the
minima of this function are of interest. The added term ‖β‖² acts as a penalty:
ridge regression makes sure that coefficients can only increase if there is a
substantial improvement of the likelihood. This improves the prediction error
by shrinking large regression coefficients in order to reduce overfitting, but
it does not set any of them to zero, because the square of a small coefficient
becomes negligible. So although ridge regression improves the prediction error,
it does not perform covariate selection and therefore does not help to make
the model more interpretable.
To achieve both of these goals Robert Tibshirani introduced lasso in 1996.
Lasso (least absolute shrinkage and selection operator) forces the sum of the
absolute values of the regression coefficients, rather than the sum of their
squares, to be relatively small. This forces certain coefficients to be set
to zero, effectively choosing a simpler model that does not include those
coefficients.
Selecting coefficients simplifies models and makes them easier to interpret,
shortens the computation time and reduces overfitting. Data may contain
many features that are either redundant or irrelevant; lasso removes both
without incurring much loss of information. Selecting only a single covariate
from a correlated group will, however, typically result in an increased prediction
error, since the model is less robust. This is the main reason why ridge regression
sometimes outperforms lasso. After adding the lasso penalty, the expression
to be minimized is

Ll(β) = L(β) + ‖β‖.    (3.32)

So in addition to the regularization property of ridge regression, lasso can


select coefficients. Lasso can set coefficients to zero, while ridge regression,
which appears superficially similar, cannot. This is due to the difference in
the shape of the constraint boundaries in the two cases.
From figure 3.5, one can see that the constraint region defined by the lasso
norm is a square rotated so that its corners lie on the axes (in general a
cross-polytope), while the region defined by the ridge regression is a circle
(in general an n-sphere), which is rotationally invariant and, therefore, has
no corners. As seen in the figure, a convex object that lies tangent to the
boundary, such as the line shown, is likely to encounter a corner (or in higher

43
Figure 3.5: Geometric interpretation of minimizing the ridge (Lr-norm) and the lasso (Ll-norm) penalty.

dimensions an edge or higher-dimensional equivalent) of a hypercube, for


which some components of β are identically zero. In the case of an n-sphere
on the other hand, the points on the boundary, for which some of the com-
ponents of β are zero, are not distinguished from the others and the convex
object is no more likely to contact a point at which some components of β
are zero than one for which none of them are.

Elastic net

As mentioned in the introduction of section 3.6, lasso and ridge regression can
be combined into what is called an elastic net, for which the corresponding
expression for the likelihood is

Le(β) = L(β) + γ[(1 − α)‖β‖² + α‖β‖].    (3.33)

The elastic-net penalty is controlled by α, and bridges the gap between lasso
(α = 1) and ridge (α = 0). The tuning parameter γ controls the overall
strength of the penalty. It is known that the ridge penalty shrinks the co-
efficients of correlated predictors towards each other, while the lasso tends
to pick one of them and discards the others. The elastic-net penalty mixes
these two; if predictors are correlated in groups, an elastic net with α = 0.5
tends to select the groups in or out together. The elastic-net penalty also
achieves numerical stability; for example, if α = 1 − ε for some small ε > 0,
the elastic net performs much like the lasso, but removes any degeneracies
and wild behaviour caused by extreme correlations, as ridge regression would
do.

44
The result is that highly correlated covariates will tend to have similar re-
gression coefficients which is very different from lasso. This phenomenon is
referred to as the grouping effect and is generally considered desirable. In
many applications one would like to find all the associated covariates, rather
than selecting only one from each set of strongly correlated covariates.

Glmnet

Glmnet is an R package written by Jerome Friedman, Trevor Hastie, Rob


Tibshirani and Noah Simon. It fits a Cox model via maximum likelihood
with an elastic net penalty. The algorithm is extremely fast, and can exploit
sparsity in the data. A variety of predictions can be made from the fitted
models. Although this thesis will only use the Cox model, the package con-
tains more regression models. For the Cox regression k-fold cross-validation
is available, although it cannot handle interval survival data. That is why for
this project another package is used in addition, to eventually fit the Cox
model.
The glmnet algorithms use cyclical coordinate descent, which successively
optimizes the objective function over each parameter with others fixed, and
cycles repeatedly until convergence. The package also makes use of the strong
rules for efficient restriction of the active set. Due to highly efficient updates
and techniques such as warm starts, active-set convergence and a core of
Fortran subroutines, the algorithms can compute the solution path very fast.
For more details see reference [21].
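As an illustration of the interface (a sketch on simulated data, not the thesis code), fitting a penalized Cox model with glmnet only requires a covariate matrix and a two-column survival response; α controls the mix between lasso and ridge.

> library(glmnet)
>
> set.seed(1)
> n <- 500; p <- 20
> x <- matrix(rnorm(n * p), ncol = p)              # hypothetical covariate matrix
> time <- rexp(n, rate = exp(0.3 * x[, 1]))        # only the first covariate matters
> status <- rbinom(n, 1, 0.7)                      # 1 = failure, 0 = censored
> y <- cbind(time = time, status = status)         # survival response for glmnet
>
> # Elastic net Cox regression: alpha = 1 is the lasso, alpha = 0 is ridge
> cv <- cv.glmnet(x, y, family = "cox", alpha = 0.5)
> coef(cv, s = "lambda.1se")                       # sparse vector of selected coefficients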

45
Chapter 4

METHOD II: Random forest

Machine learning is a subfield of computer science which studies algorithms


that can learn from and make predictions on data without being explicitly
programmed. Such algorithms build a model from example inputs in order
to make predictions on new examples, rather than following strictly static
program instructions. This method fits the problem of this thesis, formulated
in section 1.3, very well. Because this approach is completely different from
the Cox model, this chapter will start with an introduction into this field.
It will work towards a detailed description of Random forest, the machine
learning algorithm that will be used in this thesis.

4.1 Decision tree learning

A decision tree is a structure that repeatedly splits data into smaller subsets
based on the values of covariates. The process stops when the subsets become
too small, and a conclusion is then drawn from the remaining subset. The
structure can be visualized as a tree such as the example in figure 4.1.
This decision tree can be used for predicting whether a first year math student
will complete their study. Given the covariates of an individual, the tree
narrows down the actual probability of survival. Each node corresponds to
a certain covariate and each of its children to the values that this covariate
can take. A leaf represents the conclusion, the predicted value of the target
variable. It is straightforward to understand and interpret a decision tree.
This transparency allows us not only to use the predictions but also to observe
the decision rules that are created.

47
Figure 4.1: Decision tree giving the percentage of first year math students that don't drop out. The nodes split on "age > 20", "highschool average > 7", "gender is male" and "oldest child"; the leaves contain the predicted percentages.

Trees in which all the covariates and in particular the target variable take
values from a finite set are called classification trees. But decision trees are
able to predict both numerical and categorical data and are not specialised
in analysing data sets that have only one type of variable. For a problem
such as the one in this thesis, this flexibility is useful. Trees in which the
target variable can take continuous values (typically real numbers) are called
regression trees.
Decision trees are a popular method for machine learning tasks because they
require little data preparation. They serve as an off-the-shelf procedure even
for incomplete or unnormalized data. On top of that, decision trees perform
well with large data sets: large amounts of data can be analysed using
standard computing resources in reasonable time.
The beauty of decision trees lies not in the interpretation but in the construc-
tion. As ‘machine learning’ suggests, a computer generates the tree based on
available data. A tree starts by splitting the source set into subsets based
on the values of a covariate. This process is recursively repeated on each
derived subset. The recursion is completed when splitting no longer changes
the distribution of the target variable, generally because the subset has
a uniform value for the target variable. The splitting also stops when the
subset has become too small.
The covariate and the conditions on the value at each split are carefully
selected in order to get the best split for that moment. By doing so the
algorithm keeps its focus on the most relevant variables during the construc-
tion and is thereby robust to inclusion of irrelevant features. Here “best”
is objectively measured with some metric; this is explained in the next sec-

48
tion. Even though this strategy greedily selects the best split at each step
in the construction, this does not mean the result is the optimal decision
tree. It turns out that the problem of learning an optimal decision tree is
NP-complete. Nevertheless, the local optimum that the greedy algorithm
finds is a practical solution.
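To make the construction concrete, the following sketch grows a single classification tree with the rpart package on the kyphosis data set that ships with it (unrelated to the thesis data); the printed output lists the greedily chosen decision rules.

> library(rpart)
>
> # Classification tree on the kyphosis data set that ships with rpart
> tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
+               method = "class", control = rpart.control(minsplit = 20))
> print(tree)              # the decision rules chosen at each node
> plot(tree); text(tree)   # a rough visualization in the spirit of figure 4.1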

Selecting splits

Algorithms for constructing decision trees usually work top-down, by choosing
at each step the variable that best splits the set of items. So far, the
metric that defines the quality of a split has not been specified. There are
several such metrics, and all of them measure a split by examining the created
subsets. They assign a value to each subset based on the homogeneity of the
target value; the final result is the average of these values over all subsets.
Three common metrics are listed below, followed by a small R sketch that
illustrates the first two.
1. Gini impurity
The first metric measures how successful the split would be as final node
by calculating the percentage of individuals which would be assigned
the wrong value for their target variable. Let I be the set of individuals
that would reach the node that is about to split. Assume some split
creates n ∈ N subsets, say I = I1 ∪ I2 ∪ . . . ∪ In. For each subset Ik
and each value c ∈ C of the target variable, let fc(Ik) be the fraction of
individuals in Ik with target value c. The probability that a subject
from subset Ik with target value c is assigned the wrong value then
is 1 − fc(Ik). Summing over all the subjects gives the final measure:

MG(f) = Σ_{k=1}^{n} Σ_{c∈C} fc(Ik)(1 − fc(Ik)).

The purer each subset is, the closer this number gets to zero. Minimizing
it therefore favours the split that best separates the values of a categorical
target variable.
2. Information gain
Another way of classifying a split can be to look at the difference of
information gain. Information theory studies, among other things, the
quantification of information. The entropy or information gain H of a
random variable X is defined as minus the expectation of the logarithm
of the probability distribution, i.e.

49
H(X) = E[− log2 (F (X))].

This was originally done by Claude E. Shannon, the founder of infor-


mation theory. This function is derived from the fact that it takes
log2 (n) bits to represent a variable that can take one of n values. If
these values are equally probable, the entropy is equal to the number
of bits. In most cases not all values have the same probability. This
changes the entropy because a variable contains less information if the
outcome is more certain. Therefore, the definition takes the probability
distribution into account. Again taking the average over all the indi-
viduals in I, the information gain of a split with ratios fc (Ik ) as before,
is

ME(f) = − Σ_{k=1}^{n} Σ_{c∈C} fc(Ik) log2(fc(Ik)).

Although information gain is usually a good measure for deciding the


relevance of an attribute, it has some drawbacks. A notable problem
occurs when information gain is applied to covariates that have a large
number of distinct values. Suppose, for example, that one is building a
decision tree for data describing customers. One of the input attributes
might be the customer's credit card number. This attribute has a high
mutual information, because it uniquely identifies each customer, but it is
unlikely to generalize to new customers.
3. Variance reduction
Finally one could try to reduce the variance of the subsets. This is
useful because homogeneity of the target value in each node is desirable.
This method has the advantage that it works well for both discrete and
continuous target variables. Assume the target variable is continuous.
Let’s say that f splits the individuals I into two subsets (I = IT ∪ IF )
based on a certain condition on a continuous covariate. The first subset
consists of the individuals that satisfy the condition and the latter
contains the rest. The metric is based on the variance of the target
variable y of each subset. Explicitly, it is given by

MV(f) = − (1/|IT|²) Σ_{a,b∈IT} ½(ya − yb)² − (1/|IF|²) Σ_{a,b∈IF} ½(ya − yb)².
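The sketch announced above computes the Gini impurity and the entropy of a single split, weighting each subset by its size (the variance reduction metric is omitted for brevity). The data frame and column names in the usage line are hypothetical.

> # Gini impurity and entropy for a single split of a categorical target y
> # by a logical condition cond; each subset is weighted by its size.
> gini <- function(y) {
+   p <- prop.table(table(y))
+   sum(p * (1 - p))
+ }
>
> entropy <- function(y) {
+   p <- prop.table(table(y))
+   p <- p[p > 0]                 # avoid 0 * log2(0)
+   -sum(p * log2(p))
+ }
>
> split_metrics <- function(y, cond) {
+   parts <- split(y, cond)       # the subsets created by the split
+   w <- sapply(parts, length) / length(y)
+   c(gini    = sum(w * sapply(parts, gini)),
+     entropy = sum(w * sapply(parts, entropy)))
+ }
>
> # Hypothetical usage: split_metrics(students$dropout, students$age > 20)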

50
4.2 Random forest

The largest problem with decision tree learning is overfitting [22]. Trees tend
to get too complex and therefore do not generalize well, or they are grown
too deep (the subset at each leaf is too small) and learn highly irregular
patterns. So although they are easy to obtain, they are seldom accurate.
Besides, decision trees have problems learning concepts like XOR, parity or
multiplexer problems. Moreover, as pointed out before, splits based on infor-
mation gain are biased in favour of attributes with more levels. Fortunately,
creating an ensemble of trees by adding some randomness can solve these
problems and greatly boost the performance of the model. This leads to the
random forest algorithm.

The word forest suggests the construction of multiple decision trees. The
general idea is that the final prediction is the mode of the classes (classi-
fication) or the mean prediction (regression) of the individual trees. This
method leads to better model performance. While the predictions of a single
tree are highly sensitive to noise in its training set, the average of many trees
is not, as long as the trees are not correlated.

To correct for the decision trees’ habit of overfitting it is important that the
correlation between the trees is reduced. In a first attempt, by Tin Kam Ho
[23], trees were trained on random samples of variables instead of the entire
set of covariates. Informally, this causes individual learners to not over-focus
on features that appear highly descriptive in the training set, but fail to be
as predictive for points outside that set. Several variations on Tin Kam Ho’s
work had the same surprising result. Until then it was commonly believed
that the complexity of a classifier can only grow to a certain level, exceeding
that level will hurt the accuracy with overfitting. The new results shattered
this assumption.

The idea of Ho was extended by Leo Breiman and Adele Cutler [24]. They
named their algorithm Random forests and it became their trademark. The
algorithm was also influenced by an article of Amit and Geman [25], which
introduced the idea of adding randomness while splitting nodes. These
notions were combined with Breiman's earlier invention of tree bagging,
where each tree is fitted to a random sample of the training set instead
of the entire training set. This creates even more variation between the trees
in order to create a healthy forest. The next section will describe the most
important building blocks of Random forest.

51
4.3 Random subspace method

Ho named his first attempt to solve the overfitting problem of decision trees
the random subspace method. Before constructing a tree, a random subset
of the covariates is selected. The algorithm, which would otherwise select the
same descriptive covariate again and again, is now forced to consider different
options. So instead of a single tree with the most optimal splits, a forest of
trees with less optimal splits is created.
The quality of a random forest mainly depends on two things: the correla-
tion between the trees and the strength of each individual tree (see section
4.3). Ideally, the correlation is low while the individual strength is large.
Both values depend on the number m of covariates that is randomly se-
lected for each tree. Increasing m will result in a higher correlation and a
larger individual strength and reducing m will have the opposite effect. So
somewhere in the middle is the optimal value for m which is selected by the
algorithm itself. Typically, for a classification problem with M covariates,
m = ⌊√M⌋ features are used in each split. For regression problems the
inventors recommend m = ⌊M/3⌋ with a minimum of 5 as the default.
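In the randomForest package the number of covariates considered per split is the mtry argument; tuneRF searches over it using the out-of-bag error. A small sketch on simulated regression data (not the thesis data):

> library(randomForest)
>
> set.seed(2)
> d <- data.frame(matrix(rnorm(500 * 12), ncol = 12))   # M = 12 covariates X1..X12
> d$y <- d$X1 + 2 * d$X2 + rnorm(500)
>
> # mtry is the number of randomly selected covariates per split
> rf <- randomForest(y ~ ., data = d, mtry = 4, ntree = 300)
>
> # tuneRF varies mtry and keeps the value with the lowest out-of-bag error
> tuned <- tuneRF(d[, 1:12], d$y, ntreeTry = 300, stepFactor = 2, improve = 0.01)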

Tree bagging

Bagging was designed in 1994 by Breiman to improve the stability and accuracy
of machine learning algorithms in general. The word "bagging" comes from
bootstrap aggregating, and it follows the saying "pulling oneself up by one's
bootstraps" since the improvement is made without external help. Variation
between the trees of the forest is created by giving each tree its own random
sample of the training set. About one-third of the cases is left out of each
sample and is used for the out-of-bag error.

Out-of-bag error

An additional benefit of bagging is that the prediction error of the forest
can be measured during training. For each individual in the training set a
prediction of the target variable can be made by using only the trees for
which the individual was not in the bootstrap sample. Calculating the
difference between this prediction and the real value and then averaging over
all the individuals of the training set gives a good indication of the performance
of the model.
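Continuing with the hypothetical forest rf from the previous sketch: calling predict without new data returns exactly these out-of-bag predictions, so the OOB error is obtained without a separate test set.

> # Out-of-bag predictions: each individual is predicted only by trees whose
> # bootstrap sample did not contain it.
> oob_pred <- predict(rf)
> mean(abs(oob_pred - d$y))   # out-of-bag mean absolute error
>
> # For regression forests the object also stores the running OOB mean squared error
> tail(rf$mse, 1)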

52
The out-of-bag error is useful because the previously discussed extensions can
now be optimized. For the random subspace method the optimal number of
selected features can be determined. In tree bagging the size of the bag is a
free parameter which can also be optimized. Finally, the out-of-bag error is
also used to show the importance of each variable.
Another way of characterizing the accuracy of a random forest is by calculating
how far the percentage of trees with the right prediction exceeds the percent-
age of trees with any other (wrong) prediction. Breiman used this to formulate
a probability that the prediction of a random forest is wrong. In his article
he proves that this probability converges to a limit as the number of trees
increases, instead of growing without bound. This is the mathematical support
for the counterintuitive fact that a more complex forest is not prone to
overfitting.
There is an upper bound for the probability that the prediction of a random
forest is wrong. The two ingredients involved in this bound are the strength
(s) of the individual trees and the correlation (c) between them. This upper
bound has the order of magnitude of c/s2 . This helps to understand the
functioning of random forest.

Variable importance

In Cox regression it is possible to order variables by the influence they have.
This can be done with random forest as well and is by itself a helpful result.
It can also be used to assist the prediction by first
running random forest over all variables and then once again with only the
most important variables. A natural technique to measure the influence was
already described in Breiman’s original paper [24]. It is also implemented in
the R package for random forest that is used for this thesis [26].
To measure the importance of a covariate, the out-of-bag error is calculated
twice, once in the regular way and the second time with a random permuta-
tion of the values of the variable. The difference between these errors shows
how important a variable really is. This score can be normalized by the stan-
dard deviation of all differences, but this is not necessary when comparing
variables within the same experiment.
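In the randomForest package this permutation measure has to be requested when the forest is fitted; a sketch, reusing the hypothetical data frame d from the earlier example:

> # importance = TRUE makes the package compute the permutation measure;
> # type = 1 selects it, and scale = FALSE skips the normalization mentioned above.
> model <- randomForest(y ~ ., data = d, importance = TRUE, ntree = 300)
> importance(model, type = 1, scale = FALSE)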
However, some researchers are sceptical [27]. They claim that the technique
is biased towards categorical variables with more levels. Moreover, smaller
groups of correlated features are favored over larger groups.
The following is an example in which random forest is applied and the im-

53
gender school type av intake intake VR LR exam
boy mixed 0.166 1 2 0.619 0.261
boy mixed 0.166 2 3 0.205 0.134
girl mixed 0.166 3 2 −1.364 −1.723
boy mixed 0.166 1 2 0.205 0.967
girl girls 0.103 2 1 0.371 0.544
girl girls 0.103 1 2 2.189 1.734
.. .. .. .. .. .. ..
. . . . . . .

Table 4.1: Data set with school results of children from London.

portance of the variables is calculated. The data set consists of 4059 children
from 65 different schools in London. The head of the data is shown in table
4.1. From every student several details are recorded. The gender and school
type and average intake score of the school are known. Furthermore the re-
sults of four tests are included; the intake, verbal reasoning, logical reasoning
and the final exam.

One might think the gender of a student has a lot of influence on the final
exam score; now it is possible to actually test whether this holds true. The
following code gives the variable importance of the fitted random forest
model.

> library(randomForest)
>
> student_data <- read.table("~/data/student_data.txt")
> model <- randomForest(exam ~ ., data = student_data)
> importance(model)
                 IncNodePurity
LR_SCORE            1009.51497
GENDER                47.12455
SCHOOL_TYPE           76.79218
AV_SCHOOL_INTAKE     424.67686
VR_SCORE              72.17104
INTAKE_SCORE         632.62867

The output shows that the logical reasoning and the intake test are the most
important covariates. The gender of the student has only a marginal effect
on the final exam in comparison to the other variables.

54
Missing value replacement

Empty entries in the training data can be problematic for fitting random
forest. Missing data can be replaced with the most frequent value or the
average of the non-missing values for that variable. This might seem to be
a crude solution, but it actually gives quite a satisfying result in most cases,
as will be shown in the next example.
The result can be boosted even more. After running random forest one can
calculate, for each pair of individuals in the training set, the fraction of trees
in which they end up in the same terminal node; these values are called
proximities. Now each missing entry is again filled in by the most frequent
value or the average of the values, but weighted by the proximities to the
individual with the missing entry, so that values of similar individuals count
more heavily. From experience it is known that four to six iterations are enough.
The next example compares the two methods of value replacement. This
time data from baseball players is used to predict their annual income. A
small snapshot of the data is given in table 4.2.

Name Allanson Ashby Davis Dawson Griffin


At bat ’86 293 315 479 496 594
Hits ’86 66 81 130 141 169
Home runs ’86 1 7 18 20 4
Runs ’86 30 24 66 65 74
Batted in ’86 29 38 72 78 51
Walks ’86 14 39 76 37 35
Team ’86 Cle. Hou. Sea. Mon. Oak.
Put outs ’86 446 632 880 200 282
Assists ’86 33 43 82 11 421
Errors ’86 20 10 14 3 24 ...
Years in ML 1 14 3 11 11
At bat 293 3449 1624 5628 4408
Hits 66 835 457 1575 1133
Home runs 1 69 63 225 19
Runs 30 321 224 828 501
Batted in 29 414 266 838 336
Walks 14 375 263 354 194
League ’87 A N A N A
Team ’87 Cle. Hou. Sea. Chi. Oak.
Income NA 475.0 480.0 500.0 750.0

Table 4.2: Data set with historical details of baseball players.

55
This data set is entirely filled out except for some incomes of first year players.
Since random forest cannot handle missing target values, these players are
removed. A total of 263 players remains from the original 322 players.
Next, some percentage p of the records is randomly removed and filled back
in with both methods. The first method uses the average or most common
value. The second method also uses the iterations described above. The data
is split into a training set and a test set. The random forest model is fitted
on the training set and used for predictions on the test set. The mean error
of the predictions is plotted.
> library(randomForest)
>
> baseball <- read.delim("~/data/baseball.txt")
> baseball <- baseball[!is.na(baseball$Y), ]
>
> RemoveData <- function(percent, data) {
+   if (percent != 0) {
+     for (i in colnames(data)) {
+       if (i != "Y") {
+         data[sample(nrow(data), floor(nrow(data) / 100 * percent)), i] <- NA
+       }
+     }
+   }
+   return(data)
+ }
>
> MeanPredError <- function(percent, data, impute = FALSE) {
+   set.seed(37)
+
+   data <- RemoveData(percent, data)
+   if (impute) {
+     data <- rfImpute(Y ~ ., data = data)
+   }
+   else {
+     data <- na.roughfix(data)
+   }
+
+   samp <- sample(nrow(data), floor(0.7 * nrow(data)))
+   trainset <- data[samp, ]
+   testset <- data[-samp, ]
+
+   model <- randomForest(Y ~ ., data = trainset)
+   testset$PRED <- predict(model, testset)
+   testset$DIFF <- abs(testset$PRED - testset$Y)
+   return(mean(testset$DIFF))
+ }
>
> error_roughfix <- rep(0, 100)
> error_impute <- rep(0, 100)
> for (i in c(0:99)) {
+   error_roughfix[i + 1] <- MeanPredError(i, baseball)
+   error_impute[i + 1] <- MeanPredError(i, baseball, TRUE)
+ }
> plot(error_roughfix, col = "red")
> points(error_impute, col = "blue")
56
Figure 4.2: Mean prediction error after removing a percentage of the data.

Figure 4.3: Mean prediction error after removing a percentage of the data in
the training set. The test set is complete.

In both cases the extended method is slightly more successful. More surprisingly,
not much prediction power is lost even after a substantial percentage of the
data is removed.

57
Chapter 5

Implementation

The previous chapters gave a precise and formal description of two methods
for survival analysis. As always, the practical implementation is not without
obstacles, and this chapter covers the application of the methods. Moreover,
it reveals some additional technical details that are worth mentioning.

5.1 Description of the data


Although it was mentioned in the introduction that the motivation of this research
is the underground network, the developed methods are applicable in a much
wider spectrum. Actually, for assets which are located above ground, the
methods would give better results, since the covariates can describe the state
of such assets better. Nevertheless, the data that is used for this thesis is
obtained from the underground network of Alliander.
Electricity is distributed along cables that are connected to each other by
joints. These are the two main components of the underground grid. Another
division of the data is due to the difference between the low voltage grid
(∼400 volt) and the medium voltage grid (∼10,000 volt). Expressed in
the number of components, the low voltage grid is roughly 10 times as large as
the medium voltage grid. Together this results in four different data sets:
low voltage cables, low voltage joints, medium voltage cables and medium
voltage joints.
The origin of the data sets is not quite as important as the covariates. Going
over them also reveals the meticulous policy of Alliander, since on various as-

59
Figure 5.1: A cross section of a joint connecting two red medium voltage
cables.

pects detailed information is available. Rather than giving a list of covariates
for each of the four data sets, a more descriptive approach is chosen in which
all four data sets are treated simultaneously. All the covariates will be
emphasized the first time they are mentioned.
To begin, all assets have multiple IDs originating from different systems to
uniquely identify the objects. Evidently these have no predictive power and
are immediately taken out of the data set.
To begin, all assets have multiple IDs originating from different systems to
uniquely identify the objects. Evidently these have no predictive power and
are immediately taken out of the data set.
Obviously survival data is essential for the Cox model. Because of recent
projects at Alliander, a lot of work has already been done to obtain reliable
survival data. Still, the year in which an asset was placed is unknown for
a substantial part of the data set; this percentage is decreased by
deriving the year of birth of joints from the surrounding cables.
This works because a joint can never be older than the cables it connects.
From this study it became clear that some data is still unknown and that
not all the data that is available is correct. This is a common problem in
statistics and it does not necessarily mean that predictions can no longer be
valuable.
This project also benefits from an extensive method that hunts down the
assets that have failed in the past. This information is stored in the covariate
event and luckily for the company only a small percentage of the assets have
failed. However, this does make the analysis more difficult.
To complete the survival data the date of entering and the date of leaving
the study is captured for every asset. It is immediately clear that big groups
of assets have entered the study on the same day.
The joints and cables are all part of one big physical grid and Alliander knows
the exact coordinates of every asset. For joints this is just one pair of coor-

60
dinates, but for cables the coordinates of both the start and the end point are
known. The grid is divided into rayons, which, given the coordinates of each
asset, is of course only a projection onto a smaller set. Nevertheless, knowing
the rayon can be valuable because it might hold historical information. Exam-
ples of rayons are "Amsterdam", "Rijnland" and "Gelderland-Oost".

Figure 5.2: A plot of the coordinates of the medium voltage joints, a red dot
indicates the occurrence of an event.

Besides the exact location, the covariates also provide a detailed descrip-
tion of the surroundings, for example the type of soil and the subsidence
of the ground. Next to that, weather-related information like the number of
dry days and the temperature is known. Even the distance to the nearest
weather station is added, such that the accuracy of the weather information
can be estimated. All the covariates just mentioned are known to be influ-
ential. For instance, underground assets tend to decay faster in hot and dry
circumstances.
It is common sense that the distance to the nearest weather station should
not have a direct effect on the hazard rate. This covariate gave the inspiration
to investigate interactions between covariates. Weather based covariates are
probably less accurate if the distance to a weather station increases. Even-
tually these interactions are not taken into account. Still this covariate is
added to the analysis because the algorithms should be robust enough to
handle covariates with no direct influence.
The environment is furthermore described by the distances to the nearest
tree, nearest building and nearest railroad. Assets close to a tree are more

61
vulnerable to failure, since a storm could take down a tree, which in turn
can damage the grid. Moreover, some numbers that indicate the population
density are available.
Other meaningful data concerns the construction of the asset. For joints only
the type and sort are written down. The type indicates for what purpose
the joint is placed and the sort tells how the joint itself is made. Especially
the latter is believed to be of great importance. For instance a joint of sort
“nekaldiet” has a bad reputation. It is the task of this thesis to give an
objective final judgement in this case. The construction of the cables has
more parameters such as length, diameter, conductor and isolator.
Even though it can probably be derived from the construction, there are
some covariates that store the usage profile. The voltage level and for cables
also the maximum and mean load fraction are known.
Finally, because defects in the grid can damage surrounding assets, for each
asset the number of close failures is known.

5.2 Application of the Cox model

First off, the data has to be prepared for the analysis. The first rigorous step
is to delete all entries of which the birth year is missing. As explained in the
previous section, the year at which an asset is placed is not always registered,
but this information is crucial for Cox regression.
The dates on which an asset enters and leaves the observation window are
both registered. For the regression however, only the years of these events
are used. This is because only the year (and not the exact date) of placement
is known. This is in line with the explanation that the baselines handle time
as a discrete variable.
Unfortunately, there are still missing records and the method cannot be
applied to them. Therefore they are imputed in a straightforward manner.
Numeric values are imputed by the mean and missing factors are replaced by
a new level named "unknown". An example of a factor is the covariate
"sort", which has "nekaldietmof" or "oliemof" as possible values. In the final
calculation of the hazard rate, this factor itself cannot be multiplied with a
corresponding β; each of the values of the factor has its own separate β. To
exploit this property some of the numeric covariates are artificially transformed
into a factor. It turned out that this approach led to better results.
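A minimal sketch of this imputation step (the helper function and its name are not from the thesis code):

> # Impute numeric covariates by their mean and missing factor values by "unknown"
> impute_simple <- function(data) {
+   for (col in names(data)) {
+     if (is.numeric(data[[col]])) {
+       data[[col]][is.na(data[[col]])] <- mean(data[[col]], na.rm = TRUE)
+     } else {
+       x <- as.character(data[[col]])
+       x[is.na(x)] <- "unknown"
+       data[[col]] <- factor(x)
+     }
+   }
+   data
+ }
>
> # Hypothetical usage: mv_joints <- impute_simple(mv_joints)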

62
5.2.1 Glmnet

The Glmnet package is mainly used for making a selection of the covariates.
The package also returns the corresponding values for β, but these will not
be used. Fitting the Cox model again, but now on the selected subset and with
the official Cox package, gives better predictions.
In chapter 3 it is explained that Glmnet uses a penalty to select covariates.
Figure 5.3 shows that the likelihood is optimal for a penalty of roughly e^{−10.5}
(in which case 107 of the 137 initial covariates are still left). However, Glmnet
selects the largest penalty with a likelihood that is still close enough to the
optimal value, so with the fewest selected covariates. The selected penalty
lies around e^{−8} and 33 covariates have a coefficient other than zero.

Figure 5.3: The penalty factor plotted against the likelihood and the number
of selected covariates.

Glmnet can handle large numbers of covariates and is still able to select the
most important ones. This makes it possible to experiment a little bit. One
of the experiments was to feed Glmnet covariates after applying a function
to the covariates. The idea was that it would enhance fits to quadratic or
logarithmic relationships. This is not used in the final model because it was
not extremely successful.
In another experiment meaningful interactions between covariates were stud-
ied. Theoretically this would make it unnecessary to execute the method
separately on each asset type. Indeed, if a specific covariate would be more

63
important for a certain type, this method would select its combination. First,
the normal selection procedure of Glmnet was executed. The selected co-
variates were then pairwise multiplied which led to a quadratic number of
covariates. Now Glmnet was asked to perform another selection. Note that
it could still choose the normal covariates, which are the ones multiplied by
themselves. Glmnet did actually select combinations of covariates, but, surprisingly,
the prediction power got worse.

5.2.2 Baseline hazard

Retrieving the baseline hazard is done by optimizing the likelihood as ex-
pressed in chapter 3. The optimization was done for each year separately,
according to plan. However, because there were so few failures, there were
some ages with no failures at all. This was in particular the case for the higher
values of j, up to j = 100. It is not difficult to see that the optimal value for λj
in equation (3.16) is 0 if no assets died at the age of j. This is obviously not
very realistic and some kind of smoothing might be appropriate.
Eventually two ways of smoothing (described below) were implemented. On
top of that the original way of retrieving the baseline hazard was kept in use
to check whether the smoothing had a positive effect. Furthermore, the first
fifty λj (so with j ∈ {1, 2, . . . , 50}) were always calculated in the original
sense because these λj were pretty stable and no smoothing was required.
This can be seen in the plot of the baseline hazard on page 75.
The first way of smoothing restricted the outcome by requiring that groups
of five λj ’s in a row all had the same value. The likelihood was optimized
accordingly, so retrieving the optimal value of λj for the whole period of five
years.
The second way of smoothing was completely different. A quadratic func-
tion was fitted to the λj's with j ∈ {51, 52, . . . , 100}. The choice for a
quadratic function was based on nothing other than the appearance of the
first result.
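One simple way such a quadratic fit could be carried out is a least-squares fit on the separately estimated values, as in the sketch below; lambda_hat is a hypothetical vector holding the estimated λj, and the thesis may instead have re-optimized the likelihood directly.

> # Fit a quadratic function to the estimated baseline hazard for ages 51-100
> ages <- 51:100
> fit <- lm(lambda_hat[ages] ~ poly(ages, 2))
> lambda_smooth <- lambda_hat
> lambda_smooth[ages] <- pmax(predict(fit), 0)   # a hazard rate cannot be negative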
It would be nice to compare these three different methods. Just calculating
the likelihood is not very helpful. The first way of smoothing is just a sub-
model of the original method and because the optimization is based on the
likelihood, the original will always score at least as high. The big difference is
the number of parameters, which is respectively 50, 10 and 3. Reducing the
number of parameters has an impact on the optimal likelihood that can be
obtained. The Akaike Information Criterion (AIC) combines these two aspects [28].

64
It deals with the trade-off between the goodness of fit and the complexity
of the model. Let L be the maximum value of the likelihood function for
the model and let k be the number of estimated parameters. Then the AIC
value of the model is

AIC = 2k − 2 ln(L). (5.1)

From the formula it follows that a smaller value for AIC indicates that the
model is better. The AIC is calculated for the last fifty values of j because
this is the part where the models are different.

Method      AIC        Log-likelihood ln(L)
original    1522.033   −711.0164
segment     1486.945   −733.4723
quadratic   2869.579   −1431.79

Table 5.1
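As a quick check of the first row: the original method estimates k = 50 parameters with log-likelihood ln(L) = −711.0164, so

AIC = 2 · 50 − 2 · (−711.0164) = 100 + 1422.033 = 1522.033,

which matches the tabulated value; the rows for the segmented smoothing (k = 10) and the quadratic smoothing (k = 3) check out in the same way.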

According to Akaike, the segmented smoothing is the better model, and this
will be shown again later. The criterion penalizes an increasing number of
parameters and thereby discourages overfitting. So although the likelihood of
the original method is higher, its predictive power might be less. In the next
chapter only the most successful smoothing technique is applied.

5.3 Application of Random forest

Implementing random forest is less work because the method can handle big
unprepared data sets. Like the Cox model, Random forest cannot handle
missing entries, so these are all imputed with the extended imputation from
chapter 4. Note that now the year of placement is also imputed instead
of deleting the specific records. Another difference is that now the full dates
of the observation window are used, again because Random forest has fewer
restrictions on the data set.
After the whole data table is filled, Random forest is invited to build a model.
Stratified sampling is used to over-represent failed assets when a tree is built.
This makes the model slightly better. The stratification ratio is 10 : 1, so on
average the set used for building a tree contains a failed asset for every ten
censored assets.
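In the randomForest package such stratified sampling can be requested with the strata and sampsize arguments; a sketch with a hypothetical asset table, where the event column is assumed to be a factor with levels "0" (censored) and "1" (failed):

> library(randomForest)
>
> n_failed <- sum(assets$event == "1")
>
> # Each tree is grown on a sample with ten censored assets per failed asset,
> # which over-represents the rare failures.
> rf <- randomForest(event ~ ., data = assets,
+                    strata = assets$event,
+                    sampsize = c("0" = 10 * n_failed, "1" = n_failed),
+                    ntree = 500)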

65
5.4 Objective test
Before applying the two methods to the whole data set, they are objectively
tested. The concept of this test was already mentioned in the beginning of
this thesis in chapter 2. Failures have been registered from 2007 till 2017.
The failures up to 2015 and the state of the assets at that point will be used
to train the model. Afterwards the models will predict the failure rates of
the set of assets that have lived in the period from 2015 to 2017. The test has
also been performed with other split points, but here only one split is shown.
This split gave the best results for both methods and all data sets.
After the predictions are made, the test set is ordered based on the predic-
tions. The following curve then shows the percentage of assets that failed
between 2015 and 2017 plotted against the ordered data set. This is called
an ROC-curve and is often used in signal detection and classification problems.
A higher area under the curve means that on average failed assets have been
assigned a higher failure rate than censored assets. So one aims for the highest
possible area under the curve.
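For the standard ROC-curve, the area under the curve equals the probability that a randomly chosen failed asset gets a higher predicted failure rate than a randomly chosen censored asset, and it can be computed directly from ranks; a sketch with hypothetical vectors pred and failed:

> # pred:   predicted failure rate per asset in the test set
> # failed: 1 if the asset failed between 2015 and 2017, 0 otherwise
> auc <- function(pred, failed) {
+   r  <- rank(pred)                      # ties get average ranks
+   n1 <- sum(failed == 1)
+   n0 <- sum(failed == 0)
+   (sum(r[failed == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
+ }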
It might be difficult to judge the outcome of this test by itself; maybe one
would expect a higher number. But keep in mind that the models are fitted
to only a part of the real failures and have to predict which assets are going to
fail, while less than 1 percent of the assets actually failed.

Method MV joints MV cables LV joints LV cables


Cox 0.6538 0.5691 0.6650 0.5723
Random forest 0.6915 0.6230 0.6497 0.6748

Table 5.2: Area under curve for the two methods and the four data sets split
at 2015.

The result of the test will not be discussed in this chapter. It will be taken
into account in the final judgement at the end of the next chapter. This
test is designed to compare the two different methods, but it does not fully
exploit the data in order to get the best fitting model. The values of the most
accurate models are what interests Alliander, and for this reason both
methods are also applied to the full data set. This is of course done with cross-
validation; the data set is split into five parts, and to determine the failure
rates of one part the model is trained on the four other parts.

66
Figure 5.4: ROC-curve of the Cox model without smoothing method on
medium voltage joints. The area under the curve is 0.6538.

67
Chapter 6

Results & Conclusions

Following the description in the previous chapter will lead to a program. For
this thesis, it was possible to run this program on the server of Alliander.
The server had a faster processor and more memory available. This chapter
contains all the results of both the Cox model and Random forest generated
by the code. Some effort is made to assist the reader in interpreting the
information in the right way. At the end of this chapter a final discussion
will conclude which of the two models is the better one.

6.1 Cox model


The first output of the Cox model is the vector β. It contains the effect that
each covariate has on the hazard rate; in the hazard rate the coefficients are
multiplied with the values of the covariates. Compensation is needed due to
the different ranges of the covariates. The normalized factors are obtained by
taking the average of exp(Xi^(k) β^(k)) over all assets (here (k) indicates the
kth coordinate, which corresponds with the kth covariate). Also the (normalized)
95%-confidence bounds for the factor are given; this indicates the accuracy of
the estimate. Finally, it could be the case that all values of Xi^(k) are the
same, in which case the covariate has no effect. So, besides the mean factor,
the variance of the factor is also of interest.
Let’s start with two covariates of the medium voltage joints. A small descrip-
tion is given to guide the reader while interpreting the results. Immediately
giving an overload of numbers might be too overwhelming.
In the analysis the covariate “dry days” (DD) is treated as a factor in contrast

69
Two coefficients for medium voltage joints
Covariate Value Factor Lower Upper Variance
1110 1.345 1.042 1.735
dry days 1150 2.642 1.791 3.898 0.015
1200 0.993 0.822 1.200
main cable failures 1.063 0.984 1.148 0.429

Table 6.1

to the other covariate "main cable failures" (MCF). In the data set DD takes
on 13 different values, but Glmnet has selected only 3 of them. The Cox
regression has given each of them its own factor. Apparently, if DD is 1150
the hazard rate is multiplied by 2.642, resulting in a higher risk, while the
risk is lower if the value is 1200. The MCF of the assets is much more diverse,
but its mean impact is 1.063, which is relatively close to 1. However, the
variance of the MCF factor is considerable, so the hazard rates are multiplied
with a wide variety of factors. This shows that MCF has a big impact on the
predictions.

At the start of the analyses there were roughly 150 covariates. Table 6.2
displays the full list of coefficients that are still left after the selection by Glm-
net. The most important covariate is "construction". The "nekaldietmof" is
roughly four times as bad as the others in the table. Keep in mind that not
all values are mentioned, so for instance the “gietharsmof” is only twice as
good as the “nekaldietmof”, since values that aren’t mentioned in the table
basically have a factor equal to 1.

Joints with an unknown voltage have an incredibly high risk. But there are
negligibly few assets with an unknown voltage; this can be deduced from the
wide range between the lower and upper 95%-bound. The same is true for
some values of the covariate "number of dry days". These situations are
therefore not significant for predictions of the future. Nevertheless, it might be
worth investigating these specific situations, because the underlying cause
of these extreme values could be a solvable flaw in the system. In all tables
the irrelevant extreme values are coloured red.

The covariates “rayon” and “voltage” are important because they have ex-
treme factors.

The last covariate worth mentioning is “main cable failures”. Despite the
fact that the factor is close to 1, the variance is what makes this covariate
important. Because each main cable failure increases the failure rate by

70
6.3 percent, the wide range of the covariate translates into a large variance
of the factor. In all tables the most relevant factors are coloured blue.
The tables with coefficients for the three other data sets are given but not
discussed any further.

Coefficients for medium voltage joints


Covariate Value Factor Lower Upper Variance
1110 1.345 1.042 1.735
dry days 0.015
1150 2.642 1.791 3.898
1 0.965 0.871 1.069
isolines 0.001
2 0.870 0.698 1.086
distance to tree > 10 1.204 1.048 1.384 0.005
kabeldon 0.444 0.366 0.538
krimpmof 0.549 0.444 0.678
kunststofmof 0.824 0.672 1.012
massamof 0.540 0.407 0.715
construction 0.127
nekaldietmof 1.988 1.664 2.377
oliemof 0.538 0.457 0.634
unknown 0.144 0.092 0.223
other 0.700 0.562 0.872
low mineral 0.499 0.258 0.967
mire 1.214 1.018 1.448
soil sand 0.870 0.750 1.008 0.009
heavy clay 1.159 0.974 1.379
main cable failures 1.063 0.984 1.148 0.429
digging activity 0.872 0.871 0.873 0.018
5 0.868 0.690 1.092
height isoline -2.5 1.261 1.112 1.429 0.012
-5 1.283 0.993 1.657
2 0.934 0.705 1.237
subsidence 0.002
8 1.141 0.970 1.343
generation unknown 1.322 1.187 1.472 0.015
Beneden NHN 0.747 0.600 0.931
GD-oost 0.752 0.611 0.925
rayon 0.089
Rijnland 2.052 1.661 2.536
Veluwe 0.705 0.557 0.893
3.000V 0.430 0.280 0.661
voltage 0.678
unknown 12.167 9.147 16.183

Table 6.2

71
Coefficients for medium voltage cables
Covariate Value Factor Lower Upper Variance
wires 3 2.563 1.788 3.674 0.225
distance to railroad [10, 100] 1.419 1.161 1.734 0.013
y coordinate 3.232 3.232 3.232 0.128
galvanic failures 0.908 0.891 0.925 0.009
copper 0.827 0.708 0.967
conductor 0.023
unknown 0.332 0.175 0.629
light clay 1.237 1.026 1.491
fine sand 0.805 0.640 1.011
soil 0.015
mire 1.302 1.053 1.610
heavy clay 1.276 1.039 1.568
main cable failures 1.079 1.015 1.147 0.773
digging activity near main cable 0.780 0.779 0.781 0.035
12.5 0.711 0.467 1.081
height isoline 0.007
-2.5 1.149 0.980 1.347
isolation XLPE 2.088 1.650 2.642 0.239
2 0.818 0.565 1.184
subsidence 4 0.829 0.710 0.967 0.023
8 1.475 1.205 1.805
Friesland 1.497 1.166 1.923
GD-oost 0.872 0.680 1.120
rayon 0.079
Rijnland 1.457 1.156 1.836
Veluwe 0.531 0.401 0.702
overflow 1.156 0.994 1.345 0.002
10.000V 1.183 0.806 1.737
voltage 0.028
unknown 0.122 0.028 0.524

Table 6.3

72
Coefficients for low voltage joints
Covariate Value Factor Lower Upper Variance
dry days 0.004 0.004 0.004 0.000
inhabitants 1.198 1.198 1.198 0.075
address density 1.013 1.013 1.013 0.000
krimpmof 1.460 0.355 6.001
kunstharsmof 0.391 0.222 0.688
massamof 0.501 0.274 0.914
nekaldietmof 1.301 0.554 3.057
construction 0.009
unknown 0.535 0.406 0.704
cellpack 0.706 0.522 0.956
filoform 0.921 0.677 1.254
other 0.720 0.539 0.961
low mineral 0.523 0.288 0.949
mire 0.705 0.626 0.794
soil 0.011
heavy clay 0.702 0.592 0.833
fine sand 0.711 0.617 0.820
2.5 1.245 1.155 1.343
52.5 3.650 1.735 7.681
height isoline 0.016
70 6.585 2.949 14.710
80 4.511 1.125 18.080
2 0.526 0.441 0.627
3 0.690 0.579 0.823
subsidence 0.051
4 0.671 0.608 0.742
8 1.409 1.297 1.530
digging activity 1.023 1.009 1.037 0.006
area neighbourhood 1.033 1.033 1.033 0.007
cars in area 0.773 0.773 0.773 0.023
south NHN 0.634 0.566 0.711
Friesland 0.568 0.465 0.694
GD-Oost 1.203 1.028 1.407
rayon 0.049
Gooi en Flevoland 0.706 0.627 0.796
NHN 0.555 0.498 0.620
Veluwe 0.630 0.540 0.734
sort connection 1.981 1.859 2.112 0.165
voltage unknown 11.270 8.918 14.240 0.206

Table 6.4

73
Coefficients for low voltage cables
Covariate Value Factor Lower Upper Variance
1 0.431 0.203 0.915
wires 7 0.113 0.047 0.272 0.018
unknown 0.356 0.277 0.459
diameter 0.996 0.994 0.997 0.000
loam 0.175 0.044 0.702
light clay 0.853 0.769 0.946
light sand 0.816 0.734 0.906
low mineral 0.966 0.746 1.250
soil 0.014
mire 0.767 0.697 0.843
water 1.227 0.996 1.511
sand 0.766 0.713 0.823
subsidence 1.453 1.333 1.582 0.041
digging activity near main cable 0.594 0.593 0.595 0.058
87.5 15.200 8.126 28.424
height isoline 0.022
90 4.312 1.074 17.309
plastic 0.827 0.772 0.886
isolation PVC 0.851 0.776 0.934 0.026
XLPE 0.632 0.587 0.680
5 1.308 1.211 1.412
subsidence 0.020
8 1.311 1.211 1.419
length 1.010 1.010 1.011 0.374
max load 1.600 1.598 1.601 0.493
area neighbourhood 1.012 1.012 1.012 0.001
GD-Oost 0.717 0.640 0.803
Gooi en Flevoland 0.788 0.721 0.862
rayon NHN 0.721 0.654 0.796 0.033
Rijnland 0.911 0.827 1.003
Veluwe 0.536 0.491 0.585
400V 2.287 1.810 2.890
voltage 0.066
unknown 0.381 0.139 1.043

Table 6.5

74
Besides the coefficients, the Cox model provides the valuable baseline hazard.
Figure 6.1 shows the baseline where the values for each year were approxi-
mated separately.

Figure 6.1: Baseline hazard functions of the medium voltage joints.

One can definitely recognise a bathtub curve as shown in figure 2.1, but
there is an additional peak around the age of 40. After some research it turned
out that this peak is caused by a specific sort: for joints with nekaldiet as
construction, only survival data from an interval around 40 years is available.
Therefore it would be better to compensate the baseline hazard for predictions
on other joints, but this is beyond the scope of this project.
The main benefit of the baseline hazard is the ability to make predictions
for the coming years. Figure 6.2 shows the expected number of failures for
the next 40 years. With the right replacement policy Alliander can try to
prevent the increase in the number of failures.
The baseline hazard functions for the three other data sets are given but not
discussed any further.

75
Figure 6.2: Expected number of failed medium voltage joints for the next 40
years according to the Cox model.

Figure 6.3: Baseline hazard functions of the medium voltage cables.

76
Figure 6.4: Baseline hazard functions of the low voltage joints.

Figure 6.5: Baseline hazard functions of the low voltage cables.

77
6.2 Random forest
After fitting Random forest the resulting model is not immediately inter-
pretable: the collection of trees is unstructured and anything but compact.
However, chapter 4 did mention a way to show the importance of variables.
The results are shown in figure 6.6.

Figure 6.6: Variable importance of Random forest model on medium voltage


joints.

A known flaw of the variable importance plot of Random forest is that it
favours variables with a wide variety of unique values. The three covariates
that are selected as most important are a perfect example. This bias origi-
nates from the fact that for these covariates there are a lot of possible splits.
Being overrepresented as split covariate implies that the predictive power of
the model gets worse when the covariate is deleted. Apart from this bias
the result is not really surprising, but it is still difficult to interpret.
The variable importance for the three other data sets is given but not
discussed any further.

78
Figure 6.7: Variable importance of Random forest on medium voltage cables.

Figure 6.8: Variable importance of Random forest on low voltage joints.

79
Figure 6.9: Variable importance of Random forest on low voltage cables.

Random forest itself does not model the lifetime of an asset. It can predict
whether an asset has failed or not at this very moment, but it has no
built-in function to describe the coming years. The prediction is the average
over all the trees. This gives a number between 0 and 1 which is used as
the probability that the asset has failed. However, it is doubtful that this
number is a realistic hazard rate. Alliander has done some work to scale
the outcome of Random forest by compensating for the stratification. This
scaling process will not be discussed in this thesis.

80
6.3 Conclusion
Both methods have been applied on the four entire data sets (with cross-
validation). Ordering the assets on hazard rate and calculating the ROC-
curves gives the following results:

Method MV joints MV cables LV joints LV cables


Cox 0.7322 0.6902 0.7341 0.7367
Random forest 0.8472 0.8130 0.8081 0.8273

Table 6.6: Area under curve for the two methods and the four data sets with
cross validation.

But especially in this section the objective test should be considered as well;
otherwise the whole time-dependent component of the analysis is ignored.

Method MV joints MV cables LV joints LV cables


Cox 0.6538 0.5691 0.6650 0.5723
Random forest 0.6915 0.6230 0.6497 0.6748

Table 6.7: Area under curve for the two methods and the four data sets split
at 2015.

Numbers don't lie; Random forest is better at pointing out the assets with
a higher risk in a relatively short period of time. However, as is shown in
this chapter, it falls short when it comes to interpretation. Not only are the
absolute hazard rates not of the right order of magnitude, but pointing out
the most influential covariates also seems difficult.
The Cox model depends on assumptions about the relation between the
covariates and the hazard rate. Presumably these assumptions do not per-
fectly correspond to reality. Nevertheless, as a first-order estimate it gives a
valuable result. The hazard rates are of the right order of magnitude and the
model gives even more information: the coefficients together with the base-
line hazard function give insight into the lifetime of the assets and enable
accurate predictions for the future. For all this, a price has to be paid; it
takes roughly ten times as long to fit the Cox model as to fit Random forest.
Despite the extra features of the Cox model, its predictions in the objective
test as well as in the cross-validation are worse. This would indicate that

81
in Alliander's situation the time-dependent features of the Cox model are
not decisive. The Random forest model focuses mainly on the covariates and
apparently these are the better predictors. The Cox model might perform bet-
ter in a totally different situation, where the lifetime has more influence on
the hazard rate than the explanatory variables. To illustrate this,
the cross-validation has again been performed on the medium voltage joints,
but this time all covariates except soil have been removed. Indeed the Cox
model is more powerful in this situation, with an area under the ROC-curve
of 0.6484 compared to 0.5352 for the Random forest.
Taking everything into consideration, the Cox model is the preferred method.

82
Chapter 7

Further research

Often the longer you work on a project the more you realise how little work
you have done. This is very true for this thesis. Some thoughts about further
research that came up during this project are listed below.
Let's start with the baseline hazard function as part of the Cox model. Dur-
ing the implementation there were scattered values for the higher ages and
the baseline hazard function was zero for ages with missing data. Both facts
are warning signs of overfitting and some kind of smoothing is wise. The seg-
mented version of the baseline hazard is a first attempt. Further research
could be done to find a better segmentation or even entirely different ways
of smoothing. Of course the likelihood based on the data is highest for
the original method of retrieving the baseline hazard function, but reducing
overfitting could actually improve the predictions.
This thesis focuses on the differences between the Cox model and random forest; they are compared by applying them to the same problem. Instead, one could investigate whether combining the two methods pays off. A construction that uses the strongest components of each method could give the best of both worlds. Some first ideas have been tested, but they are beyond the scope of this thesis.
The final suggestion concerns the time-dependence of the Cox model. The Cox model separates the time-dependent baseline hazard function from the proportional factor, and classically this factor depends only on the covariates and not on time. The book [29] introduces time-varying covariates and even time-varying coefficients. This could be interesting because then the different life cycles of, for instance, the joints would no longer influence the general baseline hazard.
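As a starting point for such an extension, the survival package in R already supports the counting-process formulation and a tt() mechanism for time-varying effects. The snippet below only shows the shape of such a model and is not taken from [29]; the data frame intervals, with one row per asset per period of constant covariates, and the numeric covariate load are hypothetical.

    library(survival)

    # Counting-process form: each row covers the interval (tstart, tstop] with
    # constant covariates; status marks whether the asset failed at tstop.
    fit_tv <- coxph(Surv(tstart, tstop, status) ~ soil + load, data = intervals)

    # A time-varying coefficient for a numeric covariate can be explored with
    # tt(), here letting the effect of load change log-linearly with age.
    fit_tt <- coxph(Surv(tstart, tstop, status) ~ soil + load + tt(load),
                    data = intervals,
                    tt = function(x, t, ...) x * log(t))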

Chapter 8

Bibliography

[1] Alliander Annual Report 2015, https://www.alliander.com/sites/default/files/Alliander Annual Report 2015.pdf.
[2] W. Sinke et al., NWA Route Energietransitie, www.nera.nl/wp-content/uploads/2016/06/NWA-Energietransitie-routebeschrijving.pdf, NERA, 2016.
[3] Power Distribution in Europe, Facts and Figures, www.eurelectric.org/media/113155/dso report-web final-2013-030-0764-01-e.pdf, Eurelectric, 2013.
[4] P. M. van Oirsouw, Netten voor distributie van elektriciteit, Phase to Phase, Arnhem, 2012.
[5] http://www.energieleveranciers.nl/netbeheerders/elektriciteit.
[6] J. Heres, R. Stijl, F. Reinders, Machine Learning Methods for the Health-Indexing and Ranking of Underground Distribution Cables and Joints, CIRED, 2016.
[7] T. M. Therneau, T. Lumley, Survival Analysis, R package, June 26, 2016.
[8] H. Ishwaran, U. B. Kogalur, Random Forests for Survival, Regression and Classification, R package, May 17, 2016.
[9] J. A. Rice, Mathematical Statistics and Data Analysis, 3rd edition, Brooks/Cole Cengage Learning, 2007.
[10] D. R. Cox, Regression Models and Life-Tables, Journal of the Royal Statistical Society, Series B (Methodological), Volume 34, Number 2, 1972.

[11] K. E. Atkinson, An Introduction to Numerical Analysis, John Wiley & Sons, Inc., ISBN 0-471-62489-6, 1989.
[12] B. Efron, The Efficiency of Cox’s Likelihood Function for Censored Data, Journal of the American Statistical Association, Vol. 72, No. 359, pp. 557-565, 1977.
[13] J. Borucka, Methods for Handling Tied Events in the Cox Proportional Hazard Model, Studia Oeconomica Posnaniensia, Vol. 2, No. 2 (263), 2014.
[14] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, Series B (Methodological), 58, 1996.
[15] J. Gui, H. Li, Penalized Cox Regression Analysis in the High-Dimensional and Low Sample Size Settings, with Applications to Microarray Gene Expression Data, Bioinformatics, 21(13), 3001-3008, 2005.
[16] H. Zou, T. Hastie, Regularization and Variable Selection via the Elastic Net, Journal of the Royal Statistical Society, Series B, 67(2), 301-320, 2005.
[17] M. Y. Park, T. Hastie, L1-Regularization Path Algorithm for Generalized Linear Models, Journal of the Royal Statistical Society, Series B, 69, 659-677, 2007.
[18] J. J. Goeman, L1 Penalized Estimation in the Cox Proportional Hazards Model, Biometrical Journal, 52(1), 70-84, 2010.
[19] J. A. Snyman, Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms, Springer Publishing, ISBN 0-387-24348-8, 2005.
[20] J. Friedman, T. Hastie, R. Tibshirani, Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Volume 33, Issue 1, 2010.
[21] J. Friedman, T. Hastie, N. Simon, R. Tibshirani, glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models, R package, March 17, 2016.
[22] M. Bramer, Principles of Data Mining, Springer Publishing, ISBN 978-1-84628-765-7, 2007.
[23] T. K. Ho, Random Decision Forests, Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 278-282, 1995.

[24] L. Breiman, Random Forests, Machine Learning, 45(1), 5-32, 2001.
[25] Y. Amit, D. Geman, Shape Quantization and Recognition with Randomized Trees, Neural Computation, 9, 1545-1588, 1996.
[26] L. Breiman, A. Cutler, A. Liaw, M. Wiener, randomForest: Breiman and Cutler’s Random Forests for Classification and Regression, R package, July 10, 2015.
[27] L. Toloşi, T. Lengauer, Classification with Correlated Features: Unreliability of Feature Ranking and Solutions, Bioinformatics, 27(14), 1986-1994, 2011.
[28] H. Akaike, A New Look at the Statistical Model Identification, IEEE Transactions on Automatic Control, 19(6), 716-723, 1974.
[29] T. Martinussen, T. H. Scheike, Dynamic Regression Models for Survival Data, Springer, 2006.
