
Churn Prediction in the Mobile

Telecommunications Industry
An application of Survival Analysis in Data Mining
Master Thesis

Author: L.J.S.M. Alberts, BSc.


Supervisor: Dr. R.L. Westra
Graduation committee:
Dr. Ir. R.L.M. Peeters (Dep. of Mathematics, Maastricht University)
Prof. R. Braekers (Center for Statistics, Hasselt University)
C. Meijer (Vodafone Netherlands)

Maastricht University
Department of General Sciences
Maastricht, September 2006

Acknowledgements
For my master's degree in Knowledge Engineering & Computer Science (Operations Research) at Maastricht University, I conducted research at the mobile telecommunication company Vodafone Netherlands in Maastricht. The main aim of this research was to gain more insight into survival analysis and its application as a predictive model. I considered translating theoretical knowledge to a real-life context both challenging and very appealing. First, I would like to thank my supervisor dr. Ronald Westra and Carel Meijer of Vodafone for their guidance during my research and for giving me the opportunity to carry out this assignment. I would also like to thank professor Roel Braekers of Hasselt University for answering my questions about survival analysis. Further, I would like to thank Driek Maas and everyone at Information Management. Last but not least, I would like to thank my family and my girlfriend for their support, interest and understanding during this period.

Abstract
Recently, the mobile telecommunication market in the Netherlands has changed from a rapidly growing market into a state of saturation and fierce competition. The focus of telecommunication companies has therefore shifted from building a large customer base to keeping customers in house. Customers who switch to a competitor are so-called churned customers. Churn prevention, through churn prediction, is one way to keep customers in house.
In this study we focus solely on prepaid customers. In contrast to postpaid customers, prepaid customers are not bound by a contract. The central problem concerning prepaid customers is that the actual churn date is in most cases difficult to assess. This is a direct consequence of the difficulty of providing an unequivocal definition of churning and of a lack of understanding of churn behaviour. To overcome this problem, a custom and flexible churn definition is proposed here.
The predictive churn model presented in this study is based on the theory of survival analysis. Survival analysis is predominantly used in the medical sciences to examine the influence of variables on the length of survival of patients. In survival analysis, the time until the occurrence of a well-defined event is modelled. In the present case, the event of interest is churn. This research focuses on the extended Cox model, a variant of the original proportional hazards model, which is used here for churn modelling. Since survival models are not designed to act as predictive models, some adjustments had to be made.
To be able to compare the performance of the extended Cox model with the established predictive models, a decision tree is also considered.
Both models performed approximately equally well, with a sensitivity ranging from 93% to 99% and a specificity ranging from 92% to 97%, depending on the model and the churn definition. The extended Cox model can be considered a perfectly viable alternative to the established predictive models and offers some unique qualities.

List of definitions
Call credit: prepaid customers pay for their calls from this credit.
Churn: a term used by companies to denote the loss of customers.
Commitment date: the date on which a new customer is registered.
Postpaid customer: a customer who is bound by a contract and pays
a monthly sum in exchange for free call minutes.
Prepaid customer: a customer who is not bound by a contract and who
only pays for the calls he makes.
Recharge: the term used for raising the call credit.
Sim-card: a small electronic chip on which the mobile phone number is
stored.

Contents

1 Introduction
  1.1 Churn prediction modelling
  1.2 Prepaid versus postpaid
  1.3 Churning
  1.4 Research questions
  1.5 Outline

2 Data acquisition and preparation
  2.1 Data acquisition
  2.2 Data preparation
  2.3 Derived variables

3 Operational churn definition

4 Theory of survival analysis
  4.1 Survival and hazard functions
  4.2 Recording survival data
  4.3 Censoring

5 A survival model as predictive churn model
  5.1 Cox model
  5.2 Extended Cox model
  5.3 Survival data
  5.4 Discrete event times
  5.5 Lagging
  5.6 Principal component regression
  5.7 Handling nonlinearity
  5.8 Predictive score method

6 A decision tree as predictive churn model
  6.1 Finding the optimal tree size
  6.2 Oversampling

7 Tests and results
  7.1 Tests
  7.2 Results

8 Conclusions and recommendations
  8.1 Conclusions
  8.2 Recommendations for future research

Appendix A: Derived variables
Appendix B: Decision tree for churn definition 1
Appendix C: Decision tree for churn definition 2

Chapter 1
Introduction
Recently, the mobile telecommunication market in the Netherlands has changed from a rapidly growing market into a state of saturation and fierce competition. The focus of telecommunication companies has therefore shifted from building a large customer base to keeping customers in house. For that reason, it is valuable to know which customers are likely to switch to a competitor in the near future. Those customers are so-called churned customers. Since acquiring new customers is more expensive than retaining existing ones, churn prevention can be regarded as a popular way of reducing a company's costs. With this objective, this research is carried out for Vodafone Netherlands. Vodafone is particularly interested in churn prediction for prepaid customers. Prepaid customers are, as opposed to postpaid customers, not bound by a contract. In this study, two different predictive churn models are considered. The first model, the so-called extended Cox model, is based on the theory of survival analysis, whereas the second model, a decision tree, is commonly used in data mining.
In the present study, the focus will be on the extended Cox model. However, in order to establish a direct comparison, we consider the decision tree as well. Both models are ultimately tested on a selection of prepaid customers from the database provided by Vodafone.

1.1 Churn prediction modelling

Churn prediction is currently a relevant subject in data mining and has been applied in the fields of banking [5, 14], mobile telecommunication [10, 7], life insurance [13], and others. In fact, all companies dealing with long-term customers can take advantage of churn prediction methods.
Models such as neural networks, logistic regression and decision trees
are common choices of data miners to tackle this churn prediction problem. These models are trained by offering snapshots of churned and non-churned customers. The goal is to distinguish churners from non-churners as accurately as possible. When new customers are offered, the model attempts to predict to which class each customer belongs. Although in general this approach gives satisfying results, the time aspect often involved in these problems is neglected. For instance, in the case of life insurance, customers are more likely to churn after a year than after a period of 5 years.
We propose an approach which makes it possible to incorporate this time
aspect. In order to do so, a collection of statistical methods called survival
analysis is used. Survival analysis is predominantly used in medical sciences
to examine the influence of explanatory variables on the length of survival of
patients, hence the name survival analysis.
Survival analysis has different synonyms depending on its application.
For example, survival analysis is labeled reliability analysis in engineering
and is known under the term duration analysis in economics. In addition,
survival analysis applied in the field of data mining is also known as survival
data mining [15].

1.2 Prepaid versus postpaid

Although churn prediction is often associated with the mobile telecommunication market, the problem described in this thesis is not widely covered
in scientific literature. This is because almost all articles concentrate on
so-called postpaid customers, customers who are bound by a contract. In
contrast to postpaid customers, prepaid customers are far more difficult to
deal with when it comes to churn prediction.
Firstly, as opposed to postpaid customers, prepaid customers do not pay
a monthly sum and do not receive any free calling minutes. In order to
make outgoing calls, they have to buy call credit. When this call credit is
used up, they have to recharge their credit to make outgoing calls again.
Incoming calls, on the other hand, can always be received, regardless of the
current call credit. These properties result in calling behaviour that is very hard to predict. Most prepaid customers are irregular users, for whom months of zero usage frequently occur. This behaviour can easily be explained by the fact that they are not charged for not making any calls. Obviously, this is in contrast with postpaid customers, who pay a monthly sum.
Secondly, postpaid customers are obliged to provide personal information like their name, gender and address. As a consequence, a postpaid phone number is linked to a single customer. Prepaid customers, on the other hand, do not necessarily need to provide this personal information and can therefore stay anonymous. Because of this, sim-cards, which contain the phone number of the customer, can easily be switched between individuals without having to inform the operator. Different calling behaviours are then registered on a single sim-card and thus on a single phone number. Such numbers are misleading if we assume that a phone number is linked to a single and unique customer.
Finally, in most cases postpaid contracts have a fixed length of one or two years. Since people are likely to churn when their contract expires, the expiration date can be regarded as a very reliable predictor for churning. Unfortunately, this predictor is unavailable for prepaid churn.
In the following section, the most important difference between postpaid
and prepaid customers will be addressed.

1.3 Churning

Churn is usually divided into voluntary and involuntary churn. Voluntary churn refers to churn due to the customer's choice, for example switching to a competitor or switching to a postpaid contract. Involuntary churn concerns customers who are disconnected by the operator, typically due to non-payment or fraud. Since fraudulent customers are rare, they can be neglected in this study. Involuntary churn due to non-payment is not applicable here, as we are dealing with prepaid customers only. However, involuntary churn is possible when customers do not recharge for a long period of time.
One of the most important issues with prepaid churn is probably the lack of a good definition for it. When considering postpaid churn, the deactivation date, i.e. the date that a customer is disconnected from the network, is equal to the churn date. After all, this is the actual date a customer stops using the operator's services. In the case of prepaid churn, however, the deactivation date does not necessarily have to match the churn date. This can be made clearer by considering the different states a prepaid customer can be in. We distinguish four states, namely
Status 1: Normal use
Status 2: No credit (call credit has expired or call credit is zero)
Status 3: Recharge only
Status 4: Deactivation

Figure 1.1: The different states a prepaid customer can pass through.

The different states are shown in Figure 1.1. After recharging, a customer in
status 2 or 3 returns to status 1 again.
In general, it takes a long period before a prepaid customer is actually disconnected from the network. In many cases customers have churned long before they are disconnected. This is exactly why the deactivation date is not a suitable indicator of churn. Thus, we
are interested in a churn definition which indicates when a customer has
permanently stopped using his prepaid sim-card. Furthermore, this stop
has to be indicated as early as possible and before the deactivation date is
reached. In Chapter 3 such a churn definition is proposed. This definition is
used to train the predictive models.

1.4 Research questions

The research question is stated as follows:


Is it possible to make a prepaid churn model based on the theory of survival
analysis?
In order to address this question, the following three sub-questions are formulated:
What is a proper, practical and measurable prepaid churn definition?
How well do survival models perform in comparison to the established predictive models?
Do survival models have an added value compared to the established predictive models?
We are thus interested in a survival model that can be used to predict prepaid churn. This model is proposed in Chapter 5. The problem addressed in the first sub-question has already been mentioned in the previous section. To overcome this problem, a churn definition is proposed in Chapter 3. In order to investigate sub-questions 2 and 3, a second predictive model is required. This model is discussed in Chapter 6. Tests are performed in order to answer the second sub-question. These are discussed in Chapter 7. In Chapter 8 we return to the research question and the sub-questions, where we attempt to answer them.

1.5 Outline

This work is structured as follows. In Chapter 2 the data acquisition and preparation procedure is described. The data is acquired from a database provided by Vodafone Netherlands. A visualization tool is used to obtain insight into the data. Further, a number of derived variables are proposed which describe customer behaviour in a better way.
As a consequence of the unavailability of exact churn dates, a churn definition is required. This definition is proposed in Chapter 3 and is used to provide the dataset with labels indicating churn or non-churn. These labels are essential to train the predictive models.
In Chapter 4 a brief theoretical background of survival analysis is given. This is done in order to gain a better understanding of the terminology used in Chapter 5, where the survival model is introduced. Chapter 5 discusses the theoretical background of the model and its application as a predictive model. In Chapter 6 the second predictive model is described. This model is called a decision tree and is commonly used in churn prediction modelling. The decision tree is used here to compare its performance with the performance of the survival model. Chapter 7 provides the tests and results of the comparison between the two models. The tests are based on a selection of prepaid customers from the database provided by Vodafone. Finally, in Chapter 8 the conclusions and recommendations are given.


Chapter 2
Data acquisition and
preparation
The first step in predictive modelling is the acquisition and preparation of data. Having the correct data is as important as having the correct method [2].

2.1 Data acquisition

Vodafone has provided a database containing all prepaid and postpaid customers. The data stored in this database is used for predictive modelling. The data is already aggregated by month. For modelling purposes this is the desired level of aggregation. Daily or weekly aggregated data would not offer any advantages over monthly aggregated data, since prepaid customers are characterized by inconstant and irregular behaviour, as previously discussed in Section 1.2.
Not all fields of the database are suitable for modelling purposes. Fields with unique values, like addresses or personal unlock codes, are left out. These do not have predictive value as they uniquely identify each row [2]. Fields with only one value are also left out, as these represent a negligible part of the data [2]. Finally, fields with too many null values are excluded as well. The latter applies to fields like sex and age, since prepaid customers are not obliged to provide personal information. In addition, personal data that is available is not necessarily reliable. Geographical and personal data are therefore excluded. The selected data is thus only related to usage and billing information, for instance the number of outgoing calls per month or the number of recharges per month.
The database is updated with the latest available data every month. However, if a customer does not show any activity in a month, nothing is added to the database. Although this may increase storage efficiency, it also makes it very difficult to detect a possible update failure. After all, an update failure can wrongly suggest total inactivity of a customer. In spite of this, this problem has been disregarded in this study.
A selection of 20,000 customers with a commitment date between April and July 2005 is taken from the database. This corresponds to 12 to 15 months of history per customer. Longer histories were not possible due to the limited history of the recharge data.

2.2 Data preparation

The large flat data file obtained from the database is represented in an object-oriented manner. That is, each customer is stored as an object whose attributes are the selected fields from the database. These attributes will also be referred to as variables from now on. Furthermore, customer attributes can be time dependent or time independent. The attribute customer id, for instance, is a time-independent variable. On the other hand, the variable number of outgoing calls per month differs from month to month and is therefore time dependent. This object-oriented representation gives us greater flexibility and a better overview of the data. In addition, the data is better organised and a single customer can be examined very easily. Figure 2.1 depicts a diagram of the object-oriented representation of the data. A visualization tool has been built to further benefit from this representation. It plots two variables of a single customer over time. The different variables and customers can be chosen from a list. Figure 2.2 is a screenshot of the user interface, which is built with Matlab 7. This tool makes it possible to visually examine the behaviour of hundreds of customers. It provides insight into potentially interesting and worthwhile variables. Furthermore, the visualization tool is also used to examine different types of churn definitions.
The set of customer objects needs to be turned into the right format before it can be used by the predictive models. The data has to be represented differently for each of the two models. In the case of the decision tree model, all data must be in a single table where each row corresponds to a single month of a customer; different rows can hold different customers and different months. In the case of the survival model, the dataset must contain the total history of a customer.


Figure 2.1: Object-oriented representation of the data.

2.3 Derived variables

Derived variables are new variables based on the original variables. The most effective derived variables are those that represent something in the real world, such as a description of some underlying customer behaviour [2]. In fact, all original variables are already derived variables, since they are all aggregated by month.
There are some general classes of derived variables, like total values, average values, and ratios. We have chosen to employ the average value over the last three months as a derived variable type. Also, the ratio between the average over the last three months and the average over all months before is used as a derived variable. In addition, a number of specific derived variables are used. Some examples are:
The number of months since the last recharge.
The number of months since the last voicemail call.
The ratio of incoming and outgoing calls.
A list of all derived variables can be found in Appendix A. The derived variables explain customer behaviour better than the original variables do. For instance, knowing how many months ago a customer last called his voicemail is much more informative than knowing whether he called his voicemail this month.
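To make this concrete, the sketch below shows how such derived variables could be computed in R for a single customer. The function and variable names are hypothetical and the input is an illustrative vector of monthly call counts, not the actual database fields.

# Hypothetical sketch: derive behavioural variables from a customer's
# monthly history (oldest month first); names are illustrative only.
derive_vars <- function(calls) {
  n     <- length(calls)
  avg3  <- mean(calls[(n - 2):n])      # average over the last three months
  avgB  <- mean(calls[1:(n - 3)])      # average over all months before that
  ratio <- if (avgB > 0) avg3 / avgB else NA
  last  <- max(which(calls > 0), 0)    # last month with any outgoing call
  list(avg_last3         = avg3,
       ratio_recent      = ratio,
       months_since_last = if (last == 0) n else n - last)
}

derive_vars(c(10, 8, 0, 5, 0, 0, 2))   # example customer history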

Figure 2.2: A screenshot of the data visualization tool.


Chapter 3
Operational churn definition
In Chapter 1 we stated that we prefer a churn definition that indicates when
a customer has permanently stopped using his sim-card as early as possible.
In this chapter such a churn definition is proposed. This is called the operational churn definition. An operational churn definition is necessary since
the proposed models in this study are supervised models. Supervised models
require a labeled dataset for training purposes.
The most obvious operational definition of prepaid churn is based on a number of successive months with zero usage. After all, such a definition bears great similarity to our intended churn definition. For instance, a churned customer could be defined as not having any incoming or outgoing calls in the past three months. Although this is an intuitive operational definition, we do have to take the following considerations into account.
First, the definition of zero usage has to be refined. Incoming calls are also registered when the user does not answer his phone. Besides, incoming duration is registered when a message is recorded on voicemail. This means that a prepaid customer who has already churned can still have incoming calls and incoming duration from uninformed outsiders. Therefore, defining zero usage as zero outgoing duration plus zero incoming duration cannot be considered optimal. With regard to incoming duration, we employ a threshold. This threshold is set to the value of 30 seconds per month, which turned out to be a practical value. In sum, zero usage is defined as a total incoming duration of less than 30 seconds per month in combination with zero outgoing duration.
Secondly, the operational churn definition itself has to be refined. Most prepaid customers are irregular users that often display zero usage. In addition, several successive months of zero usage occur frequently as well. As a consequence, the above-mentioned churn definition will not hold for every customer. Low-usage customers could quickly be labeled as churned without actually being churned. To overcome this problem, a flexible churn definition is proposed.


This definition allows for a larger range of customer behaviours. The definition consists of two parts, α and β, where α is a fixed value and β is equal to the maximum number of successive months with zero usage. α + β is then used as a threshold to distinguish churn behaviour from non-churn behaviour. An example is provided in Figure 3.1. In this example α is set to a value of 3 and is marked in grey; β is equal to 2 and is marked in black.

Figure 3.1: An example of the operational churn definition.

At month 13 the customer is labeled as churned, since there are (2 + 3 =) 5 successive months with zero usage.
The churn definition can be interpreted as follows. Although the customer of Figure 3.1 did not show any usage during the fifth and sixth month, at month 7 he proved that he was not churned. This means that another two successive months of zero usage would not guarantee any churn behaviour. This is indicated by the variable β. The constant α is added to indicate a point where the customer would churn. Two different values for α are used to label the training set, namely α = 2 and α = 3. From now on these will be referred to as churn definition 1 and 2 respectively. Customers with β ≥ 5 are excluded, as the churn definition becomes less accurate and intuitive for longer durations.
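A minimal R sketch of this labeling rule is given below. The function name, its inputs, the interpretation of β as the longest zero-usage run followed by renewed activity, and the example history are all illustrative, not the code used in the study.

# Sketch of the operational churn definition. 'zero' is a logical vector,
# one entry per month, TRUE if the month had zero usage (no outgoing
# duration and less than 30 seconds of incoming duration).
label_churn <- function(zero, alpha = 3) {
  runs <- rle(zero)
  # beta: longest zero-usage run that was followed by renewed activity
  ended <- runs$values & seq_along(runs$values) < length(runs$values)
  beta  <- if (any(ended)) max(runs$lengths[ended]) else 0
  # length of the current, trailing run of zero-usage months
  trail <- if (tail(runs$values, 1)) tail(runs$lengths, 1) else 0
  trail >= beta + alpha              # TRUE: label the customer as churned
}

# A customer like the one of Figure 3.1: activity, two zero months,
# renewed activity, then five successive zero months (2 + 3 = 5 reached)
label_churn(c(FALSE, FALSE, TRUE, TRUE, FALSE, rep(TRUE, 5)), alpha = 3)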

Chapter 4
Theory of survival analysis
Survival analysis is a collection of statistical methods which model time-to-event data. A thorough introduction can be found in [12]. Central is the occurrence of a well-defined event. The variable of interest is the time until this event occurs. This is in contrast with approaches like regression methods and neural networks, which model the probability of an event. Depending on its application, the event of interest can be the failure of a physical component or the time to death. In the context of data mining the event of interest is typically the time until churn or the time until the next purchase [4].

4.1 Survival and hazard functions

Let T ≥ 0 be the random variable that denotes the time at which the event occurs and let it have density f(t) and distribution function F(t). The survival function is then defined as

S(t) = \Pr(T > t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx.    (4.1)

We note that S(t) is a monotone decreasing function with S(0) = 1. The survival at time t is the probability that a subject will survive to that point in time. The hazard rate function \lambda(t) is defined as

\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr(t < T < t + \Delta t \mid T > t)}{\Delta t} = \frac{f(t)}{1 - F(t)}.    (4.2)

The hazard function is also known as the instantaneous failure rate, force of mortality, and age-specific failure rate. \lambda(t)\Delta t can be interpreted as the instantaneous probability of having an event at time t, given that one has survived (i.e. not had an event) up to time t. The functions f(t), F(t), S(t) and \lambda(t) give mathematically equivalent specifications of the distribution of T.

T.

4.2 Recording survival data

Survival data is recorded in the following way. Subjects are observed for a
certain period of time. During this period, the time of the event of interest
is registered. However, it is possible that the event cannot be registered
because it does not occur during the period of observation. This is called
censoring and is discussed in the following section.
Survival data requires both a well-defined origin of time and a
time scale. The origin of time is the moment from which the observation
starts. This can be any particular event, for example birth, or in our case
the commitment date of a customer. The time scale is the frequency by
which a subject is checked on the occurrence of the event. Common scales
are based on years or months, depending on the nature of its application. In
this study we apply a time scale based on months, since the data we use is
monthly aggregated.
Let T_i denote the time at which the event occurs for the i-th subject and let \tilde{T}_i denote the observed time when dealing with censoring. Let \delta_i be a 0/1 indicator which is 1 if T_i is observed and 0 if the observation is censored. The pair (\tilde{T}_i, \delta_i) is then used to describe a single observation.

4.3 Censoring

Censoring is an essential element of survival analysis. It occurs when there are incomplete observations. Three types of censoring can be distinguished:
Right censoring
Left censoring
Interval censoring
These are illustrated in Figure 4.1. The most common type of censoring is right censoring. Right censoring occurs when the event time is larger than the period of observation. In that case, it is only known that the event time is larger than the last observed time. In the case of left censoring, the event occurs before the period of observation. Then, the event time is only known to be before a certain point in time. Finally, interval censoring occurs when the exact event times are not available and are only known to lie in a certain interval. In the present study we are dealing with right censored data, since not every customer churns during observation.

Figure 4.1: Different types of censoring.


Chapter 5
A survival model as predictive
churn model
In this chapter, a survival model for churn prediction is proposed. There are many different types of survival models. In this approach, we are particularly interested in survival models that incorporate a regression component, since these regression models can be used to examine the influence of explanatory variables on the event time. In this context, such explanatory variables are often called covariates. There are two commonly used classes of regression models, namely so-called accelerated failure time models and proportional hazards models. Accelerated failure time models are based on a survival distribution. Commonly employed distributions are the Weibull, exponential and log-logistic distributions. In accelerated failure time models the regression component affects survival time by rescaling the time axis. This is illustrated in Figure 5.1 (left). The proportional hazards model is also known as the Cox model. It is the most popular survival regression model available, since it does not make any assumptions about the survival function, as opposed to accelerated failure time models. It was introduced by David Cox in 1972. In the Cox model, the regression component affects the hazard curve through multiplication. This is illustrated in Figure 5.1 (right). Many improvements and adjustments have been made to the Cox model since its introduction. Because of these improvements, which will be discussed later on, the Cox model is a strong candidate for churn prediction.


Figure 5.1: Left: The accelerated failure time model. Right: The proportional hazards model.

5.1 Cox model

Let X_{ij} denote the j-th covariate of the i-th subject, where i = 1, 2, ..., n and j = 1, 2, ..., p. One can think of X as an n × p matrix, where X_i denotes the covariate vector of subject i. The Cox model specifies the hazard rate as

\lambda_i(t) = \lambda_0(t) e^{\beta' X_i},    (5.1)

where \lambda_0 is called the baseline hazard and \beta is a p × 1 vector of regression coefficients. The latter is familiar from multivariate regression. The baseline hazard \lambda_0 is an unspecified nonnegative function over time, which can be interpreted as the average hazard at time t. In contrast to accelerated failure time models, the baseline hazard does not have to follow a particular distribution. This is a significant advantage over parametric survival models, which are restricted by the shape of a particular distribution. The hazard rate at time t is thus the product of a scalar, e^{\beta' X_i}, and the baseline hazard at time t. To put it differently, the covariates increase or decrease the hazard function by a constant relative to the baseline hazard function.
The Cox model is also referred to as the proportional hazards model, since the hazard ratio for two subjects with covariate vectors X_i and X_j, given by

\frac{\lambda_i(t)}{\lambda_j(t)} = \frac{\lambda_0(t) e^{\beta' X_i}}{\lambda_0(t) e^{\beta' X_j}} = \frac{e^{\beta' X_i}}{e^{\beta' X_j}},    (5.2)

is constant over time. This implies that the covariates must have the same effect on the hazard at any point in time.

Estimation of the parameter \beta is based on the partial likelihood function and is performed without specifying or even estimating the baseline hazard. The term partial likelihood is used because the likelihood formula considers probabilities only for those subjects who have an event, and does not explicitly consider probabilities for those subjects who are censored [9]. The likelihood can be written as a product of several likelihoods, one for each event time. The likelihood at time j denotes the likelihood of having an event at time j, given survival up to time j. The set of individuals at risk is called the risk set and is denoted by R(j). The partial likelihood is then given by

L(\beta) = \prod_i \frac{e^{\beta' X_i}}{\sum_{j \in R(i)} e^{\beta' X_j}}.    (5.3)

This equation holds as long as there are no tied event times, or ties for short. Ties are events that occur exactly at the same time. Adjustments to Equation 5.3 are necessary to deal with them. This will not be discussed here; an extensive explanation can be found in [12].

5.2 Extended Cox model

The Cox model, as discussed in the previous section, is not suitable as a predictive model for prepaid churn. The main reason is that the explanatory variables, or covariates, are fixed over time. This implies that we can only use explanatory variables that are time-independent, such as address or sex. Since we do not have these kinds of variables at our disposal (due to reasons explained in Chapter 2) and, more importantly, because we want to use variables that capture the (current) behaviour of customers, the use of time-independent covariates is far from optimal.
Fortunately, several improvements and adjustments have been made to the Cox model over the years. With regard to the present work, we are particularly interested in the improvements made by Andersen and Gill [17]. This adapted model is often referred to as the extended Cox model. The extended Cox model includes the ability to accommodate left-censored data, time-varying covariates, and multiple events [17]. The ability to include time-varying covariates makes it possible to use covariates that capture the behaviour of customers much better. Another advantage is that the proportionality assumption does not have to hold in this case, since the covariates depend on time. Equation 5.1 is slightly adjusted to let X_i depend on time. The extended Cox model can be defined as

\lambda_i(t) = \lambda_0(t) \exp\!\left( \sum_{k=1}^{p_1} \beta_k X_{ik} + \sum_{l=1}^{p_2} \delta_l X_{il}(t) \right).    (5.4)

Notice that the covariates are now split into p_1 time-independent covariates and p_2 time-dependent covariates. In order to include time-varying covariates in the Cox model, a counting process formulation is required. A counting process is a stochastic process starting at 0 whose sample paths are right-continuous step functions with jumps of height 1 [17]. Recall that the pair (\tilde{T}_i, \delta_i) is normally used to represent an observation. The counting process formulation replaces this with (N_i(t), Y_i(t)), where N_i(t) is the number of observed events in [0, t] for subject i, whereas Y_i(t) is 1 if subject i is under observation and at risk at time t, and 0 otherwise. Right censored data is a special case of this formulation. Two examples are shown in Figure 5.2; these correspond to (\tilde{T}, \delta) pairs (3, 0) and (4, 1). The counting process formulation immediately makes it possible to include multiple event times and multiple at-risk intervals. The application of this formulation is not restricted to survival analysis and can also be used to model non-stationary Poisson processes, Markov processes and other processes [17].

Figure 5.2: The counting process formulation. Left: Right censoring. Right:
No censoring.
The extended Cox model is already implemented in the statistical modelling language R. R features many additional packages and a large online community. The event history analysis package is used here for modelling.
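As an illustration of the counting process formulation, the sketch below uses the widely available survival package rather than the exact package and code of this study; the data frame, its columns and the covariate are invented for the example.

library(survival)

# Counting process format: one row per customer-month, with the
# time-varying covariates measured in that interval.
surv_data <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  start     = c(0, 1, 2, 0, 1),   # interval (start, stop] in months
  stop      = c(1, 2, 3, 1, 2),
  churn     = c(0, 0, 1, 0, 0),   # event indicator per interval
  recharges = c(2, 1, 0, 3, 2)    # illustrative time-varying covariate
)

# Extended Cox model with a time-varying covariate; ties = "exact"
# corresponds to the discrete method for ties discussed in Section 5.4.
fit <- coxph(Surv(start, stop, churn) ~ recharges,
             data = surv_data, ties = "exact")
summary(fit)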


5.3 Survival data

Since we are dealing with survival analysis, the data must be represented
as survival data. As discussed in Section 4.2, there are several recording
properties that have to be defined:
The origin of time is chosen to be equal to the commitment date of the
customer.
The customers are followed for a maximum period of 15 months.
The time scale is set to months.
The proportion of churned and censored customers in the training set is made
equal, because this appears to provide the best results [14].

5.4 Discrete event times

Ties are events that occur exactly at the same time. In order to compute the partial likelihood, the event times themselves are not important; only their ordering is of importance. Due to the process of ordering, tied event times are only registered once. This results in a misrepresentation of the survival data. There are several ways to deal with this problem. The methods are only mentioned here; a detailed explanation can be found in [17]. According to the so-called exact method, time is considered to be continuous. Ties are then a result of imprecise measurement and will rarely occur. The discrete method, on the other hand, assumes that time is discrete, and that ties really occur at the same time. Since events are only measured at the end of each month, we are dealing with discrete event times. The discrete method provided in R is used to handle tied event times.

5.5 Lagging

An important assumption of the extended Cox model is that the effect of a time-dependent covariate X_i(t) on the survival probability at time t depends on the value of this covariate at that same time t, and not on a value at an earlier or later time [9]. Since the covariate values X_i(t) are not available until time t, the model can only predict the hazard for time t. In our churn application this can be easily interpreted. The data of a customer is only completely available when a month has ended. The hazard is then predicted for that month. In fact, the prediction is never up to date. If the customer were to churn during that month, this could only be predicted afterwards, when it is already too late. Our goal is thus to forecast a future time t + L, where L is typically 1 or 2. Notice that this would not be an issue if we did not incorporate time-dependent covariates. After all, Equation 5.1 shows that the time aspect is only represented by the baseline hazard, and not by the covariates. To be able to forecast a future time using time-dependent covariates, these covariates must be forecasted or lagged by L units. We have chosen a lag-time of 1. This means that the data of the previous month is used to predict the hazard of the current month. The extended Cox model can be written to allow for a lag-time modification of any time-dependent covariate as follows:

\lambda_i(t) = \lambda_0(t) \exp\!\left( \sum_{k=1}^{p_1} \beta_k X_{ik} + \sum_{l=1}^{p_2} \delta_l X_{il}(t - L_l) \right),    (5.5)

where L_l denotes the lag-time for time-dependent covariate l.
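A sketch of how such lagging could be implemented on the monthly data is given below; the data frame and its columns are illustrative stand-ins, not the study's actual variables.

# Sketch: lag a time-varying covariate by L = 1 month within each customer,
# so that the hazard of month t is modelled from the data of month t - 1.
dat <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  month     = c(1, 2, 3, 1, 2),
  recharges = c(2, 1, 0, 3, 2)   # illustrative covariate
)

lag_by_one <- function(x, L = 1) c(rep(NA, L), head(x, -L))

dat$recharges_lag <- ave(dat$recharges, dat$id, FUN = lag_by_one)
# The first L months of each customer have no lagged value and are dropped.
dat <- dat[!is.na(dat$recharges_lag), ]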

5.6 Principal component regression

The main aim of principal component analysis (PCA) is to reduce the dimensionality of the dataset while retaining as much as possible of the variation present in it. This is achieved by transforming the original variables into a new set of variables, called the principal components. Each principal component is a linear combination of the original variables. The principal components are uncorrelated and ordered, which means that the first principal components retain most of the variation present in the dataset [8]. The principal components are the eigenvectors of the correlation matrix of the mean-centred dataset. The corresponding eigenvalues indicate the importance of the principal components. A more in-depth explanation of principal component analysis can be found in [8]. Figure 5.3 shows a scree plot of the PCA performed on the variables of Appendix A. A scree plot is a plot of the eigenvalues of the ordered principal components.
Figure 5.3: The principal components versus the explained variance.

In principal component regression the principal components are used as regressors instead of the original explanatory variables. There are two reasons to do so. Firstly, the explanatory variables are often highly correlated, which may cause inaccurate estimates of the regression coefficients [3]. This can be avoided by using the principal components in place of the original variables, since the principal components are uncorrelated. Hierarchical variable clustering performed on the original variables and the first 20 principal components is used to visualize this. Variable clustering is a technique for assessing collinearity and redundancy; the Spearman correlation is used as a distance measure. The results are shown in Figure 5.4. Notice that the most correlated cluster of the principal components has a value of 0.3, whereas the most correlated cluster of the original variables is near a value of 0.9. Collinearity is thus greatly reduced.
Secondly, the dimensionality of the regressors is reduced by taking only a subset of principal components for prediction [3]. We used the first 20 components as regressors, since the contribution of the last 12 components is negligible (Figure 5.3). Although this can be considered a safe margin, it is known that the principal components with the largest variances are not necessarily the best predictors [3]. In the extended Cox model based on churn definition 1, all principal components are significant. In the model based on churn definition 2 there are two non-significant principal components; these are removed from the model. There is no risk of overfitting, since we utilize a large dataset. A rule of thumb for how many variables can be used without the risk of overfitting is given by P < m/10, where P denotes the number of variables and m the number of events in the dataset [6]. Since we have a dataset with thousands of events, this is not an issue.
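A minimal R sketch of this step, with random stand-in data in place of the real derived variables:

# Sketch: PCA on the (standardized) derived variables; random stand-in data
# with 32 columns, matching the 20 retained plus 12 discarded components.
X   <- matrix(rnorm(1000 * 32), ncol = 32)
pca <- prcomp(X, center = TRUE, scale. = TRUE)  # PCA on the correlation matrix

# Scree information: proportion of variance explained per component
summary(pca)$importance["Proportion of Variance", 1:5]

# The first 20 principal components are then used as Cox regressors
scores <- pca$x[, 1:20]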


Figure 5.4: Hierarchical variable clustering performed on the original variables and the principal components.

5.7 Handling nonlinearity

In this section the Cox model is tested for nonlinearity. Nonlinearity occurs when the regression part of the model has an incorrectly specified functional form; the linearity assumption is then violated. This disadvantage of the Cox model is normally associated with linear regression models. Nonlinearity can be detected by plotting the martingale residuals against the covariates. Martingale residuals measure excess events in a subject. For example, a customer who churned earlier than expected by the model will have a positive residual, while a customer who churned later will have a negative martingale residual. Figure 5.5 provides an example of a variable that meets the linearity assumption and one that does not. Smoothing of the plots is done by local linear regression. The nonlinear variables are not removed but are transformed using restricted cubic splines. An explanation of the application of splines can be found in [6].

Figure 5.5: Left: Nonlinearity. Right: Linearity.
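A sketch of such a residual plot in R, on invented data rather than the study's covariates, could look as follows:

library(survival)

# Sketch: checking a covariate for nonlinearity via martingale residuals.
set.seed(1)
dat <- data.frame(time  = rexp(200),            # illustrative survival times
                  event = rbinom(200, 1, 0.5),  # event indicator
                  x     = rnorm(200))           # covariate under inspection
fit <- coxph(Surv(time, event) ~ x, data = dat)

res <- residuals(fit, type = "martingale")
scatter.smooth(dat$x, res,                      # local regression smooth
               xlab = "covariate", ylab = "martingale residual")
# A clearly curved smooth suggests a spline transformation of x is needed.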

5.8 Predictive score method

Survival models are not designed to be used for classification or prediction [1]. Therefore, a specific procedure is required: a predictive score method is used to classify churn events.
The lag-time extended Cox model as discussed above is used for predicting churn in the next month. The predictive score method is based on the hazard function \lambda. Recall that \lambda(t)\Delta t is the instantaneous probability of having an event at time t given that one has survived (i.e. not had an event) up to time t. The hazard is thus a good candidate for measuring churn. A threshold is set to distinguish churn behaviour from non-churn behaviour. This threshold is determined empirically. If the hazard exceeds the threshold, the customer is classified as churned.
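The resulting decision rule is simple, as the sketch below shows; the hazard values and the threshold are invented for illustration.

# Sketch of the predictive score method: flag a customer as churned in
# the months where the predicted hazard exceeds the empirical threshold.
classify_churn <- function(hazard, threshold) as.integer(hazard > threshold)

hazard <- c(0.01, 0.02, 0.01, 0.03, 0.05, 0.06, 0.10, 0.25, 0.60, 0.70)
classify_churn(hazard, threshold = 0.5)  # churn flagged in the last 2 months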
Figure 5.6 shows the estimated survival curve and the estimated baseline hazard for churn definition 2. Notice that since we used α = 3 for churn definition 2, every customer survives the first three months. Figure 5.7 shows an example of a customer that churns at month 11 and the corresponding hazard values, including the threshold.


Figure 5.6: Left: The estimated survival curve. Right: The estimated baseline hazard.

Figure 5.7: Left: An illustration of customer behaviour. Right: The corresponding hazard values including the threshold.


Chapter 6
A decision tree as predictive
churn model
So far we have discussed the extended Cox model. Since we are also interested in its performance compared to commonly used predictive models, a second predictive model is discussed: the so-called decision tree. Decision tree models are widely used in the field of data mining. In Chapter 7, the performance of the decision tree is compared to the performance of the survival model. This is done by applying both models to a new, unseen test set. If, for instance, the survival model turns out to perform much worse than the decision tree, we can conclude that the survival model is a poor predictive model for this specific problem. The decision tree is thus used to put the performance of the survival model in perspective.
Decision trees can be split into classification and regression trees. Classification trees are used to predict a categorical outcome, whereas regression
trees are used in case of a continuous outcome. Since we are dealing with a
binary outcome, i.e. churn, a classification tree is used. In a decision tree
each interior node corresponds to a variable. An arc to a child represents
a possible value of that variable. A leaf represents the outcome given the
values of the variables represented by the path from the root.
One of the advantages of decision trees is that they can be interpreted very easily, since they produce a set of understandable rules. Neural networks, on the other hand, are so-called black boxes. A trained neural network contains several optimized parameters and weights which cannot be interpreted easily. It is therefore not possible to understand why a neural network gives a particular outcome.
A decision tree is a supervised model and thus requires a labeled training
set. In this study, the training set contains all the variables listed in Appendix
A. Each record, or observation, contains a snapshot of a single month of a
customer. The outcome of an observation, churn or non-churn, is indicated
by a 1 or 0 respectively. As with the extended Cox model, the variables here
are also lagged to be able to predict churn in the next month.
Decision trees are built through a process known as recursive partitioning. Recursive partitioning is an iterative process of splitting the data up into two (or more) partitions [11]. A high-level view of this procedure is provided by Figure 6.1.

Figure 6.1: An overview of the recursive partitioning procedure.

A splitting criterion is required to evaluate a split made by an explanatory variable. The splitting criterion used in this study is the Gini index, a measure of the impurity of a split at a particular node. The Gini index is defined as

1 - \sum_k p_{ik}^2,    (6.1)

where k indicates the different classes and p_{ik} denotes the relative frequency of class k in node i. The split with the lowest value of the Gini index is chosen to split the node's observations.
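As a small worked example, the Gini index of a node can be computed directly from the class labels of its observations:

# Gini impurity of a node, given the class labels of its observations
gini <- function(labels) {
  p <- table(labels) / length(labels)  # relative class frequencies p_ik
  1 - sum(p^2)
}

gini(c(0, 0, 0, 1))  # mixed node: 1 - (0.75^2 + 0.25^2) = 0.375
gini(c(1, 1, 1, 1))  # pure node: 0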


6.1 Finding the optimal tree size

When decision trees do not incorporate a stopping criterion, they are likely to overfit the training set. Overfitting occurs when the model tries to capture artefacts and noise present in the training set by adding more and more splits. Splits can be added until the whole training set is explained. The problem of overfitting is therefore more dependent on the number of splits than on the number of explanatory variables [11]. Since an overfitted tree has lost its ability to generalize, it will only have limited predictive power on new, unseen observations.
Overfitting can be avoided by means of two approaches, namely prepruning and postpruning. Prepruning is done by setting a maximum tree depth or by setting a minimum increase in fit per split. This prevents the tree from growing too large. Postpruning is used in this study and is performed by removing branches from a fully grown, and thus probably overfitted, tree. To decide which pruned tree performs best, 10-fold cross-validation is applied. 10-fold cross-validation works as follows. The training set is split into 10 subsets. Each of the 10 subsets is left out in turn; the tree is learned from the remaining data and the result is used to predict the outcome for the subset that has been left out. Each prediction is independent of the data to which it is applied. As a consequence, cross-validation gives an unbiased estimate of the predictive power [11]. Figure 6.2 provides plots of the relative error versus the tree size.

Figure 6.2: Relative error versus tree size. Left: Churn definition 1. Right:
Churn definition 2.


These plots show that the optimal tree size is 18 splits in the case of churn definition 1 and 15 splits in the case of churn definition 2. The final trees are shown in Appendices B and C.
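A sketch of how such a tree could be grown and postpruned with the rpart package follows; the synthetic training data stands in for the labeled monthly observations actually used in this study.

library(rpart)

# Synthetic stand-in for the labeled training set of monthly observations
set.seed(1)
train <- data.frame(churn = factor(rbinom(500, 1, 1/3)),
                    x1 = rnorm(500), x2 = rnorm(500))

tree <- rpart(churn ~ ., data = train, method = "class",
              parms   = list(split = "gini"),              # Gini criterion
              control = rpart.control(cp = 0, xval = 10))  # 10-fold CV

# Postpruning: keep the subtree with the lowest cross-validated error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)
plotcp(tree)  # relative error versus tree size, as in Figure 6.2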

6.2 Oversampling

Oversampling is a technique that alters the proportion of the outcomes in the training set. More specifically, it increases the proportion of the less frequent outcome. This makes a model more sensitive to the outcome that is least represented [2]. Imagine that there are 100,000 observations of which 99,000 are labeled 0 and only 1,000 are labeled 1. Almost any model will classify a new observation as 0, as it will be right 99% of the time. This is exactly the case in our churn application. Churn outcomes are under-represented compared to non-churn outcomes. This can be explained by the fact that churn only occurs in a single month, whereas non-churn occurs in many months. Oversampling is therefore used to increase the frequency of the churn outcomes. The proportions of churn and non-churn represented in the training set are 1/3 and 2/3 respectively. This is a typical split that works well [2]. The actual training set is built with all churn observations and is filled up with twice as many randomly selected non-churn observations.
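The construction of the oversampled training set could look as follows in R; the observation table is an invented stand-in for the real monthly data.

# Sketch: building the oversampled training set (1/3 churn, 2/3 non-churn)
set.seed(1)
obs <- data.frame(churn = rbinom(10000, 1, 0.01),  # churn is rare
                  x     = rnorm(10000))            # stand-in covariate

churners    <- obs[obs$churn == 1, ]
nonchurners <- obs[obs$churn == 0, ]

# All churn observations, plus twice as many random non-churn observations
train <- rbind(churners,
               nonchurners[sample(nrow(nonchurners), 2 * nrow(churners)), ])
prop.table(table(train$churn))  # roughly 1/3 churn, 2/3 non-churn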


Chapter 7
Tests and results
The goal of the tests discussed in this chapter is to gain insight into the performance of the extended Cox model discussed in Chapter 5. This is achieved by offering the same test set to the extended Cox model and the decision tree. By doing so, a direct comparison is made possible. Other tests, like determining the optimal tree size, have been discussed earlier in the appropriate sections.

7.1 Tests

The tests are performed on an AMD Athlon 3000+ processor, running Windows XP and R 2.3.1. Both models are tested with churn definitions 1 and 2 as explained in Chapter 3. This is done in order to see if the churn definition has a significant influence on the performance of the models.
The dataset of 20,000 customers has been split into a training set of 15,000 customers and a test set of 5,000 customers. The test set consists of all the months of history of the 5,000 customers and contains 1,313 churned customers, 3,403 non-churned customers and 284 outliers. The outliers are removed from the test set.
The optimization procedure of the extended Cox model with 10,000 customers takes approximately 18 seconds, whereas the training procedure of the decision tree with 15,000 observations takes approximately 17 seconds. Prediction on the test set takes approximately 12 seconds for both the decision tree and the extended Cox model. Thus, regarding time complexity, the models are roughly equal to each other.
Confusion matrices are commonly used to summarize the performance of predictive models. Tables 7.1 to 7.4 show the confusion matrices for both models and both churn definitions. The upper left and the lower right values are also known as the sensitivity and the specificity respectively.
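For reference, such a matrix with row-wise proportions can be computed in R from the true and predicted labels; the label vectors below are invented for illustration.

# Sketch: confusion matrix with row-wise proportions, and the derived
# sensitivity and specificity measures.
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)   # illustrative true labels
pred  <- c(1, 1, 0, 0, 0, 1, 0, 1)   # illustrative predicted labels

cm <- prop.table(table(True = truth, Predicted = pred), margin = 1)
sensitivity <- cm["1", "1"]  # churners correctly predicted as churn
specificity <- cm["0", "0"]  # non-churners correctly predicted as non-churn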

7.2 Results

                Predicted class
True class      churn    non-churn
churn           0.99     0.01
non-churn       0.06     0.93

Table 7.1: Confusion matrix of the decision tree (churn definition 1).

                Predicted class
True class      churn    non-churn
churn           0.93     0.07
non-churn       0.08     0.92

Table 7.2: Confusion matrix of the extended Cox model (churn definition 1).

                Predicted class
True class      churn    non-churn
churn           0.99     0.01
non-churn       0.02     0.98

Table 7.3: Confusion matrix of the decision tree (churn definition 2).

                Predicted class
True class      churn    non-churn
churn           0.96     0.04
non-churn       0.03     0.97

Table 7.4: Confusion matrix of the extended Cox model (churn definition 2).
We can state that the extended Cox model gives satisfying results, with both a high sensitivity and a high specificity. However, the decision tree performs even slightly better. Evidently, the time aspect incorporated by the extended Cox model does not provide an advantage over the decision tree in this particular problem. Figure 5.6 gives a possible explanation for this. The survival curve is almost a linearly decreasing line after the fourth month, which indicates that the number of churned customers is approximately constant over time. The Cox model could probably have had an advantage over the decision tree if the problem had contained a less smooth survival curve.
At this point, it is necessary to put the results in perspective, since they depend on the churn definition used. This is already revealed in the difference between churn definitions 1 and 2. Both models find it more difficult to predict churn as defined by churn definition 1, as Tables 7.1 and 7.2 show. After all, the longer a customer is allowed to have zero usage before he is considered churned, the easier it is to predict churn. Moreover, a new and different churn definition is likely to yield a different pattern of results. Notice, however, that if the obtained results could be ascribed to the simplicity of the problem, the optimal decision trees shown in Appendices B and C would be far less complicated.


Chapter 8
Conclusions and
recommendations
The conclusions of the present study are based on the research question and sub-questions stated in Chapter 1. By answering these questions we not only provide an overview of the work presented in this study, but also consider the extent to which we succeeded in answering them. Moreover, a number of recommendations for further research in this field are discussed.

8.1 Conclusions

The research question formulated in Chapter 1 is:


Is it possible to make a prepaid churn model based on the theory of survival
analysis?
The sub-questions are formulated as:
What is a proper, practical and measurable prepaid churn definition?
How well do survival models perform in comparison to the established
predictive models?
Do survival models have an added value compared to the established
predictive models?
First, these three questions will be discussed before the central research question is addressed.


What is a proper, practical and measurable prepaid churn definition?
In Chapter 1 it is described why the deactivation date stored in the database is not a proper definition for prepaid churn. To overcome this problem, a churn definition is proposed. This churn definition should indicate, as early as possible, when a customer has permanently stopped using his prepaid sim-card. After extensively examining the behaviour of prepaid customers, an intuitive operational churn definition is proposed in Chapter 3. This definition allows for a large range of customer behaviours. The definition provides the data with churn labels in a consistent and understandable manner. However, for longer periods of zero usage the definition becomes less reliable.
How well do survival models perform in comparison to the established predictive models?
The answer to this question is based on the following setup. The survival
model proposed in this study is an extended Cox model, which is used to
predict prepaid churn. In order to make a direct comparison, a decision tree
is also considered; this model represents the established predictive models.
In terms of predictive power, the extended Cox model scores very well,
with sensitivities ranging from 93% to 96% and specificities ranging from
92% to 97%. However, this is slightly less than the performance of the decision
tree, which for both churn definitions has a sensitivity of 99% and specificities
ranging from 93% to 98%. The advantage of the extended Cox model over models
like regression models, neural networks and decision trees is that it incorporates
the time aspect by means of the baseline hazard. It is thus able to capture
aspects of customer behaviour at specific points in time. Unfortunately, the
extended Cox model cannot fully benefit from this feature in the present problem.
Probably for this reason, the extended Cox model does not outperform the
decision tree.
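
For reference, the two measures reported above are the standard classification rates, with $TP$ and $FN$ the numbers of correctly and incorrectly classified churners, and $TN$ and $FP$ the corresponding counts for non-churners:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{specificity} = \frac{TN}{TN + FP}$$
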
Do survival models have an added value compared to the established predictive models?
Since survival analysis was originally designed to analyse data, it can perfectly
be used for that purpose. It gives insight into the behaviour of customers
over time, which is a useful extension to exploratory data analysis; this type
of analysis is normally not considered in predictive modelling. Besides,
after analysis a survival model can be applied to model the influence of
customer behaviour on the event time, which is an elegant and intuitive
approach. Another advantage of survival models is the ability to capture
aspects of customer behaviour at specific points in time. Common predictive
models, on the other hand, use snapshots of single months to classify churn
and take no notice of the time aspect involved. Furthermore, survival models
can handle censored data and stratification. In stratification a categorical
variable is used to create separate baseline hazards. An advantage is that this
time-independent covariate does not have to meet the proportional hazards
assumption. Stratification can, for instance, be used to subdivide different
customer groups. Finally, survival models can predict the probability of an
event at a future point in time, provided that only time-independent covariates
are used.
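
To make these last two points concrete, the sketch below fits a stratified extended Cox model and reads off a future churn probability. It assumes the R survival package of Therneau [17] and data in counting-process format with one row per customer-month; all column names and values are illustrative, not the variables actually used in this study.

    library(survival)

    # Extended Cox model on counting-process data (start, stop], with a
    # separate baseline hazard per customer segment via strata().
    fit <- coxph(Surv(start, stop, churn) ~ recharge_val + non_usage_interval +
                   strata(segment), data = cust)

    # Churn probability by month 6 for a new customer profile; valid as a
    # future prediction only when the covariates are time-independent.
    profile <- data.frame(recharge_val = 10, non_usage_interval = 0,
                          segment = "segment_A")
    sf <- survfit(fit, newdata = profile)
    summary(sf, times = 6)  # churn probability is 1 minus the survival estimate
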
Is it possible to make a prepaid churn model based on the theory of survival
analysis?
We have shown that it is possible to make a prepaid churn model using the
extended Cox model. This survival model is able to predict churn in the
next month and yields very good results, also in comparison to the decision
tree. It does, however, not outperform the decision tree on this prepaid churn
problem.

8.2 Recommendations for future research

The proposed churn definition provides a plausible indication of churn. However,
there are cases in which this definition is not suitable. In particular, for
customers with a very low level of consumption it is hard to indicate when
they will churn. Therefore, some improvements could be made to the
churn definition. In addition, mobile telecommunication companies should
gain more insight into the (churn) behaviour of prepaid customers. This
information could, for instance, be obtained through qualitative studies. Churn
definitions will then become more reliable, and so will predictive churn models.
Another prepaid phenomenon which could be examined in the future is
the switching of sim-cards between individuals. Sim-cards are often handed
over to relatives or friends when they are not used anymore. Although these
customers are actually churned, this cannot be observed. This causes a
distorted view of churn behaviour and should be acknowledged.
In this study we have shown that survival analysis, and in particular the
extended Cox model, provides very satisfying results. However, since the
application of survival analysis in data mining is relatively new, there is still
considerable room for improvement. Recent research [16] shows that neural
networks are also appropriate for modelling survival data. This could further
improve the accuracy of survival models, as neural networks can handle
non-linear relations.
Furthermore, stratification could be applied to distinguish different customer
profiles and thus improve performance. Since no such profiles were available,
this has not been carried out in this study.
Finally, more research could be performed into the predictive scoring
methods. In this work we used the hazard in combination with a threshold
to identify churn. This is an efficient method which gave satisfying results.
However, new scoring methods, or combinations of existing methods, could be
examined in the future.
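
A minimal sketch of such a hazard-plus-threshold scoring step, continuing from a fitted coxph model as above, could look as follows; the cut-off of 0.8 and the data frame current_month are illustrative assumptions, and in practice the threshold would be tuned on a validation set.

    # Score each customer by the estimated relative hazard and flag those
    # above the cut-off as predicted churners. 'current_month' is assumed to
    # hold the latest covariate values (including the stratum) per customer.
    risk <- predict(fit, newdata = current_month, type = "risk")
    current_month$predicted_churn <- risk > 0.8
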


Bibliography
[1] S. Balcaen and H. Ooghe. Alternative methodologies in studies on business failure: do they produce better results than the classic statistical
methods? Vlerick Leuven Gent Management School Working Paper
Series, 2004.
[2] M. Berry and G. Linoff. Mastering Data Mining. John Wiley and Sons,
New York, USA, 2000.
[3] P. Filzmoser. Robust principal component regression. In Proceedings
of the Sixth International Conference on Computer Data Analysis and
Modeling, pages 132–137, Minsk, Belarus, 2001.
[4] G. E. Bijwaard, P. H. Franses and R. Paap. Modeling purchases as repeated
events. Econometric Institute Reports, 2003.
[5] M. Halling and E. Hayden. Bank failure prediction: A two-step survival
time approach. http://ssrn.com/abstract=904255, 2006.
[6] F. Harrell. Regression Modeling Strategies: With Applications to Linear
Models, Logistic Regression, and Survival Analysis. Springer, New York,
USA, 2001.
[7] J. Ferreira, M. Vellasco, M. Pacheco and C. Barbosa. Data mining techniques
on the evaluation of wireless churn. In ESANN, pages 483–488, 2004.
[8] I. Jolliffe. Principal Component Analysis. Springer, New York, USA,
2002.
[9] D. Kleinbaum and M. Klein. Survival Analysis: A Self-Learning Text.
Springer, New York, USA, 2005.
[10] M. Mozer, R. Wolniewicz, D. Grimes, E. Johnson and H. Kaushansky.
Predicting subscriber dissatisfaction and improving retention in the wireless
telecommunications industry. IEEE Transactions on Neural Networks, 11:690–696,
2000.

[11] J. Maindonald and J. Braun. Data Analysis and Graphics Using R.
Cambridge University Press, Cambridge, UK, 2003.
[12] R. Miller. Survival Analysis. John Wiley and Sons, New York, USA,
1981.
[13] K. Morik and H. Köpcke. Analysing customer churn in insurance data –
a case study. In Proceedings of the 8th European Conference on Principles
and Practice of Knowledge Discovery in Databases, pages 325–336, New
York, USA, 2004.
[14] D. Van den Poel and B. Larivière. Customer attrition analysis for financial
services using proportional hazard models. Working Papers of Faculty
of Economics and Business Administration, Ghent University, 2003.
[15] W. Potts. Survival data mining. http://www.data-miners.com, 2001.
[16] B. Ripley and R. Ripley. Neural networks as statistical methods in
survival analysis. In Artificial Neural Networks: Prospects for Medicine.
Landes Biosciences Publishers, 1998.
[17] T. Therneau and P. Grambsch. Modeling Survival Data: Extending the
Cox Model. Springer, New York, USA, 2000.


Appendix A: Derived variables


average out dur per call = average duration of a single outgoing call
average in dur per call = average duration of a single incoming call
ratio dur per call = ratio between outgoing and incoming duration
total rev = sum of incoming revenue and outgoing revenue
non usage interval = the current number of successive non usage months
max non usage interval = the maximum number of successive non usage
months
compare = a comparison between the non usage interval and max non usage
interval
total rev sum = cumulative total rev
non usage sum = cumulative number of non usage months
total recharge num sum = cumulative number of recharges
total in dur avg = average duration of incoming calls over all past months
total out dur avg = average duration of outgoing calls over all past months
length last voicemail = the number of months since the last voicemail call
length last recharge = the number of months since the last recharge
last recharge val = the last recharge amount
average recharge val = the average recharge amount of all past months
3 months average = the average over the last three months
diff = the ratio between the average over the last three months and the average over all months before (see the sketch after this list)
sms point to point call = number of sms messages
dur = duration in seconds
call = number of calls
total out dur moving diff
total out dur 3 months average
total out call moving diff
total out call 3 months average
total in dur moving diff
total in dur 3 months average
total in call moving diff
total in call 3 months average
international dur moving diff
international dur 3 months average
international call moving diff
international call 3 months average
total rev moving diff
total rev 3 months average
sms point to point call moving diff
sms point to point call 3 months average
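
The "3 months average" and "moving diff" style variables above could, for instance, be derived as in the base R sketch below; the function names are illustrative and the exact construction used in this study may differ in details such as the handling of zero denominators.

    # Trailing three-month average of one customer's monthly series x,
    # assumed to be in chronological order.
    three_month_avg <- function(x) {
      sapply(seq_along(x), function(i) mean(x[max(1, i - 2):i]))
    }

    # Ratio of the recent three-month average to the average over all earlier
    # months (NA for the first month, Inf when the past average is zero).
    moving_diff <- function(x) {
      past <- c(NA, cumsum(x)[-length(x)] / seq_len(length(x) - 1))
      three_month_avg(x) / past
    }
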


Appendix B: Decision tree for churn definition 1

Appendix C: Decision tree for churn definition 2
