Page 2
Text mining is a new area and many questions remain
unanswered.
For example, further work is needed to understand:
Page 3
Text analysis has huge potential for insurance, as shown by research and industrial studies.
There has been recognition for some time now that data about incidents contain information which allows for a proactive risk management approach (Feyer and Williamson 1998).
Page 4
PwC Australia Case Study. Using Text Mining in Insurance.
Client: Major Australian Insurer
[Project cycle: Client issue → Data design → Analysis → Implement → Revisit data & assumptions]
Client Issue:
Perceived inadequacies in the level of information captured by the current injury coding system led to the need to assess the potential value that textual information and text mining could add to the organisation:
To explore the possibilities and benefits of augmenting the existing accident coding system using free text
To see if adding textual information would result in increased precision of claim cost prediction
To suggest how text mining could be used for improvement in other areas of the business
To assist in making a decision on investing in a commercial text mining software package that would best suit the client's needs.
Page 5
Assessing the value of textual information for the client.
Page 6
Data Description.
Page 7
Text Mining Process.
1. Prepare TextData
2. Discover concepts (text mining)
3. Reduce concepts
4. Select predictive concepts
5. Derive domain relevant concepts
6. Predictive modelling with concepts only
7. Evaluate results
8. Predictive modelling with features only
9. Predictive modelling with concepts and features
10. Compare results and conclude
Page 8
Text Mining Process.
Stage 1. Does textual information have predictive value?
Step 2. Discover concepts (text mining).
Concepts are words or word combinations resident in the text.
[Process diagram, step 2 highlighted]
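The deck does not name the text mining tool used in the engagement, so purely as an illustration, concept discovery over free-text accident descriptions can be sketched as a unigram/bigram count; the toy records below are assumptions, not case-study data:

```python
# Illustrative sketch only: extract candidate "concepts" (words and word
# combinations) from free-text accident descriptions. The records below
# are invented; they are not data from the case study.
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "worker slipped on wet floor and lacerated left leg",
    "stress reaction after fall from ladder",
    "fracture of right wrist while lifting boxes",
]

# Unigrams and bigrams, with common stop words removed, give a crude concept list.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1, stop_words="english")
concept_matrix = vectorizer.fit_transform(descriptions)

print(vectorizer.get_feature_names_out())   # candidate concepts
print(concept_matrix.toarray())             # concept counts per claim
```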
Page 9
Text Mining Process.
Stage 1. Does textual information have predictive value?
Step 3. Reduce the number of concepts.
About 8000 concepts were discovered in the Discover concepts phase.
[Process diagram, step 3 highlighted]
Page 10
Text Mining Process.
Stage 1. Does textual information have predictive value?
Step 4. Select predictive concepts: assess predictability by using TreeNet to identify the most predictive concepts.
[Concept importance (top of chart): LEG 100, LACERATED 99.43, FRACTURE 92.56, STRESS 92.27]
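TreeNet itself is commercial software; as a rough stand-in, the same idea (rank concept indicators by their contribution to a boosted-tree model and rescale the top score to 100, as in the chart) can be sketched with scikit-learn's gradient boosting. The concept names, data and parameter values are assumptions:

```python
# Sketch: rank concept indicators by importance in a boosted-tree model,
# rescaled so the strongest concept scores 100. X_concepts and claim_cost
# stand in for the prepared data; they are invented here.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
concept_names = ["LEG", "LACERATED", "FRACTURE", "STRESS", "EYE"]
X_concepts = rng.integers(0, 2, size=(1000, len(concept_names)))
claim_cost = 500 * X_concepts[:, 1] + 800 * X_concepts[:, 2] + rng.gamma(2.0, 100, 1000)

model = GradientBoostingRegressor(max_leaf_nodes=6, n_estimators=300, learning_rate=0.05)
model.fit(X_concepts, claim_cost)

scores = 100 * model.feature_importances_ / model.feature_importances_.max()
for name, score in sorted(zip(concept_names, scores), key=lambda p: -p[1]):
    print(f"{name:<12}{score:6.2f}")
```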
Page 11
TreeNet Overview
A TreeNet model can be written as F(X) = F0 + β1·T1(X) + β2·T2(X) + ... + βM·TM(X),
where each Ti is a small tree. The first tree in the series contributes a relatively
large amount to the model, while subsequent trees contribute successively
smaller corrections. A model normally consists of 400 to 800 small
trees, each typically no larger than four to eight terminal nodes.
The final model is a collection of weighted and summed trees.
Page 12
TreeNet vs boosting
Page 13
TreeNet
The first tree in the series contributes a relatively large amount to the model,
while subsequent trees contribute successively smaller corrections.
A model normally consists of 400 to 800 small trees, each typically no larger
than four to eight terminal nodes.
The final model is a collection of weighted and summed trees.
Page 14
MART or TreeNet
In any predictive modelling situation:
Y – target or response variable
X – inputs or predictors
F(X) – values predicted by the model
Loss function L(Y, F) measures the error between Y and F(X). Typical choices of L(Y, F) are:
Squared error: L(Y, F) = (Y - F(X))²
Absolute error: L(Y, F) = |Y - F(X)|
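For concreteness, the two loss functions written as a trivial code sketch (averaged over observations):

```python
import numpy as np

def squared_error(y, f):
    """L(Y, F) = (Y - F(X))^2, averaged over observations."""
    return np.mean((y - f) ** 2)

def absolute_error(y, f):
    """L(Y, F) = |Y - F(X)|, averaged over observations."""
    return np.mean(np.abs(y - f))
```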
Page 15
TreeNet Optimization Strategy:
Page 16
MART or TreeNet
Page 17
MART or TreeNet
Page 18
MART or TreeNet
Page 19
MART algorithm for a given choice of loss function, tree size K and number of trees M
Page 20
MART Algorithm. Example for the least squares loss function (linear regression):
Initial guess: {F0(Xi)} = {mean(Yi)}
FOR m = 1 TO M
  Negative gradient gm is the vector of residuals: {Yi - Fm-1(Xi)} = {Residual_i}
  Fit a K-node regression tree to the current residuals. This partitions the observations into K mutually exclusive groups.
  For each node: hm(Xi) = within-node mean(Residual_i)
  Update: {Fm(Xi)} = {Fm-1(Xi)} + hm(Xi)
END FOR
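A minimal sketch of this least-squares loop, using scikit-learn's DecisionTreeRegressor for the K-node trees (a regression tree's leaf prediction is exactly the within-node mean of the residuals). The data and parameter values are invented for illustration; the shrinkage introduced on the next slide is omitted here:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_least_squares(X, y, M=200, K=6):
    """MART for squared-error loss: M trees with at most K terminal nodes."""
    f0 = y.mean()                      # initial guess F0(Xi) = mean(Yi)
    F = np.full(len(y), f0)
    trees = []
    for _ in range(M):
        residuals = y - F              # negative gradient = current residuals
        tree = DecisionTreeRegressor(max_leaf_nodes=K)   # K-node regression tree
        tree.fit(X, residuals)
        F += tree.predict(X)           # update: Fm = Fm-1 + hm (within-node means)
        trees.append(tree)
    return f0, trees

def mart_predict(f0, trees, X):
    return f0 + sum(tree.predict(X) for tree in trees)

# Invented toy data, purely to show the loop running.
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 3))
y = 2 * X[:, 0] + np.sin(5 * X[:, 1]) + rng.normal(scale=0.1, size=500)
f0, trees = mart_least_squares(X, y)
print(round(np.mean((y - mart_predict(f0, trees, X)) ** 2), 4))
```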
Page 21
TreeNet: further guard against overfitting
It turns out to be beneficial to slow down the learning rate by introducing a shrinkage parameter v, 0 < v < 1, into the update step:
{Fm(Xi)} = {Fm-1(Xi)} + v * hm(Xi)
Parameters v and M are connected: for the same level of accuracy, smaller v requires larger M. The best strategy appears to be to set v to less than 0.1 and choose M by early stopping (Friedman, 2001).
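The same recipe (v below 0.1, M chosen by early stopping on a validation set) can be sketched with scikit-learn's gradient boosting, where learning_rate plays the role of v; the data and parameter values are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(size=(2000, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.0]) + rng.normal(scale=0.2, size=2000)

# learning_rate is the shrinkage parameter v; a large n_estimators cap (M)
# is cut short by early stopping on a held-out validation fraction.
model = GradientBoostingRegressor(
    learning_rate=0.05,        # v < 0.1, as suggested by Friedman (2001)
    n_estimators=5000,         # upper bound on M
    max_leaf_nodes=6,          # small trees
    validation_fraction=0.2,
    n_iter_no_change=20,       # stop once the validation score stops improving
)
model.fit(X, y)
print("trees actually used (M):", model.n_estimators_)
```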
Page 22
MART or TreeNet
Accuracy of MART.
(Hastie, Tibshirani and Friedman, 2001):
Classification problem: spam vs. non-spam email
CART: 8.7% error rate
MARS: 5.5% error rate
MART: 4% error rate
Page 23
TreeNet advantages:
Page 24
TreeNet disadvantages:
Page 25
Text Mining Process.
Stage 1. Does textual information have predictive value?
Step 5. Derive domain relevant concepts: deriving additional features at point five.
This encompassed the grouping and combining of concepts, so that the most predictive concepts were combined with those similar in meaning (eg, stress and anxiety, laceration and abrasion) to increase frequencies.
[Concept importance chart: LEG 100, LACERATED 99.43, FRACTURE 92.56, STRESS 92.27, EYE 86.56, HERNIA 84.11, TRUCK 82.62, BURN 73.06, LADDER 58, ...]
[Process diagram, steps 4-10]
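A hypothetical sketch of this grouping step: map concepts that are close in meaning onto one derived feature, so that its frequency is the sum of its members' frequencies. The synonym groups below are invented for illustration, not the groupings used in the engagement:

```python
# Sketch: combine similar concepts into derived features to increase
# frequencies. The synonym groups here are illustrative assumptions only.
concept_groups = {
    "STRESS": ["STRESS", "ANXIETY"],
    "LACERATION": ["LACERATED", "ABRASION", "CUT"],
}

def derive_features(concept_counts):
    """concept_counts: dict of concept -> count for one claim narrative."""
    derived = {}
    for feature, members in concept_groups.items():
        derived[feature] = sum(concept_counts.get(m, 0) for m in members)
    return derived

print(derive_features({"STRESS": 1, "ANXIETY": 2, "CUT": 1}))
# {'STRESS': 3, 'LACERATION': 1}
```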
Page 26
Text Mining Process.
Stage 1. Does textual information have predictive value?
Step 6. Predictive modelling with concepts only.
Build a CART predictive model with the concepts identified by TreeNet and the derived concepts.
[Process diagram, step 6 highlighted]
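As an illustration, scikit-learn's DecisionTreeRegressor is a CART-style learner that can stand in for the CART software named on the slide; a model built on the selected and derived concept indicators might look like the sketch below, with invented data:

```python
# Sketch: a CART-style regression tree on concept indicators only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
concepts = ["LEG", "LACERATED", "FRACTURE", "STRESS", "LACERATION_GROUP"]
X = rng.integers(0, 2, size=(2000, len(concepts)))
claim_cost = 700 * X[:, 2] + 400 * X[:, 3] + rng.gamma(2.0, 150, 2000)

X_train, X_test, y_train, y_test = train_test_split(X, claim_cost, random_state=0)
cart = DecisionTreeRegressor(max_depth=4, min_samples_leaf=50)
cart.fit(X_train, y_train)
print("test R^2:", round(cart.score(X_test, y_test), 3))
```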
Page 27
Text Mining Process.
Stage 1. Does textual information have predictive value?
Step 7. Evaluate results, referring to gains charts and model precision.
The TreeNet model using concepts only was 75.7% precise on test data.
Concept name   Importance
LEG            100
LACERATED      99.43
FRACTURE       92.56
STRESS         92.27
EYE            86.56
HERNIA         84.11
TRUCK          82.62
BURN           73.06
LADDER         58
[Process diagram, steps 7-10]
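A gains chart of this kind ranks claims by predicted cost and shows what share of actual cost falls into each predicted band; a minimal sketch, with invented arrays standing in for the test-set actuals and predictions, is:

```python
# Sketch: cumulative gains by decile of predicted claim cost.
# y_test and y_pred stand in for the evaluation data; the arrays below
# are invented.
import numpy as np

rng = np.random.default_rng(4)
y_test = rng.gamma(2.0, 300, 1000)
y_pred = y_test * rng.normal(1.0, 0.4, 1000)    # imperfect predictions

order = np.argsort(-y_pred)                     # rank claims, highest predicted first
cum_share = np.cumsum(y_test[order]) / y_test.sum()
for decile in range(1, 11):
    idx = int(len(y_test) * decile / 10) - 1
    print(f"top {decile * 10:3d}% predicted -> {cum_share[idx]:.1%} of actual cost")
```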
Page 28
Text Mining Process.
Stage 2. Does textual information add value to existing injury codings?
Created models with demographic information.
[Process diagram: 7. Evaluate results → 8. Predictive modelling with features only → 9. Predictive modelling with concepts and features → 10. Compare results and conclude]
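Steps 8 to 10 amount to fitting one model on the existing injury codings and demographics alone, another with the text-derived concepts appended, and comparing them on held-out data. A hedged sketch of that comparison, using invented data and scikit-learn gradient boosting as a stand-in for TreeNet:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n = 3000
features = rng.normal(size=(n, 6))              # injury codes + demographics (invented)
concepts = rng.integers(0, 2, size=(n, 10))     # text-derived concept indicators (invented)
cost = features[:, 0] + 2 * concepts[:, 0] + rng.normal(scale=0.5, size=n)

X_features = features
X_both = np.hstack([features, concepts])
idx_train, idx_test = train_test_split(np.arange(n), random_state=0)

for name, X in [("features only", X_features), ("features + concepts", X_both)]:
    model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=500, max_leaf_nodes=6)
    model.fit(X[idx_train], cost[idx_train])
    print(name, "test R^2:", round(model.score(X[idx_test], cost[idx_test]), 3))
```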
Page 29
Textual information adds predictive power to the model:
Medical Benefits claim cost for sprains for the next 6 months (top 5%)
Page 30
© 2004 PricewaterhouseCoopers. All rights reserved. PricewaterhouseCoopers refers to the network of member firms of
PricewaterhouseCoopers International Limited, each of which is a separate and independent legal entity.