By Louise A. Francis
Francis Analytics and Actuarial Data Mining, Inc.
Objectives
Address the question: why use a new method, PRIDIT?
Introduce other methods used in similar circumstances
Explain how PRIDIT adds to the methods available
Explain limitations of PRIDIT/RIDIT
Need sample of data where claims have been determined to be fraudulent or legitimate
Unsupervised learning
Another approach that does not require a dependent variable
Two Key Kinds
Cluster Analysis
Principal Components/Factor Analysis
PRIDIT uses the principal components approach
It is applied to ordered categorical variables
Cluster Analysis
Records are grouped in categories that have similar values on the variables
Examples
Marketing: People with similar values on demographic variables (e.g., age, gender, income) may be grouped together for marketing
Text analysis: Use words that tend to occur together to classify documents
Clustering
Common methods: k-means, hierarchical
No dependent variable: records are grouped into classes with similar values on the variables
Start with a measure of similarity or dissimilarity
Maximize dissimilarity between members of different clusters
Manhattan Distance
$$d_{ij} = \sum_{k=1}^{m} |x_{ik} - x_{jk}|$$
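The Manhattan distance can be illustrated with a few lines of Python. This is only a sketch: the function name and the two example claims are hypothetical and do not come from the presentation's data.

```python
def manhattan_distance(x_i, x_j):
    """Sum of absolute differences across the m variables of two records."""
    if len(x_i) != len(x_j):
        raise ValueError("records must have the same number of variables")
    return sum(abs(a - b) for a, b in zip(x_i, x_j))


# Two made-up claims, each described by three numeric variables.
claim_a = [1, 0, 3]
claim_b = [0, 0, 5]
distance = manhattan_distance(claim_a, claim_b)
print(distance)  # |1-0| + |0-0| + |3-5| = 3
```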
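Binary Variables
Simple Matching
$$d_{ij} = \frac{b + c}{a + b + c + d}$$
Here a, b, c and d are the cell counts of the 2x2 cross-tabulation (row variable by column variable) of the two records: a and d count the variables on which the records agree, b and c count the variables on which they disagree.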
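For binary variables, the simple matching dissimilarity is the share of variables on which two records disagree. A minimal sketch, assuming each record is a list of 0/1 indicators; the function name and the example records are hypothetical.

```python
def simple_matching_dissimilarity(rec_i, rec_j):
    """Fraction of binary variables on which the two records disagree."""
    if len(rec_i) != len(rec_j) or not rec_i:
        raise ValueError("records must be non-empty and of equal length")
    # Mismatches correspond to the off-diagonal cells (b + c) of the 2x2 table.
    mismatches = sum(1 for a, b in zip(rec_i, rec_j) if a != b)
    return mismatches / len(rec_i)


# Two made-up claims with four yes/no indicators coded 1/0.
claim_a = [1, 0, 1, 1]
claim_b = [1, 1, 0, 1]
dissimilarity = simple_matching_dissimilarity(claim_a, claim_b)
print(dissimilarity)  # 2 mismatches out of 4 variables = 0.5
```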
Percentage Yes applies to the first five columns.
Cluster  Police Report  Medical Audit  At Fault  Legal Rep  SIU Investigation  Number Providers
1        46.7%          0.1%           42.2%     6.1%       0.0%               2
2        49.8%          5.9%           2.4%      96.0%      6.5%               4
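The grouping step behind k-means can be sketched as follows. This is an illustrative version of Lloyd's algorithm using squared Euclidean distance and caller-supplied starting centroids so the result is reproducible. It is not the procedure that produced the clusters shown above, and the points are made up.

```python
def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each record joins its nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])),
            )
            for pt in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Six made-up records on two variables, forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

In practice the starting centroids are usually chosen at random and the algorithm is run several times; fixing them here only keeps the example deterministic.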
Principal Components
These variables are correlated but not perfectly correlated
We replace many variables with a weighted sum of the variables
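One way to see the "weighted sum" idea is to compute the first principal component directly. The sketch below standardizes the variables, builds their correlation matrix, and finds the leading eigenvector by power iteration using only the standard library. The function name and data are hypothetical, and the sketch assumes every variable has non-zero variance and that the leading eigenvalue is unique.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return (weights, scores) for the first principal component."""
    n, m = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [
        math.sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
        for c, mu in zip(cols, means)
    ]
    # Standardize each variable to mean 0 and standard deviation 1.
    z = [[(row[k] - means[k]) / sds[k] for k in range(m)] for row in rows]
    # Correlation matrix of the standardized variables.
    corr = [
        [sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    # Power iteration: repeated multiplication converges to the leading eigenvector.
    w = [1.0] * m
    for _ in range(n_iter):
        w = [sum(corr[a][b] * w[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Each record's score is the weighted sum of its standardized variables.
    scores = [sum(wk * zk for wk, zk in zip(w, zr)) for zr in z]
    return w, scores


# Five made-up records on three positively correlated variables.
rows = [[1, 2, 1], [2, 1, 2], [3, 4, 3], [4, 3, 5], [5, 5, 4]]
weights, scores = first_principal_component(rows)
print(weights)
print(scores)
```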
RIDIT
Variables are ordered so that the lowest value is associated with the highest probability of fraud
Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for claim t, value i
$$R_{ti} = \sum_{j<i} p_{tj} - \sum_{j>i} p_{tj}$$
where $p_{tj}$ is the proportion of claims at value $j$.
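The RIDIT statistic for one ordered categorical variable can be sketched as below, assuming the usual form in which a category's score is the proportion of claims in lower categories minus the proportion in higher categories. The function name and the Yes/No responses are hypothetical; categories are listed with the value most associated with fraud first.

```python
from collections import Counter


def ridit_scores(values, ordered_categories):
    """Map each category to (proportion below) minus (proportion above)."""
    counts = Counter(values)
    n = len(values)
    scores = {}
    below = 0.0  # cumulative proportion of claims in lower categories
    for cat in ordered_categories:
        p = counts[cat] / n
        above = 1.0 - below - p
        scores[cat] = below - above
        below += p
    return scores


# Made-up indicator: one claim in four answers "Yes".
responses = ["Yes", "No", "No", "No"]
scores = ridit_scores(responses, ordered_categories=["Yes", "No"])
print(scores)  # "Yes" scores -0.75 and "No" scores 0.25
```

With this construction the scores average to zero across claims, and a rare "Yes" receives a score far from zero, so unusual responses carry more weight.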
PRIDIT
Use RIDIT statistics in Principal Components Analysis
Component Matrix (a)
Variable        Component 1
SIU             .248
Police Report   .220
At Fault        .709
Legal Rep       .752
Medical Audit   .341
Prior Claim     .406
Extraction Method: Principal Component Analysis.
a. 1 component extracted.
Scoring
Assign a score to each claim
The score can be used to sort claims
More effort expended on claims more likely to be fraudulent or abusive
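Scoring and sorting can be sketched as a weighted sum of each claim's RIDIT values. The weights below are the Component 1 loadings as read from the component matrix in this presentation. Using the raw loadings as weights is a simplifying assumption, and the three claims and their RIDIT values are made up. Under the ordering convention stated for RIDIT, a more negative score is assumed here to indicate greater suspicion.

```python
# Component 1 loadings as read from the component matrix in this presentation.
loadings = {
    "SIU": 0.248,
    "Police Report": 0.220,
    "At Fault": 0.709,
    "Legal Rep": 0.752,
    "Medical Audit": 0.341,
    "Prior Claim": 0.406,
}


def pridit_score(ridit_values, weights):
    """Weighted sum of a claim's RIDIT values (weights are an assumption)."""
    return sum(weights[name] * ridit_values[name] for name in weights)


# Hypothetical RIDIT values for three made-up claims.
claims = {
    "A": {name: -0.5 for name in loadings},
    "B": {name: 0.2 for name in loadings},
    "C": {
        "SIU": 0.1, "Police Report": -0.3, "At Fault": -0.6,
        "Legal Rep": 0.3, "Medical Audit": 0.1, "Prior Claim": 0.2,
    },
}

# Sort ascending so the claims assumed most suspicious come first.
ranked = sorted(claims, key=lambda cid: pridit_score(claims[cid], loadings))
print(ranked)  # ['A', 'C', 'B']
```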
In the case of AIB data, we can use additional information to test how well PRIDIT did, using the PRIDIT score
A suspicion score was assigned to each claim by an expert
[Figure: PRIDIT Score. One axis runs from (0.50) to 0.50; the other runs from 0.00 to 10.00.]
Result
There appears to be a strong relationship between the PRIDIT score and the suspicion that a claim is fraudulent or abusive
The clusters resulting from the cluster procedure also appeared to be effective in separating legitimate from fraudulent or abusive claims
Comparison
Unordered categorical variables with many values (e.g., injury type):
Clustering has a procedure for measuring dissimilarity for these variables and can use them in clustering
If the values of a variable have no meaningful order, PRIDIT will not help in creating variables to use in Principal Components Analysis