By Xiaogang Dong
Stanford ID # 05638478
Abstract
The main target of this project is to detect anomalous online credit card transactions. I try six
different classifiers: Naïve Bayes, Decision Tree, K-Nearest Neighbor, Support Vector
Machine, Random Forest, and AdaBoosting with Decision Tree. Random Forest performs
best in my experiments. Random Forest is then fine-tuned to check whether a better
performance can be achieved.
1. Introduction
The goal of this project is to identify anomalous transactions from a large number of online credit
card transactions. A training set of 19 attributes and 94682 observations is provided. In a
separate file, the class labels are given: 1 for anomalous and 0 for normal. In addition, a
test data set with 36019 observations is provided for the final evaluation of the project.
2. Data Observation
Both the training and test data files have been examined and no missing data are found. About
2.2% of the transactions in the training set are anomalous. Each transaction has a total of 19
attributes. amount may be the amount of the transaction; its distribution
is included in the following figures. Note that total is highly correlated with amount: the
correlation coefficient is about 0.9994217. Either hour1 or hour2 may be the transaction
time, and these two have a strong correlation of 0.994708. The histogram of hour1 is plotted
below. state1 is likely the shipping (or billing) state. zip1 is the corresponding zip code, but
with the last two digits masked. domain1 is likely the web domain from which the transaction
was made. The exact meanings of the remaining variables are hard to determine. There are some
binary variables: field2, flag1, flag2, flag3, flag4, indicator1, and indicator2. field1
has four levels, field4 has 38 levels, field5 has 26 levels, and flag5 has 36 levels. field3
has so many levels that it can be treated as a continuous variable.
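As an illustrative sketch (with synthetic stand-ins for the real columns), the near-perfect correlation between total and amount can be checked in base R:

```r
# Synthetic stand-ins: `total` is built to be almost identical to `amount`,
# mimicking the ~0.9994 correlation observed in the real data
set.seed(2)
amount <- rexp(1000, 1/100)
total  <- amount + rnorm(1000, sd = 0.5)

# Pearson correlation coefficient
r <- cor(amount, total)
```

A correlation this close to 1 suggests one of the two columns carries almost no extra information.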
Ideally, we should check the correlations between variables to determine which variables to
use in the classifications. Pearson's chi-square test can be used to check the independence
of categorical variables. Given the limited time, these steps are skipped here. However, I do
trim some variables when I fine-tune the Random Forest classifier.
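The skipped independence check could look like the following base R sketch; the two binary columns here are synthetic stand-ins, not the real attributes:

```r
# Hypothetical example: test independence of two binary attributes.
# In the real data these would be columns of the training data frame.
set.seed(1)
flag1      <- factor(sample(c(0, 1), 1000, replace = TRUE))
indicator1 <- factor(sample(c(0, 1), 1000, replace = TRUE))

# Pearson's chi-square test of independence on the contingency table;
# a large p-value is consistent with the two variables being independent
test <- chisq.test(table(flag1, indicator1))
p    <- test$p.value
```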
STATS202 Final Report
3. Solution Evaluations
I try six different classifiers in this project. To evaluate them, the training
data are divided into five folds and cross validation is performed. The details are as follows.
First, the training data are split into two groups: observations with positive labels (label 1)
and observations with negative labels (label 0). The positive observations are randomly
divided into five folds of approximately equal size, and so are the negative ones. The first
fold of positive data and the first fold of negative data are grouped together and
named Group 1; the rest of the data similarly forms Groups 2 to 5. Group 1 is then taken as
the test data and the remaining four groups as the training data. After testing, each transaction in
Group 1 is assigned a probability of being positive. The transactions in the top 20% by this
probability are taken out, and the percentage of true positives among them is
calculated from the ground-truth labels. The same procedure is repeated for Groups 2 to
5. The average percentage of true positives among the top 20% of transactions is used as
the final metric for performance evaluation.
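The stratified split and the top-20% metric described above can be sketched in base R; the labels and probabilities below are synthetic stand-ins for the real data and classifier output:

```r
# Synthetic labels with roughly the report's 2.2% positive rate
set.seed(42)
labels <- c(rep(1, 50), rep(0, 2200))

# Stratified 5-fold assignment: positives and negatives are
# divided into folds separately, then folds are combined by index
fold_of <- integer(length(labels))
fold_of[labels == 1] <- sample(rep(1:5, length.out = sum(labels == 1)))
fold_of[labels == 0] <- sample(rep(1:5, length.out = sum(labels == 0)))

# Evaluation metric: fraction of true positives among the top 20%
# of transactions ranked by predicted probability
top20_precision <- function(prob, truth) {
  k   <- ceiling(0.2 * length(prob))
  top <- order(prob, decreasing = TRUE)[1:k]
  mean(truth[top] == 1)
}

prob  <- runif(length(labels))   # stand-in for classifier probabilities
score <- top20_precision(prob[fold_of == 1], labels[fold_of == 1])
```

In the real procedure, `score` would be computed for each of the five groups in turn and averaged.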
Classifier               Setting                     Percentage (%)
Naïve Bayes              Without Laplace smoothing    6.991599
                         With Laplace smoothing       7.757317
Decision Tree            All data                     4.789717
                         Without domain1              4.610169
K-Nearest Neighbor       K=1                          4.995667
                         K=3                          1.753231
                         K=5                          1.446945
Support Vector Machine   Radial kernel                6.379245
                         Linear kernel                3.216024
Random Forest                                        10.176170
AdaBoosting                                           6.199699

a)
Naïve Bayes classifier is probably the most straightforward one, and it can deal with both
numerical and categorical attributes. In this project, I use the function naiveBayes()
included in the package e1071. An important aspect of the Naïve Bayes classifier is Laplace
smoothing, which helps to deal with unseen and rarely occurring values. In the above
table, results are listed for the case without Laplace smoothing and the case with
Laplace smoothing. In conclusion, Laplace smoothing significantly improves the
classification result, by about 0.77%.
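The effect of Laplace smoothing can be illustrated with made-up counts in base R (in e1071 the smoothing constant is passed via the `laplace` argument of naiveBayes()):

```r
# Hypothetical counts of a categorical attribute among positive
# transactions; level "baz" was never observed in this class
counts <- c(foo = 120, bar = 30, baz = 0)
n      <- sum(counts)
k      <- length(counts)

# Without smoothing, an unseen level gets probability 0, which
# zeroes out the whole product of conditional probabilities
p_raw <- counts / n

# With Laplace smoothing (alpha = 1), every level gets a small
# nonzero probability: (count + alpha) / (n + alpha * k)
alpha    <- 1
p_smooth <- (counts + alpha) / (n + alpha * k)
```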
b)
The Decision Tree classifier generally performs less effectively. I still include its results
here so that they can serve as a benchmark for the ensemble methods,
Random Forest and AdaBoosting. One advantage of Decision Tree is that it can handle both
numerical and categorical variables. I use rpart() from the package rpart to perform the
classification. The default setting of rpart() gives a percentage of 4.79%. By examining the
generated trees, I find that they use the variable domain1 heavily. To investigate
further, I exclude domain1 from the data, which gives slightly worse results:
about 4.61%.
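A minimal sketch of the rpart() fit, on synthetic stand-in data rather than the real training set (the column names only mimic the report's attributes):

```r
library(rpart)  # rpart ships with standard R installations

# Synthetic stand-in data with roughly the report's 2.2% positive rate
set.seed(7)
d <- data.frame(
  amount = rexp(500, 1/100),
  hour1  = sample(0:23, 500, replace = TRUE),
  label  = factor(c(rep(1, 11), rep(0, 489)))
)

# method = "class" requests a classification tree; predict() then
# returns per-class probabilities, here the probability of label "1"
fit  <- rpart(label ~ amount + hour1, data = d, method = "class")
prob <- predict(fit, d)[, "1"]
```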
c)
K-Nearest Neighbor is another classifier popular in practice. I use knn() from the
package class to implement it. The key issues here are defining the distance
and scaling the variables. Defining appropriate distances for the categorical variables
could potentially improve classification performance; for simplicity, however, both
categorical variables state1 and domain1 are discarded. Strictly speaking, zip1 is also a
categorical variable, but it is treated as numerical here since these numbers
roughly describe the distances between different areas. The variables are
standardized, i.e. each variable has zero mean and unit variance after scaling. The
percentage of true positives is about 5.00% when k=1. For any true positive, most of its
nearest neighbors are likely negative, since true positives make up only about 2.2% of the
data. This is probably why the K-Nearest Neighbor classifier doesn't work well here,
and it is further evidenced by the dramatic performance drops at k=3 and k=5.
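The standardization step can be sketched with base R's scale(); the two columns below are synthetic stand-ins for the report's numeric attributes:

```r
# Synthetic numeric attributes on very different scales
set.seed(3)
X <- cbind(amount = rexp(100, 1/100), hour1 = runif(100, 0, 24))

# scale() subtracts each column mean and divides by each column sd,
# so distances are no longer dominated by the large-magnitude column
Xs <- scale(X)

col_means <- colMeans(Xs)
col_sds   <- apply(Xs, 2, sd)
```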
d)
The Support Vector Machine classifier is also very popular. It faces the same issues as
K-Nearest Neighbor, i.e. defining the distance and scaling the variables. For exactly the same
reasons as with K-Nearest Neighbor, state1 and domain1 are discarded and zip1 is treated as a
numerical attribute. svm() from the package e1071 is used for the actual classification, and it
handles variable scaling internally. Two kernels, radial and linear, are tried
in the classifications. The radial kernel gives the better performance, a percentage of about 6.38%.
e)
Random Forest is a very useful ensemble method built on the Decision Tree classifier.
I choose randomForest() from the package randomForest for the actual classifications.
Since it is based on Decision Tree, Random Forest is supposed to work with both categorical
and numerical variables. However, randomForest() can only handle categorical variables
with at most 32 levels, so state1 and
domain1 are both discarded, and zip1 is again treated as a numerical variable.
An alternative way around this restriction is to group the less frequent levels into
one, so that the total number of levels stays within 32. This may improve the performance,
but I don't try it due to the limited time. To work around a memory issue caused by
randomForest(), all decision trees generated during the classification process are discarded
by setting keep.forest=FALSE. The Random Forest classifier obtains the best result among
all classifiers, at a percentage of about 10.18%.
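The level-grouping workaround mentioned above can be sketched in base R; the helper name and the synthetic state1 column are illustrative, not from the project code:

```r
# Collapse a high-cardinality factor so randomForest's 32-level limit
# is satisfied: keep the most frequent levels, pool the rest as "other"
collapse_levels <- function(x, max_levels = 32) {
  x    <- as.factor(x)
  keep <- head(names(sort(table(x), decreasing = TRUE)), max_levels - 1)
  factor(ifelse(x %in% keep, as.character(x), "other"))
}

# Synthetic stand-in for a column like state1 with more than 32 values
set.seed(5)
state1 <- sample(paste0("S", 1:50), 2000, replace = TRUE)
g <- collapse_levels(state1)
```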
f)
AdaBoosting is another popular ensemble method. Here Decision Tree is used as the weak
classifier. I use the R code from the lecture notes for the implementation. An interesting problem
is how to evaluate the classification error during the process of re-adjusting weights. The method
used in the lecture notes (label as positive when prob > 0.5) doesn't work well here. Since
the total number of true positives in the training set is known when performing cross validation,
I instead flag that same number of transactions, those with the highest probabilities of being
positive, as positive. The classification error then counts both false positives and false negatives.
There may be better ways to evaluate the classification error; I don't explore them further due to
the limited time.
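The count-based flagging rule described above can be sketched in base R, with synthetic probabilities and labels standing in for the boosting output:

```r
# Synthetic classifier probabilities and ground-truth labels
set.seed(9)
prob  <- runif(1000)
truth <- sample(c(0, 1), 1000, replace = TRUE, prob = c(0.978, 0.022))

# Instead of thresholding at prob > 0.5, flag exactly n_pos
# transactions (the known number of positives) as positive
n_pos <- sum(truth == 1)
pred  <- integer(1000)
pred[order(prob, decreasing = TRUE)[seq_len(n_pos)]] <- 1

# The resulting error counts both false positives and false negatives
err <- mean(pred != truth)
```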
4. Fine Tuning
After identifying Random Forest as the best classifier, I tweak both its
parameters and the input data to see if the performance can be improved.
Varying the number of trees:

# of Trees   Percentage (%)
100          10.03359
300          10.14977
500          10.17617
700          10.21314

Dropping one variable at a time (None drops nothing):

Variable Dropped   Percentage (%)
None               10.17617
amount             10.19201
hour1              10.19202
zip1               10.03887
field1              9.93853
field2             10.17617
hour2              10.17617
flag1              10.15505
total              10.20258
field3              9.83820
field4             10.04943
field5             10.18145
indicator1         10.16033
indicator2         10.19729
flag2              10.17617
flag3              10.10224
flag4              10.20258
flag5              10.21842