
Technical Solution Document

For
Course# DS-610
Big Data Analytics

Version Number: 0.0


Version Date: May 9, 2016


Document History

Document Details:

Service Request #:
Project Title:      Credit Card Fraud Detection Using Data Analytics
Professor:          Gerardo Menegaz
Author:             Aakash Parwani
Date Prepared:      04/17/2016
Completion Date:    05/08/2016

Revision History

Revision Number (#)    Revision Date (-)    Summary of Changes (Describe change)    Changes Marked (N)

Peer Review
This document requires the following approvals.

Name               Title
Sumit Sameriya     Student

Distribution
This document has been distributed to:

Name                     Title
Prof. Gerardo Menegaz    Professor

Contents

1. Introduction
   1.1 Credit Card Fraud Detection Using Data Analytics Overview
   1.2 Scope of Credit Card Fraud Detection with Big Data
2. Architecture Overview
   2.1 IT System Level
       2.1.1 Diagram
       2.1.2 Key Concepts
       2.1.3 Key Components
3. Architectural Decisions
   3.1 Transaction Processing
   3.2 Disaster Recovery
4. Statistical Model and Method
   4.1 Technology 1
       4.1.1 Statistical Method
       4.1.2 Analysis Attributes
5. Solution Components
   5.1.1 Hardware
   5.1.2 Software
   5.1.3 Network
   5.1.4 Monitoring
   5.1.5 Backup and Recovery
   5.1.6 Disaster Recovery
6. Viability Assessment
   6.1 Functional Requirements
   6.2 Non-Functional Requirements
   6.3 Risks
   6.4 Assumptions
   6.5 Dependencies

1. Introduction
1.1 Credit Card Fraud Detection Using Data Analytics Overview
With the advancement of communication technology, e-commerce and online payment transactions are
increasing day by day. Along with this growth, financial fraud associated with these transactions is also
intensifying, resulting in losses of billions of dollars every year worldwide. Among the various types of
financial fraud, credit card fraud is the oldest, most common, and most dangerous, owing to the widespread
use of credit cards and the convenience they offer customers. Benefits such as cash back, reward points,
interest-free credit, and discounts on purchases made at selected stores further tempt customers to use a
credit card instead of cash. In 2013, 40% of total financial fraud was related to credit cards, and worldwide
losses due to credit card fraud amounted to $5.55 billion. Fraudsters gain access to credit card information
in many ways. According to a recent CBC News report
(http:\\www.huffingtonpost.ca/2013/04/24/smartphones-steal-creditcard-%20data_n_3148170.html),
smartphones can easily be used to skim credit card data with a free Google application. With compromised
credit cards and data breaches dominating the headlines in the past couple of years, data breaches totaled
1,540 worldwide in 2014 -- up 46 percent from the year before -- and led to the compromise of more than
one billion data records. Twelve percent of breaches occurred in the financial services sector; 11 percent
happened in the retail sector. Malicious outsiders were the culprits in 55 percent of data breaches, while
malicious insiders accounted for 15 percent.

Figure 1: Breach Report.

U.S. credit card fraud is on the rise. About 31.8 million U.S. consumers had their credit cards
breached in 2014, more than three times the number affected in 2013. That fraud isn't cheap. Nearly
90 percent of card breach victims in 2014 received replacement credit cards, costing issuers as much
as $12.75 per card (http://www.creditcards.com/credit-card-news/credit-card-security-id-theft-fraudstatistics-1276.php#ixzz467E1FXcl ).
To accurately classify a credit card transaction as fraudulent or legitimate, I propose a fraud miner. Using
the frequent itemset mining technique (discussed in detail in Section 2.1.2), a legitimate transaction pattern
and a fraud transaction pattern are created for each customer from their previous transactions in the
database during the training phase. During the testing phase, a matching algorithm determines which
pattern the incoming transaction matches more closely. If the incoming transaction matches the legitimate
pattern, the algorithm returns 0 (legitimate transaction); if it matches the fraud pattern, the algorithm
returns 1 (fraud transaction). The transaction record is also maintained in the transaction database so that
the algorithm can handle future transactions properly. The model is shown in Figure 2 below.

Figure 2: Fraud transaction detection application model.

1.2 Scope of Credit Card Fraud Detection with Big Data


Detecting fraudulent credit card transactions using predictive analytics on big transaction data is intended to
stop the erosion of the public's trust in the payment platform and to help the world's largest payment
networks, such as VISA, MASTERCARD, and AMERICAN EXPRESS, identify transactions that indicate
fraudulent activity. The latest VISA statistics reveal that VISA and MASTERCARD together receive an
astronomical 62 billion transactions per year. Applying data visualization techniques to this big data with
software such as TABLEAU can help organizations understand key facts, for example which country or region
is responsible for the most fraudulent transactions, or which payment network has the highest number of
transaction security breaches.

Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in
data that may be used to make valid predictions. The six basic steps of the data mining process are: defining
the problem, preparing data, exploring data, building models, exploring and validating models, and deploying
and updating models. The proposed algorithm and data analysis tools will enable auditors and fraud
examiners to analyze an organization's business data, to gain insight into how well internal controls are
operating, and to identify transactions that indicate a heightened risk of fraud. Data analysis can be applied
anywhere in the organization where electronic transactions are recorded and stored. In the future, the
algorithm can also be improved from captured transactions, since it is based on a supervised machine
learning technique; as the variety of data increases, the robustness of the algorithm will increase.

2. Architecture Overview
2.1 IT System Level
The application will reside in Amazon Web Services (AWS). The implemented architecture consists of six
subsystems: Front End Interface, Network Interface, Application/Web Server, Database Interface, Credit
Card Fraud Detection Engine, and Business Intelligence Tool.

Front End Interface: Supporting device (POS, Point of Sale). Input to the application is transferred over this
interface in binary format when the user swipes a credit card.

Network Interface: Supporting devices (routers, switches). It is responsible for routing transaction
information to the issuing bank, acquiring bank, application server, business intelligence tool, and database
interface.

Application/Web Server: Links all components and handles communication between them.

Database Interface: The database interface subsystem is the entry point through which transactions are
read into the system.

BI Tool: A set of techniques and tools for transforming raw data into meaningful and useful information for
business intelligence purposes by management.

2.1.1 Diagram
2.1.1.1 Logical Design

Figure 3: Logical Architecture

2.1.1.2 Physical Design

Figure 4: Physical Architecture

Front End Interface: This component is at the client end; it could take the form of a POS (Point of Sale)
terminal and allows the user to swipe a card. It reads the client information from the chip or magnetic stripe
and routes the transaction information, in binary format, to the issuing and acquiring banks for further
processing.

Network Interface: After the information is captured at the POS, the merchant securely transfers the order
information to the appropriate payment gateway. The payment gateway receives the order information and
routes the transaction to the processor. The processor immediately submits the request to the credit card
interchange. The transaction is then routed to the issuing bank (the purchaser's bank) to request transaction
authorization.

Application/Web Server: The fraud detection application will be developed using Java/Python and scheduled
on this web server, so the application server handles communication with the database and other
components to fetch data and perform analysis.

Transaction Database, DB Server: MySQL (administered through MySQL Workbench) will be used to maintain
the customer transaction database, the legitimate pattern database, and the fraud pattern database, and it
will reside on the database server. The fraud application and BI tools will communicate with the database
through this database server.

Credit Card Fraud Detection Engine: In the credit card fraud detection subsystem, each transaction entering
the system is passed to the host server, where the corresponding transaction profile is checked against the
transaction business rules.

Business Intelligence (BI): BI is the set of techniques and tools for transforming raw data into meaningful
and useful information for business intelligence purposes. BI technologies are capable of handling large
amounts of unstructured data to help identify, develop, and otherwise create new strategic business
opportunities. The goal of BI is to allow easy interpretation of these large volumes of data. Identifying new
opportunities and implementing an effective strategy based on insights can provide businesses with a
competitive market advantage and long-term stability. BI technologies provide historical, current, and
predictive views of business operations; a common function of business intelligence technologies is reporting.

2.1.2 Key Concepts


The key concept is how the algorithm learns to recognize fraudulent transactions by mining transaction
patterns. For predicting fraudulent transactions, I am using the frequent itemset mining technique. Frequent
itemsets are sets of items that occur together in at least as many transactions as a user-defined minimum
support. The support of an itemset Y is defined as the fraction of records in database D that contain Y as a
subset:

    Support(Y) = Count(Y) / |D|

For example, if database D contains 1,000 transactions and the itemset Y appears in 700 records, then the
support of Y is 700/1000 = 0.7, i.e., 70% of transactions support itemset Y.
In our credit card transaction data scenario, a customer's legitimate pattern is the set of attribute values
specific to that customer's legitimate transactions, which reflects the customer's behaviour. However,
fraudsters always try to breach a customer's account while following the same genuine transaction pattern.
To handle this case, we construct two patterns for each customer: a legitimate pattern and a fraud pattern.
The largest group containing the maximum number of attributes is selected as that customer's legitimate
pattern. The training algorithm is given below.
1) Group each customer's transactions from the complete dataset.
2) Separate each customer's legitimate and fraud transactions.
3) Apply the Apriori algorithm to the customer's set of legitimate transactions. The algorithm returns a set
   of frequent itemsets. Take the largest frequent itemset as that customer's legitimate pattern and store it
   in the legitimate pattern database.
4) Apply the Apriori algorithm to the customer's fraud transactions. Take the largest frequent itemset as
   that customer's fraud pattern and store it in the fraud pattern database.
5) Match the attribute values of an incoming transaction against the legitimate and fraud patterns created
   above and mark the transaction accordingly (see the sketch below).
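
A minimal Python sketch of this training and matching flow is given below. It assumes the mlxtend library for the Apriori step and represents each transaction as a set of "attribute=value" items; the attribute values and the 0.5 minimum support are illustrative placeholders, not values fixed by this design.

# Sketch only: assumes mlxtend is installed (pip install mlxtend) and that each
# transaction has already been reduced to a set of "attribute=value" items.
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd

def largest_frequent_itemset(transactions, min_support=0.5):
    """Return the largest frequent itemset mined from one customer's transactions."""
    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                          columns=te.columns_)
    freq = apriori(onehot, min_support=min_support, use_colnames=True)
    if freq.empty:
        return frozenset()
    # Steps 3 and 4 above: keep the itemset containing the most attributes.
    return max(freq["itemsets"], key=len)

def classify(incoming, legit_pattern, fraud_pattern):
    """Step 5: mark the incoming transaction 0 (legitimate) or 1 (fraud)."""
    legit_overlap = len(incoming & legit_pattern)
    fraud_overlap = len(incoming & fraud_pattern)
    return 1 if fraud_overlap > legit_overlap else 0

# Illustrative data for one customer (hypothetical attribute values).
legit_txns = [{"country=US", "state=NJ", "pos=123", "hour=morning"},
              {"country=US", "state=NJ", "pos=123", "hour=evening"}]
fraud_txns = [{"country=RU", "pos=999", "hour=night"}]

legit_pattern = largest_frequent_itemset(legit_txns)
fraud_pattern = largest_frequent_itemset(fraud_txns)
incoming = {"country=US", "state=NJ", "pos=123", "hour=night"}
print(classify(incoming, legit_pattern, fraud_pattern))  # expected: 0 (legitimate)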

Separation of functions

Business Intelligence tools will be used to interact with the customer transaction database to obtain statistics
on fraudulent transactions.
The Web/Application Server accepts and makes calls from the Business Intelligence tools and the fraud
detection model, and fetches data from the database server.
The Database Server stores all data about customer credit card transactions.
This is a three-tier architectural model: Presentation Tier, Logic Tier, and Data Tier.
No on-premises hardware will be deployed.
There is no legacy system to be considered.

Statistical Methods
Statistical model: see Section 4.1.1
Statistical tools to be employed: see Section 4.1

2.1.3 Key Components


The application will use a 3-tiered approach. The technologies are mapped as follows:

2.1.3.1 Tier 1 Technologies

TABLEAU Desktop

2.1.3.2 Tier 2 Technologies

PYTHON

JAVA

2.1.3.3 Tier 3 Technologies

MYSQL

HADOOP

PIG

HIVE

3. Architectural Decisions

Ref     Decision
AD1     Deploy the application on a cloud server to improve response time and maximize scalability.
AD2     Use RAID level 10 to handle disaster recovery.

3.1 Transaction Processing

Subject Area:                Transaction processing
Topic:                       Customer transaction processing
Id.:                         AD1
Design Decision:             Upload the complete fraud detection application to a cloud server to improve transaction processing and response time. Storing all customer transaction data files on remote servers will help maintain data security and quick accessibility for visualization purposes.
Issue or Problem Statement:  Credit card transaction processing is a giant industry; each payment network receives billions of transaction requests per second. Processing these requests while applying analytics to distinguish legitimate from fraudulent transactions is a challenging task that requires a trusted application environment.
Assumptions:                 Access to the application and the transaction database is required 24 hours a day, 7 days a week, with minimal disruption caused by any downtime of "legacy" systems. Response time needs to be reasonable (that is, less than 3 seconds) for all users wherever they are located.
Motivation:                  No time lag in analysing incoming transactions as fraudulent or legitimate.
Alternatives:                Option 1 - Deploy the application and transaction database on cloud servers distributed throughout the network to help improve response time and maximize scalability.
Decision:                    Option 1
Justification:               This is a viable option given the highly distributed and trusted nature of this application.
Implications:                Requires cloud server technology to be identified and procured.
Derived Requirements:        None
Related Decisions:           None

3.2 Disaster Recovery

Subject Area:                Disaster recovery
Id.:                         AD2
Design Decision:             Use the RAID level 10 disaster recovery mechanism.
Issue or Problem Statement:  The transaction database is the most important component of the fraud detection application, so if the disk storing this data goes down, the complete application will fail. A solid mechanism must be in place to recover the application from such a disaster.
Assumptions:                 Access to the application and the transaction database is required 24 hours a day, 7 days a week, with minimal disruption caused by any downtime of "legacy" systems.
Motivation:                  Creating highly reliable data stores.
Alternatives:                Option 1 - Use the RAID level 10 disaster recovery mechanism.
Decision:                    Option 1
Justification:               With RAID 10, a replica of the data is created on an additional disk, which takes over if the primary disk goes down. This option is viable for the designed application.
Implications:                Requires additional disks for replicating the data from the primary disk.
Derived Requirements:        None
Related Decisions:           None

4. Statistical Model and Method

4.1 Technology 1

Presentation Layer - Front End Interface:  TABLEAU Desktop - to present interactive reports about customer
transactions on a dashboard, which helps in clearly understanding fraudulent and legitimate transactions.

Web/Application Server:  PYTHON & JAVA - for programming the fraud transaction detection application.

Database Interface:  MYSQL - for storing customer transactions, from which the application will fetch data
and perform analysis.
                     HADOOP, PIG, and HIVE - for processing large data sets in a distributed computing
environment.

4.1.1 Statistical Method


I will use the classification statistical method to predict whether an incoming transaction is legitimate or
fraudulent. Within classification, I will specifically use the KNN (K-Nearest Neighbour) and Random Forest
techniques.
Classification: In machine learning and statistics, classification is the problem of identifying to which of a set
of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing
observations (or instances) whose category membership is known. In our case, this means assigning a given
transaction to the Fraud or Legitimate class by observing its characteristics. Classification is an example of
pattern recognition.
In machine learning terminology, classification is considered an instance of supervised learning, i.e. learning
where a training set of correctly identified observations is available. In our case, the customer's previous
transactions stored in the database serve as the training dataset; the algorithm applies pattern recognition to
each incoming transaction and marks it as fraudulent or legitimate accordingly.
KNN: The K-nearest neighbour (KNN) technique is a simple algorithm that stores all available instances and
classifies any new instance based on a similarity measure. The KNN algorithm is an example of an
instance-based learner. In the nearest neighbour classification method, each new instance is compared to
the existing instances using a distance metric, and the closest existing instance is used to assign the class to
the new one. Sometimes more than one nearest neighbour is used, and the majority class of the closest K
neighbours is assigned to the new instance.
Among the various credit card fraud detection methods, KNN achieves consistently high performance without
a priori assumptions about the distributions from which the training examples are drawn. In the KNN process,
we classify an incoming transaction by finding the point nearest to it. If the nearest neighbour is a fraudulent
transaction, the new transaction is classified as fraudulent; if the nearest neighbour is a legitimate
transaction, it is classified as legitimate.
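
A minimal KNN sketch in Python is shown below; it assumes scikit-learn and pandas, and the feature columns and example values are hypothetical placeholders rather than the final attribute set.

# Sketch only: assumes scikit-learn and pandas are installed; feature names are hypothetical.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Past transactions for one customer; "indicator" is 0 (legitimate) or 1 (fraud).
history = pd.DataFrame({
    "transaction_amount": [25.0, 30.0, 27.5, 950.0],
    "hour":               [9, 13, 18, 3],
    "indicator":          [0, 0, 0, 1],
})

features = ["transaction_amount", "hour"]
knn = KNeighborsClassifier(n_neighbors=1)   # the single closest transaction decides the class
knn.fit(history[features], history["indicator"])

incoming = pd.DataFrame({"transaction_amount": [900.0], "hour": [2]})
print(knn.predict(incoming))                # expected: [1], nearest point is the fraud example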

Random Forest: A random forest is an ensemble of decision trees. The basic principle behind ensemble
methods is that a group of weak learners can come together to form a strong learner. Random forests grow
many decision trees: each individual decision tree is a weak learner, while all the decision trees taken
together form a strong learner. When a new object is to be classified, it is run down each of the trees in the
forest. Each tree gives a classification output, or vote, for a class, and the forest classifies the new object into
the class with the maximum votes. Random forests are fast, and they can efficiently handle unbalanced and
large databases with thousands of features.
In my case, this algorithm comes into play if the incoming transaction is not from an existing customer. In
that case, the attributes of the incoming transaction are analyzed against the transaction attributes of
existing customers.
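
Below is a similarly hedged Random Forest sketch for this fallback case, again assuming scikit-learn; it trains on transactions pooled across existing customers, and the feature columns, example values, and 100-tree setting are illustrative assumptions.

# Sketch only: pooled training data across customers; feature names are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

all_transactions = pd.DataFrame({
    "transaction_amount": [20.0, 45.0, 1200.0, 15.0, 980.0, 33.0],
    "hour":               [10, 14, 3, 9, 2, 19],
    "indicator":          [0, 0, 1, 0, 1, 0],   # 0 = legitimate, 1 = fraud
})

features = ["transaction_amount", "hour"]
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(all_transactions[features], all_transactions["indicator"])

# Incoming transaction from a customer with no history in the database.
new_txn = pd.DataFrame({"transaction_amount": [1100.0], "hour": [4]})
print(forest.predict(new_txn))              # each tree votes; the majority class wins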

4.1.2 Analysis Attributes

i)    Customer ID - customer's unique ID.
ii)   Transaction Amount - amount of the transaction the customer performed.
iii)  Balance Amount - balance left in the account.
iv)   Country - customer's country of residence.
v)    State - customer's state of residence.
vi)   Zip Code - zip code of residence.
vii)  Email ID - customer's email ID.
viii) Hour - time of the transaction.
ix)   POS ID - point-of-sale ID where the transaction was performed.
x)    BIN - Bank Identification Number.
xi)   Transaction Code - type of transaction.
xii)  Indicator - fraudulent or legitimate transaction.

5. Solution Components
5.1.1 Hardware
My application will be cloud based. Every credit card transaction (geospatial data) performed through any
POS terminal will be pushed to the cloud, and the data will pass through each tier (database, application,
presentation). The cloud description below clarifies the processing of this geospatial data.
Cloud Description:

Figure 5: Architecture of fraud detection application in cloud

1. Apache Kafka (network data source) is chosen to feed credit card swipe messages into the architecture.
   Real-time data is published by payment processing systems over Kafka queues. Each transaction has
   hundreds of attributes that can be analyzed in real time to detect patterns of usage. We leverage Kafka's
   integration with Apache Storm to read one value at a time and persist the data into an HBASE cluster.
   Storm is a stream-processing framework that also does micro-batching.
2. Once the machine learning models are defined, incoming data received from the Storm/Spark tier is fed
   into the models to predict outlier transactions or potential fraud (a consumer sketch is given after this
   list).
3. Data that has business relevance and needs to be kept offline can be handled using a storage platform
   based on the Hadoop Distributed File System (HDFS). Historical data can be fed into the machine learning
   models to understand fraud patterns.
4. Output data elements can be written out to HDFS and managed by HBASE. From there, reports and
   visualizations can easily be constructed.
5. Some data needs to be pulled in near real time, accessed in a low-latency pattern, and have calculations
   performed on it. In-memory technology based on Spark is very suitable for this use case, as it not only
   supports a very high write rate but also gives users the ability to store, access, modify, and transfer
   extremely large amounts of distributed data.
6. The second data access pattern that needs to be supported is storage for older data. This is typically
   large-scale historical data. This layer contains the immutable, constantly growing master dataset stored
   on a distributed file system such as HDFS. Besides being a storage mechanism, the data stored in this
   layer can be formatted in a manner suitable for consumption by any tool within the Apache Hadoop
   ecosystem, such as Hive, Pig, or MySQL.
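
As a rough illustration of steps 1 and 2, the Python sketch below consumes swipe messages from a Kafka topic and scores each one with a previously trained model. It assumes the kafka-python package, a broker on localhost, a topic named card-swipes, and JSON messages carrying the feature fields used earlier; all of these names are assumptions for illustration only.

# Sketch only: assumes kafka-python (pip install kafka-python), a broker on
# localhost:9092, a 'card-swipes' topic, and a scikit-learn model saved earlier.
import json
import joblib
from kafka import KafkaConsumer

model = joblib.load("fraud_model.pkl")          # e.g. the KNN or Random Forest sketched above

consumer = KafkaConsumer(
    "card-swipes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    txn = message.value                          # e.g. {"pos_id": "123", "transaction_amount": 900.0, "hour": 2}
    features = [[txn["transaction_amount"], txn["hour"]]]
    label = int(model.predict(features)[0])      # 0 = legitimate, 1 = fraud
    print(f"transaction from POS {txn.get('pos_id')}: {'FRAUD' if label else 'legitimate'}")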

5.1.2 Software
5.1.2.1 Solution Specific Software

Manufacturer / Title    Procurement    Ownership    Installation    Support
Hive                    Open Source    N/A          DBA             DBA
MySQL                   Open Source    N/A          DBA             DBA
Tableau                 License        Self         SA              TABLEAU
Python                  Open Source    N/A          SA              SA
Apache Storm            Open Source    N/A          SA              HORTONWORKS, MAPR
Spark                   Open Source    N/A          SA              CLOUDERA, HORTONWORKS, MAPR

5.1.3 Network

EN0: Public Access
     Description:
     IP Address:
     Subnet Mask:
     Masking Bits:
     VLAN:

EN1: Backup and Recovery Interface
     Description:
     IP Address:
     Subnet Mask:
     Masking Bits:
     VLAN:

RSA: Remote Supervisor Adapter
     Description:
     IP Address:
     Subnet Mask:
     Masking Bits:
     VLAN:

5.1.4 Monitoring
For monitoring my application I will use Amazon CloudWatch, which monitors AWS cloud resources and the
applications we run on AWS. With CloudWatch, we can collect and track metrics, collect and monitor log files,
and set alarms. Amazon CloudWatch can monitor AWS resources such as Amazon EC2 instances, Amazon
DynamoDB tables, and Amazon RDS DB instances, as well as custom metrics generated by applications and
services in EC2, and any log files the applications generate.
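
As one possible way to configure such monitoring programmatically, the sketch below uses boto3 to set a CloudWatch alarm on high CPU utilization for the EC2 instance hosting the application; the region, alarm name, instance ID, and thresholds are placeholder assumptions.

# Sketch only: assumes boto3 is installed and AWS credentials are configured;
# the instance ID, alarm name, and threshold are placeholder values.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="fraud-app-high-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                      # evaluate 5-minute averages
    EvaluationPeriods=2,             # two consecutive breaches trigger the alarm
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
)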

5.1.5 Backup and Recovery


For backup and recovery we use Amazon Simple Storage Service (Amazon S3), which provides a highly
durable storage infrastructure designed for mission-critical and primary data storage. Objects are redundantly
stored on multiple devices across multiple facilities within a region, providing high durability. AWS provides
further protection for data retention and archiving through versioning in Amazon S3, AWS multi-factor
authentication (AWS MFA), bucket policies, and AWS Identity and Access Management (IAM). For archiving
we use Amazon Glacier, which provides extremely low-cost storage for data archiving and backup. It is
optimized for infrequently accessed data, for which retrieval times of several hours are acceptable. Amazon
Glacier is designed for the same durability as Amazon S3.
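
A hedged boto3 sketch of two of these settings is shown below: enabling versioning on a backup bucket and adding a lifecycle rule that transitions older objects to Amazon Glacier. The bucket name, prefix, and 90-day transition are illustrative assumptions.

# Sketch only: assumes boto3 and AWS credentials; bucket name and rule values are placeholders.
import boto3

s3 = boto3.client("s3")

# Keep every version of backup objects so accidental overwrites can be recovered.
s3.put_bucket_versioning(
    Bucket="fraud-app-backups",
    VersioningConfiguration={"Status": "Enabled"},
)

# Move backups older than 90 days to Glacier for low-cost archiving.
s3.put_bucket_lifecycle_configuration(
    Bucket="fraud-app-backups",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-backups",
            "Status": "Enabled",
            "Filter": {"Prefix": "backups/"},
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)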

5.1.6 Disaster Recovery


I was not able to decide at first which RAID level should be selected, RAID 10 or RAID 5, as both levels suit
the application well. Since the primary goal here is to create highly reliable data stores, the decision is to go
with RAID 10 (mirroring without parity and block-level striping) and its instances for disaster recovery of the
databases. RAID 10 is a fault-tolerant configuration known as a "stripe of mirrors": a RAID 0 array of mirrors,
which may be two- or three-way mirrors and requires a minimum of four drives. It is known for excellent
performance and redundancy.

6. Viability Assessment
6.1 Functional Requirements

FR01 - The system shall use RAID level 10 for system availability. (Probability: H)
       Contingency / Mitigation: Make sure standby infrastructure takes over when there is a main system failure.
       Person Responsible: SA

FR02 - There shall be redundancy in the servers for backup and storage.
       Contingency / Mitigation: AWS has disaster recovery and backup systems in place to satisfy this requirement.
       Person Responsible: SA

FR03 - Data that has business relevance should be kept offline for security purposes.
       Contingency / Mitigation: This requirement can be handled using the storage platform based on the Hadoop Distributed File System (HDFS).
       Person Responsible: SA

FR04 - The system shall accurately process fraud and legitimate transactions.
       Contingency / Mitigation: The fraud detection model should be properly tested on test data before deployment.
       Person Responsible: Project Team

6.2 Non-Functional Requirements

NFR01 - The system shall be accessible 24/7.
        Contingency / Mitigation: Verify the KAFKA + Storm integration, as it is responsible for streaming transactions to the HBASE cluster.
        Person Responsible: Support Team

NFR02 - The system shall be highly available.
        Contingency / Mitigation: Make sure standby infrastructure takes over when there is a main system failure.
        Person Responsible: Support Team

NFR03 - The system shall be auditable.
        Contingency / Mitigation: Make sure proper documentation is in place and all access and authentication records are accurate.
        Person Responsible: Project Manager

6.3 Risks

R01 - AWS unavailability.
      Contingency / Mitigation: Make sure local or other alternatives are available.
      Person Responsible: Configuration and Networking Team

R02 - Fraud detection model accuracy.
      Contingency / Mitigation: The model should be designed after properly analyzing the attributes of the transaction data.
      Person Responsible: Project Team

R03 - Regulatory environment.
      Contingency / Mitigation: Ensure that no identifiable personal data is stored where state law prohibits it.
      Person Responsible: Project Team

R04 - Message loss from KAFKA.
      Contingency / Mitigation: Ensure the replication level is set appropriately in the KAFKA configuration.
      Person Responsible: Configuration Team

6.4 Assumptions

A01 - The AWS cloud is secure and reliable.
      Confidence Level: H
      Identified by: Self

A02 - KNN is a good classification approach for tracking fraudulent transactions.
      Identified by: Self

A03 - The Kafka + Storm integration is trustworthy for transaction streaming.
      Confidence Level: M
      Identified by: Self

6.5 Dependencies

D01 - AWS availability.
      Effect on Plan: The application cannot be used if AWS is unavailable.
      Required by Date: Start of project
      Owner: Amazon
      Associated Risk ID: R01

D02 - Proper transaction streaming.
      Effect on Plan: Messages cannot reach the cluster for analysis.
      Required by Date: Start of project
      Owner: Self
      Associated Risk ID: R04

D03 - Regulatory environment.
      Effect on Plan: There are no prohibitions in state laws against using cloud-based infrastructure to store customer-level data.
      Required by Date: From start
      Owner: Legal
      Associated Risk ID: R01