
Textual and Visual Content-Based Anti-Phishing: An SVM Approach

Abstract
A novel framework using an SVM (Support Vector Machine) approach for content-based phishing web page detection is presented. Our model takes into account textual and visual content to measure the similarity between the protected web page and suspicious web pages. A text classifier, an image classifier, and an algorithm fusing the results from the classifiers are introduced. An outstanding feature of this paper is the exploration of an SVM model to estimate the matching threshold. This threshold is required in the classifier for determining the class of a web page, i.e., whether the web page is phishing or not. In the text classifier, the SVM rule is used to calculate the probability that a web page is phishing. In the image classifier, the earth mover's distance is employed to measure the visual similarity, and our SVM model is designed to determine the threshold. In the data fusion algorithm, the SVM theory is used to synthesize the classification results from textual and visual content. The effectiveness of our proposed approach was examined on a large-scale dataset collected from real phishing cases. Experimental results demonstrate that the text classifier and the image classifier we designed deliver promising results, that the fusion algorithm outperforms either of the individual classifiers, and that our model can be adapted to different phishing cases.

Introduction
Malicious people, also known as phishers, create phishing web pages, i.e., forgeries of real web pages, to steal individuals' personal information such as bank accounts, passwords, credit card numbers, and other financial data. Unwary online users can be easily deceived by these phishing web pages because of their high similarity to the real ones. The Anti-Phishing Working Group reported that there were at least 55,698 phishing attacks between January 1, 2009, and June 30, 2009. The latest statistics show that phishing remains a major criminal activity involving great losses of money and personal data.
Automatically detecting phishing web pages has attracted much attention, from security and software providers and financial institutions to academic researchers. Methods for detecting phishing web pages can be classified into industrial toolbar-based anti-phishing, user-interface-based anti-phishing, and web page content-based anti-phishing. To date, techniques for phishing detection used by the industry mainly include authentication, filtering, attack tracing and analyzing, phishing report generating, and network law enforcement. These anti-phishing Internet services are built into e-mail servers and web browsers and are available as web browser toolbars.
These industrial services, however, do not efficiently detect all phishing attacks. Wu et al. conducted a thorough study and analysis of the effectiveness of anti-phishing toolbars, covering three security toolbars and other commonly used browser security indicators. The study indicates that all of the examined toolbars were ineffective at preventing phishing attacks. Reports show that 20 out of 30 subjects were spoofed by at least one phishing attack, 85% of the spoofed subjects indicated that the websites looked legitimate or exactly the same as sites they had visited before, and 40% of the spoofed subjects were tricked by poorly designed web sites. Cranor et al. performed another study, an evaluation of 10 anti-phishing tools. They indicated that only one tool could consistently detect more than 60% of phishing web sites without a high rate of false positives, whilst four tools were not able to recognize 50% of the tested web sites. Apart from these studies on the effectiveness of anti-phishing toolbars, Li and Helenius investigated the usability of five typical anti-phishing toolbars.
They found that the main user interface of the toolbar, warnings, and the help system are the three basic components that should be well designed. They also found that it is beneficial to apply whitelist and blacklist methods together, and that, because of the quality of online traffic, client-side anti-phishing applications should not rely merely on the Internet. Recently, Aburrous et al. developed a resilient model that uses fuzzy logic to quantify and qualify website phishing characteristics with a layered structure and to study the influence of the phishing characteristics at different layers on the final phishing website rate.
Content-based anti-phishing, which refers to using the features of web pages, draws on surface-level characteristics, textual content, and visual content. We clarify that the content of a web page, as discussed here, includes all of the information of the web page, such as the domain name, URL, hyperlinks, terms, images, and forms embedded in the page. Surface-level characteristics have been commonly used by industrial toolbars to detect phishing. For example, Spoof-Guard inspects the domain age, well-known logos, the URL, and links to acquire the characteristics of phishing web pages. Liu et al. proposed the use of a semantic link network (SLN) to automatically identify the phishing target of a given web page.

The method works by first finding the associated web pages of the given web page and then constructing an SLN from all those web pages. A mechanism of reasoning on the SLN is exploited to identify the phishing target. Zhang et al. developed a content-based approach, i.e., the Carnegie Mellon Anti-phishing and Network Analysis Tool, for anti-phishing by employing the idea of robust hyperlinks [16]. Given a web page, this method first calculates the TF-IDF of each term, an algorithm usually used in information retrieval, generates a lexical signature by selecting a few terms, supplies this signature to a search engine, and then matches the domain name of the current web page against several top search results to evaluate whether the current web page is legitimate or not. Another content-based technique, BAPT, is designed to identify phishing websites by using an open-source Bayesian filter on the basis of tokens which are extracted by a document object model (DOM) analyzer.
The concept of a visual approach to phishing detection was first introduced by Liu et al. This approach, which is oriented by the DOM-based visual similarity of web pages, first decomposes the web pages into salient block regions. The visual similarity between two web pages is then evaluated by three metrics, namely, block level similarity, layout similarity, and overall style similarity, which are based on the matching of the salient block regions. Fu et al. followed the overall strategy, but proposed another method to calculate the visual similarity of web pages. They first converted HTML web pages into images and then employed the earth mover's distance method to calculate the similarity of the images. This approach only investigates phishing detection at the pixel level of web pages without considering the text level. Apart from these approaches to detect phishing web pages, content-based methods for detecting phishing emails have also been widely studied, especially using machine learning techniques.

Objective

The main objectives of this project are as follows:

To detect phishing web pages by using the SVM algorithm.
To classify web pages by using textual and visual SVM classification algorithms.
To combine the textual and visual classification results by using a fusion algorithm.
To compare the fused results of genuine and suspicious web pages by estimating the probability that a given web page is phishing.

Scope Of Project
The main scope of the project is as follows:
To detect whether a website is a phishing website or not.
To detect whether a website has been hacked by an attacker or not.
To compare genuine and attacked websites by examining their fusion results.

Project Description

Existing System
The phishing detection techniques used by the existing system mainly include authentication, filtering, and attack tracing and analyzing. Toolbar-based anti-phishing guides the user to interact with trusted websites; toolbars such as security toolbars and browser security toolbars are used in the system. Methods for detecting phishing web pages can be classified into industrial toolbar-based anti-phishing, user-interface-based anti-phishing, and web page content-based anti-phishing. Techniques for phishing detection used by the industry mainly include authentication, filtering, attack tracing and analyzing, phishing report generating, and network law enforcement. These anti-phishing Internet services are built into e-mail servers and web browsers and are available as web browser toolbars.
Content-based anti-phishing, which refers to using the features of web pages, draws on surface-level characteristics, textual content, and visual content. We clarify that the content of a web page, as discussed here, includes all of the information of the web page, such as the domain name, URL, hyperlinks, terms, images, and forms embedded in the page. Surface-level characteristics have been commonly used by industrial toolbars to detect phishing. For example, Spoof-Guard inspects the domain age, well-known logos, the URL, and links to acquire the characteristics of phishing web pages. Liu et al. proposed the use of a semantic link network (SLN) to automatically identify the phishing target of a given web page.
The method works by first finding the associated web pages of the given web page and then constructing an SLN from all those web pages. A mechanism of reasoning on the SLN is exploited to identify the phishing target. Zhang et al. developed a content-based approach, i.e., the Carnegie Mellon Anti-phishing and Network Analysis Tool, for anti-phishing by employing the idea of robust hyperlinks. Given a web page, this method first calculates the TF-IDF of each term, an algorithm usually used in information retrieval, generates a lexical signature by selecting a few terms, supplies this signature to a search engine, and then matches the domain name of the current web page against several top search results to evaluate whether the current web page is legitimate or not. Another content-based technique, BAPT, is designed to identify phishing websites by using an open-source Bayesian filter on the basis of tokens which are extracted by a document object model (DOM) analyzer.
The concept of a visual approach to phishing detection was first introduced by Liu et al. This approach, which is oriented by the DOM-based visual similarity of web pages, first decomposes the web pages into salient block regions. The visual similarity between two web pages is then evaluated by three metrics, namely, block level similarity, layout similarity, and overall style similarity, which are based on the matching of the salient block regions. Fu et al. followed the overall strategy, but proposed another method to calculate the visual similarity of web pages. They first converted HTML web pages into images and then employed the earth mover's distance method to calculate the similarity of the images. This approach only investigates phishing detection at the pixel level of web pages without considering the text level. Apart from these approaches to detect phishing web pages, content-based methods for detecting phishing emails have also been widely studied, especially using machine learning techniques.

Disadvantages

Not all phishing attacks are detected by the existing detection techniques.
The toolbar technique is an ineffective way of preventing phishing attacks on web pages.
Online traffic degrades the quality of web pages and their applications.
The existing approach only investigates phishing detection at the pixel level of web pages without considering the text level.
Existing systems such as CANTINA and the toolbar-based technique are very difficult to implement.
Not all phishing web pages are detected by the CANTINA and toolbar-based systems.

Proposed System

The content representation of the proposed system is divided into two categories.


1) Textual content: Textual content in this paper is defined as the terms or words that appear in a given web page, except for the stop words. We first separate the main text content from the HTML tags and apply stemming to each word. Stems are used as basic features instead of the original words. For example, "program," "programs," and "programming" are stemmed into "program" and considered as the same word.
2) Visual content: Visual content refers to the characteristics of the overall style, the layout, and the block regions, including the logos, images, and forms. Visual content can be further specified as the color of the web page background, the font size, the font style, the locations of images and logos, etc. In addition, the visual content is also user-dependent. On the other hand, we can consider the web page at the pixel level, i.e., as an image that enables the total representation of the visual content of the web page.
The proposed anti-phishing approach contains the following components.
1) A text classifier using the SVM rules to handle the text content extracted from a
given web page.
2) An image classifier using the SVM similarity assessment to handle the pixel-level content of a given web page that has been transformed into an image.
3) An SVM approach to estimate the threshold used in the classifiers through offline
training.
4) A data fusion algorithm to combine the results from the text classifier and the
image classifier. The algorithm employs the SVM approach as well.

The system includes a training section, which estimates the statistics of historical data, and a testing section, which examines the incoming testing web pages. The statistics of the web page training set consist of the probabilities that a textual web page belongs to the categories, the matching thresholds of the classifiers, and the posterior probability of data fusion. Through preprocessing, the content representations, i.e., textual and visual, are rapidly extracted from a given testing
web page. The text classifier is used to classify the given web page into the
corresponding category based on the textual features. The image classifier is used
to classify the given web page into the corresponding category based on the visual
content. Then the fusion algorithm is used to combine the detection results
delivered by the two classifiers. The detection results are eventually transmitted to
the online users or the web browsers.
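As a rough illustration only, the sketch below wires these stages together in Java. The class and method names (PhishingDetector, TextClassifier, ImageClassifier, FusionAlgorithm) are our own assumptions for this document, not part of any published implementation.

public class PhishingDetector {

    // Hypothetical component interfaces; names are illustrative, not from the paper.
    interface TextClassifier  { double similarity(String pageText); }
    interface ImageClassifier { double similarity(java.awt.image.BufferedImage pageImage); }
    interface FusionAlgorithm { boolean isPhishing(double textScore, double imageScore); }

    private final TextClassifier textClassifier;
    private final ImageClassifier imageClassifier;
    private final FusionAlgorithm fusion;

    PhishingDetector(TextClassifier text, ImageClassifier image, FusionAlgorithm fusion) {
        this.textClassifier = text;
        this.imageClassifier = image;
        this.fusion = fusion;
    }

    // Classifies one testing web page using both content representations.
    boolean detect(String pageText, java.awt.image.BufferedImage pageImage) {
        double textScore  = textClassifier.similarity(pageText);    // textual similarity to the protected page
        double imageScore = imageClassifier.similarity(pageImage);  // visual similarity to the protected page
        return fusion.isPhishing(textScore, imageScore);            // fused decision reported to the user or browser
    }
}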
In preprocessing, the main content of a given web page is first separated from its HTML tags. In order to form a histogram vector for each web page, we construct a word vocabulary. In this system, we extract all the words from a given protected web page and apply stemming to each word. It is worth noting that naive word-based extraction may deliver more discriminative information than this stemming-based extraction, but we must point out that word-based extraction will heavily increase the vocabulary size. In addition, using stemming delivers more robust detection, because phishers may manipulate the textual content through changes of tense or from active to passive voice. The choice between stemming-based and naive word-based extraction therefore depends on the objective: for exact matching of textual content, we suggest using naive word-based extraction, whilst for a smaller vocabulary size and more robust detection we recommend using stemming. In this paper, stems are used as basic features instead of the original words. We store the stemmed words to construct the vocabulary. Given a web page, we then form a histogram vector in which each component represents the frequency of a vocabulary term in that page, and n denotes the total number of components in the vector, i.e., the vocabulary size; a small Java sketch of this construction is given after the three points below. We explain three points here.
1) We do not extract words from all the web pages in a dataset to construct the
vocabulary, because phishers usually only use the words from a targeted web page
to scam unwary users.
2) For the sake of simplicity, we do not use any feature extraction algorithms in
the process of vocabulary construction.
3) We do not take the semantic associations of web pages into account, because
the sizes of most phishing web pages are small.
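The sketch below illustrates, under our own simplifying assumptions, how a stemmed vocabulary and a term-frequency histogram vector could be built in Java; the stem() method is only a crude placeholder for a real stemming algorithm such as Porter's.

import java.util.*;

// Illustrative sketch: building a stemmed vocabulary from the protected page and a
// term-frequency histogram for any page over that vocabulary.
public class TermHistogram {

    // Crude placeholder for stemming; a real system would use a proper stemmer.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ming", "ing", "s"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Vocabulary: ordered set of stems taken from the protected web page only.
    static List<String> buildVocabulary(String protectedPageText, Set<String> stopWords) {
        LinkedHashSet<String> vocabulary = new LinkedHashSet<>();
        for (String token : protectedPageText.split("\\W+")) {
            if (!token.isEmpty() && !stopWords.contains(token.toLowerCase())) {
                vocabulary.add(stem(token));
            }
        }
        return new ArrayList<>(vocabulary);
    }

    // Histogram vector: component i holds the frequency of vocabulary stem i in the page,
    // so the vector has n components where n is the vocabulary size.
    static int[] histogram(String pageText, List<String> vocabulary) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < vocabulary.size(); i++) index.put(vocabulary.get(i), i);
        int[] vector = new int[vocabulary.size()];
        for (String token : pageText.split("\\W+")) {
            if (token.isEmpty()) continue;
            Integer i = index.get(stem(token));
            if (i != null) vector[i]++;
        }
        return vector;
    }
}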
In reality, using only text content is insufficient to detect phishing web pages. This
method will usually result in high false positives, because phishing web pages are
highly similar to the targeted web pages not only in textual content but also in
visual content such as famous logos, layout, and overall style. In this system, we follow the same overall approach, using the SVM to measure the visual similarity between an incoming web page and a protected web page.
First, we retrieve the suspected web pages and protected web pages from the web. Second, we generate their signatures, which are used to calculate the visual similarity between them. To this end, all the web page images are normalized into fixed-size square images. We use these normalized images to generate the signature of each web page.
The image classifier is implemented by setting a threshold, which is estimated in a subsequent section. If the visual similarity between a suspected web page and the protected web page exceeds the threshold, the web page is classified as phishing; otherwise, it is classified as legitimate.

The overall implementation process of the image classifier is summarized as follows (a small Java sketch of Steps 1 and 2 is given after the steps).
Step 1: Obtain the image of a web page from its URL and perform normalization.
Step 2: Generate the visual signature of the input image, including the color and coordinate features.
Step 3: Calculate the visual similarity between the input web page image and the protected web page image using the SVM approach.
Step 4: Classify the input web page into the corresponding category according to the comparison of the visual similarity and the threshold.
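The following Java sketch illustrates Steps 1 and 2 under our own assumptions: the normalized image size (100 x 100), the per-channel color quantization (8 bins), and the signature layout (color category with pixel weight and centroid coordinates) are illustrative choices, not values taken from the system.

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import javax.imageio.ImageIO;

// Sketch of Steps 1 and 2: normalize a page screenshot to a fixed-size square image and
// derive a signature of quantized colors with pixel weights and centroid coordinates.
public class VisualSignature {

    static final int SIZE = 100;        // assumed edge length of the normalized image
    static final int COLOR_BINS = 8;    // assumed per-channel quantization levels

    static BufferedImage normalize(BufferedImage source) {
        BufferedImage scaled = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.drawImage(source, 0, 0, SIZE, SIZE, null);    // rescale to a fixed-size square
        g.dispose();
        return scaled;
    }

    // Signature entry: quantized color -> {pixel count, centroid x, centroid y}.
    static Map<Integer, double[]> signature(BufferedImage img) {
        Map<Integer, double[]> sig = new HashMap<>();
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = ((rgb >> 16) & 0xFF) * COLOR_BINS / 256;
                int gr = ((rgb >> 8) & 0xFF) * COLOR_BINS / 256;
                int b = (rgb & 0xFF) * COLOR_BINS / 256;
                int key = (r * COLOR_BINS + gr) * COLOR_BINS + b;
                double[] entry = sig.computeIfAbsent(key, k -> new double[3]);
                entry[0] += 1;   // accumulate pixel weight for this color category
                entry[1] += x;   // accumulate x for the centroid
                entry[2] += y;   // accumulate y for the centroid
            }
        }
        for (double[] e : sig.values()) { e[1] /= e[0]; e[2] /= e[0]; }   // finish centroids
        return sig;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage page = ImageIO.read(new File(args[0]));   // screenshot of the web page
        System.out.println("signature entries: " + signature(normalize(page)).size());
    }
}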

The overall implementation procedure of the fusion algorithm is summarized as follows (a small Java sketch follows the steps).

Step 1: Input the training set, train a text classifier and an image classifier, and then collect similarity measurements from the different classifiers.
Step 2: Partition the interval of similarity measurements into sub-intervals.
Step 3: Estimate the posterior probabilities conditioned on all the sub-intervals for the text classifier.
Step 4: Estimate the posterior probabilities conditioned on all the sub-intervals for the image classifier.
Step 5: For a new testing web page, classify it into the corresponding category by using the text classifier and the image classifier.
Step 6: Display the result of whether the given web page is phishing or not.
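A minimal sketch of the fusion idea is given below. The number of sub-intervals, the Laplace smoothing, and the equal-weight averaging of the two posteriors are our own assumptions for illustration, not details from the system.

// Sketch of the fusion idea: partition each classifier's similarity range into
// sub-intervals, estimate the posterior probability of phishing per sub-interval from
// training data, and combine the two posteriors for a new page.
public class ScoreFusion {

    static final int BINS = 10;                              // assumed number of sub-intervals
    final double[] phishingPosterior = new double[BINS];     // P(phishing | sub-interval)

    // scores: similarity values in [0,1] from one classifier; labels: true means phishing
    void train(double[] scores, boolean[] labels) {
        int[] total = new int[BINS];
        int[] phishing = new int[BINS];
        for (int i = 0; i < scores.length; i++) {
            int b = bin(scores[i]);
            total[b]++;
            if (labels[i]) phishing[b]++;
        }
        for (int b = 0; b < BINS; b++) {
            // Laplace smoothing so that empty sub-intervals still get a usable estimate.
            phishingPosterior[b] = (phishing[b] + 1.0) / (total[b] + 2.0);
        }
    }

    double posterior(double score) { return phishingPosterior[bin(score)]; }

    static int bin(double score) {
        int b = (int) (score * BINS);
        return Math.min(Math.max(b, 0), BINS - 1);
    }

    // Combine the text and image posteriors; equal weights and the 0.5 cut-off are assumptions.
    static boolean isPhishing(ScoreFusion textModel, double textScore,
                              ScoreFusion imageModel, double imageScore) {
        double combined = 0.5 * textModel.posterior(textScore)
                        + 0.5 * imageModel.posterior(imageScore);
        return combined > 0.5;
    }
}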

Advantages
The data fusion framework enables us to directly incorporate the multiple
results produced by different classifiers.
The SVM algorithm is used for classifying both the textual and visual
content.
All phishing websites will be detected by using this approach.

Literature Survey

Detecting phishing web pages with visual similarity assessment based on earth mover's distance

An effective approach to phishing Web page detection is proposed, which uses the Earth Mover's Distance (EMD) to measure Web page visual similarity. We first convert the involved Web pages into low-resolution images and then use color and coordinate features to represent the image signatures. We use EMD to calculate the signature distances of the images of the Web pages. We train an EMD threshold vector for classifying a Web page as either a phishing or a normal one. Large-scale experiments with 10,281 suspected Web pages are carried out to show high classification precision, phishing recall, and applicable time performance for an online enterprise solution. We also compare our method with two others to demonstrate its advantage. We also built a real system which is already used online and has caught many real phishing cases.
Phishing web pages are forged web pages that are created by malicious people
to mimic web pages of real web sites. Most of these kinds of web pages have
high visual similarities to scam their victims. Some of these kinds of web pages
look exactly like the real ones. Unwary Internet users may be easily deceived
by this kind of scam. Victims of phishing web pages may expose their bank
account, password, credit card number, or other important information to the
phishing Web page owners. Phishing is a relatively new Internet crime in
comparison with other forms, e.g., virus and hacking. More and more phishing
Web pages have been found in recent years at an accelerating rate. A report from the Anti-Phishing Working Group shows that the number of phishing Web pages is increasing each month by 50 percent and that usually 5 percent of the phishing e-mail receivers respond to the scams. Also, 15,050 phishing cases were reported in June 2005 alone. This problem has drawn high attention from both industry and the academic research domain since it is a severe security and privacy problem and has caused huge negative impacts on the Internet world. It is threatening people's confidence in using the Web to conduct online finance-related activities.
In this system, we propose an effective approach for detecting phishing Web pages, which employs the Earth Mover's Distance (EMD) to calculate the visual similarity of Web pages. The most important reason that Internet users become phishing victims is that phishing Web pages always have high visual similarity to the real Web pages, such as visually similar block layouts, dominant colors, images, and fonts. We follow the anti-phishing strategy of obtaining suspected Web pages, which are supposed to be collected from URLs in those e-mails containing keywords associated with protected Web pages. We
first convert them into normalized images and then represent their image
signatures with features composed of dominant color category and its
corresponding centroid coordinate to calculate the visual similarity of two Web
pages.
The linear programming algorithm for EMD is applied to visual similarity
computation of the two signatures. An anti-phishing system may be requested to
protect many Web pages. A threshold is calculated for each protected Web page
using supervised training. If the EMD-based visual similarity of a Web page
exceeds the threshold of a protected Web page, we classify the Web page as a
phishing one.
As anti-phishing techniques evolve, various phishing techniques and more complicated, hard-to-detect methods are used by phishers. The most straightforward way for a phisher to scam people is to make the phishing Web pages similar to their targets.
A phishing strategy includes both Web link obfuscation and Web page
obfuscation. Web link obfuscation can be carried out in several basic ways, such as adding a suffix to the domain name of the URL, using an actual link different from the visible link, and utilizing system bugs in real Web sites to redirect the link to the phishing Web pages. Previous research works on duplicated document detection
approaches focus on plain text documents and use pure text features in
similarity measure, such as collection statistics, syntactic analysis, displaying
structure, visual-based understanding, vector space model, etc. Hoad and Zobel
have surveyed various methods on plagiarized document detection. However, as
Liu et al. demonstrated, pure text features are not sufficient for phishing Web
page detection since phishing Web pages mainly employ visual similarity to
scam users.
There are many anti-phishing techniques popularly used by the industry. The
most popular and major anti-phishing methods include authentication, which
includes e-mail authentication, and Web page authentication, filtering, which
includes e-mail filtering and Web page filtering, attack analyzing and tracing,
immune-system-like phishing report and detection, and network law
enforcement. Many user interface based anti-phishing approaches, including
Web browser toolbars, e-mail client agent toolbars, and distributed server
applications, are also commonly used.

Usability evaluation of anti-phishing toolbars

Phishing is considered one of the most serious threats to the Internet and e-commerce. Phishing attacks abuse trust with the help of deceptive e-mails, fraudulent web sites, and malware. In order to prevent phishing attacks, some organizations have implemented Internet browser toolbars for identifying deceptive activities. However, the levels of usability and the user interfaces vary. Some of the toolbars have obvious usability problems, which can ultimately affect their performance. For the sake of future improvement, usability evaluation is indispensable. We will discuss the usability of five typical anti-phishing toolbars: the built-in phishing prevention in Internet Explorer 7.0, the Google toolbar, the Netcraft Anti-phishing toolbar, and Spoof-Guard. In addition, we included an Internet Explorer plug-in we have developed, the Anti-phishing IEPlug. Our hypothesis was that the usability of anti-phishing toolbars, and as a consequence also the security of the toolbars, could be improved. Indeed, according to the heuristic usability evaluation, a number of usability issues were found. In this article, we describe the anti-phishing toolbars, discuss the anti-phishing toolbar usability evaluation approach, and present our findings.
Phishing has become one of the most serious network threats. Similar to other malicious attacks, phishing can cause losses for both financial institutions and consumers. However, unlike most crackers, phishers gain benefits by accessing credential information, instead of causing system or network damage. Moreover, phishing damages trust in e-commerce. A devastating attack does not require any emerging techniques. According to the March 2006 report of the Anti-Phishing Working Group, the most frequently used artifices are deceptive e-mails or web pages, Trojan horses, and key loggers. Moreover, more than 80% of fraudulent web domains contain ambiguous names, for example, some form of the target name in the URL or only an IP address without a host name. These ambiguous domain names are hazardous for careless consumers.
Because of careless usability design in security software, phishers can easily take advantage of poor usability. In order to offer more reliable security, anti-phishing toolbars should be easier to use. Moreover, as end-users must be able to use the toolbars and make correct choices, usability evaluation of these toolbars is important. Our research objective was to find out general usability design principles for anti-phishing client-side applications. Such information may provide valuable guidance for improving the usability and security of anti-phishing applications. Based on this motivation, we conducted a heuristic usability evaluation of five toolbars. However, we must advise the reader that we are not making a comparison of the toolbars in this system. An objective comparison would require a different approach and should concentrate on assessing phishing prevention capabilities.
In this system, we will present our evaluation and discuss the issues found during it. In the following parts of this paper, we will first introduce the features and characteristics of these five toolbars in order to make readers aware of their basic functionalities from a technical perspective. After that, we will present the heuristic evaluation methodology and the evaluation results we found. Based on the results, we will give advice for improving the toolbars' usability design. In conclusion, the usability evaluation is summarized, and the impact of weak usability performance of the toolbars is discussed.
1. Toolbars based on client-server architecture, with anti-phishing prevention combined with other functionalities. These types of toolbars need to communicate with their servers in order to protect users from being spoofed. However, these kinds of toolbars are not tailored just for phishing prevention. Instead, there are other functionalities that are not related to anti-phishing. For example, Google's Safe Browsing functionality is only one of the toolbar's features. The other features include the Enhanced Search Box, AutoFill, etc.
2. Toolbars based on client-server architecture and designed only for phishing prevention. These are also based on the client-server structure, but the functionality is only phishing prevention. Therefore, users can only find phishing-related functionalities in their interfaces. For example, the Netcraft toolbar is designed only for phishing prevention. Even though some of its functionalities are not directly associated with anti-phishing, these are designed to support the identification of fraudulent web pages.
3. Toolbars installed on the local computer and detecting fraudulent websites from users' browsing information. Because of the lack of a server side, these kinds of toolbars have to use the browsing information or the browsing history for detection. This kind of data cannot be managed by the toolbars themselves, but by web browsers. Therefore, users are required to configure the browsing records carefully.
4. Toolbars installed on the local computer and detecting fraudulent websites. Different from the previous type, these toolbars must use some other methods to identify spoofing websites, like a whitelist or general detection. Compared with the third type of toolbar, users may more freely customize their own preferences, e.g., authentic web sites.

Web wallet: Preventing phishing attacks by revealing user intentions


We introduce a new anti-phishing solution, the Web Wallet. The Web Wallet is a
browser sidebar which users can use to submit their sensitive information
online. It detects phishing attacks by determining where users intend to submit
their information and suggests an alternative safe path to their intended site if
the current site does not match it. It integrates security questions into the user's workflow so that its protection cannot be ignored by the user. We conducted a
user study on the Web Wallet prototype and found that the Web Wallet is a
promising approach. In the study, it significantly decreased the spoof rate of
typical phishing attacks from 63% to 7%, and it effectively prevented all
phishing attacks as long as it was used. A majority of the subjects successfully
learned to depend on the Web Wallet to submit their login information.
However, the study also found that spoofing the Web Wallet interface itself was
an effective attack. Moreover, it was not easy to completely stop all subjects
from typing sensitive information directly into web forms.
Phishing has become a significant threat to Internet users. Phishing attacks
typically use legitimate-looking but fake emails and websites to deceive users
into disclosing private information to the attacker. Phishing keeps growing:
according to the Anti-Phishing Working Group, 15,244 unique phishing attacks and 7,197 unique phishing sites were reported in December 2005, with 121
legitimate brands being hijacked.
To solve the phishing problems that we have observed in controlled studies and
in real life, we have designed a new solution, called the Web Wallet, to prevent
phishing attacks. The main part of the Web Wallet is a browser sidebar for
entering sensitive information. When a user sees a web form requesting her

sensitive data, she presses a dedicated security key on the keyboard to open the
Web Wallet. Using the Web Wallet, she may type her data or retrieve her stored
data. The data is then filled into the web form. But before the fill-in, the Web
Wallet checks if the current site is good enough to receive the sensitive data. If
the current site is not qualified, the Web Wallet requires the user to explicitly
indicate where she wants the data to go. If the user's intended site is not the current site, the Web Wallet shows a warning to the user about this discrepancy,
and gives her a safe path to her intended site. There is one simple rule to
correctly use the Web Wallet: Always use the Web Wallet to submit sensitive
information by pressing the security key first. Equivalently, never submit
sensitive information directly through a web form because it is not a secure
practice.
We have run a user study to test the Web Wallet interface. The results are
promising:
The Web Wallet significantly decreased the spoof rate of normal phishing
attacks from 63% to 7%.
All the simulated phishing attacks in the study were effectively prevented by
the Web Wallet as long as it was used.
By disabling direct input into web forms and thus making itself the only way
to input sensitive information, the Web Wallet successfully trained a majority of
the subjects to use it to protect their sensitive information submission.
But there are also negative results which we plan to deal with in future research:
The subjects totally failed to differentiate the authentic Web Wallet interface
from a fake Web Wallet presented by a phishing site. This is a new type of phishing attack. Instead of mimicking a legitimate site's appearance, the attacker fakes the interface of security software that is run by the user.
It is not easy to completely stop all subjects from typing sensitive information
directly into web forms. Users are familiar with web form submission and have
a strong tendency to use it.
Phishing attacks exploit the gap between the way a user perceives a
communication and the actual effect of the communication. The computer
system and the human user have two different understandings of a web site. The
user recognizes a site based on its visual appearance and the semantic meaning
of its content. But the browser recognizes a site based on system properties,
e.g., whether the site has an SSL certificate, when and where this site was registered,
etc. As a result, neither the computer system nor the human user alone can
effectively prevent phishing attacks.
On the one hand, it is hard, if not impossible, for the computer to always
correctly derive the semantic meaning of the content. On the other hand,
ordinary users do not know how to correctly interpret the system properties.
The user interface is thus the exact place to bridge the gap between the user's mental model and the system model by letting the human user and the system
share what they individually know about the current site. The Web Wallet helps
the users transfer their real intention to the browser, especially when they are
doing phishing-critical actions, such as submitting sensitive data to web sites.
When a user uses the Web Wallet, a dedicated interface for sensitive information submission, she implicitly indicates that the data being submitted is sensitive. The user further indicates the sensitive data type by using the appropriate card in the Web Wallet.

Intelligent phishing website detection system using fuzzy techniques


Detecting and identifying e-banking phishing websites is a complex and dynamic problem involving many factors and criteria. Because of the subjective considerations and the ambiguities involved in the detection, fuzzy data mining techniques can be an effective tool for assessing and identifying e-banking phishing websites, since they offer a more natural way of dealing with quality factors rather than exact values. In this system, we present a novel approach to overcome the fuzziness in the e-banking phishing website assessment and propose an intelligent, resilient, and effective model for detecting e-banking phishing websites. The proposed model is based on fuzzy logic combined with data mining algorithms to characterize the e-banking phishing website factors and to investigate its techniques by classifying their phishing types and defining six e-banking phishing website attack criteria with a layered structure. A case study was applied to illustrate and simulate the phishing process. Our experimental results showed the significance and importance of the e-banking phishing website criteria represented by layer one and the varying influence of the phishing characteristic layers on the final e-banking phishing website rate.
E-banking phishing websites are forged websites created by malicious people to mimic real e-banking websites. Most of these kinds of Web pages
have high visual similarities to scam their victims. Some of these Web pages
look exactly like the real ones. Unwary Internet users may be easily deceived
by this kind of scam. Victims of e-banking phishing Websites may expose their
bank account, password, credit card number, or other important information to
the phishing Web page owners. The impact is the breach of information security
through the compromise of confidential data and the victims may finally suffer losses of money or other kinds. Phishing is a relatively new Internet crime in comparison with other forms, e.g., virus and hacking.
An e-banking phishing website is a very complex issue to understand and analyze, since it joins technical and social problems with each other, and there is no known single silver bullet to entirely solve it. The motivation behind this study is to create a resilient and effective method that uses fuzzy data mining algorithms and tools to detect e-banking phishing websites in an automated manner. DM approaches such as neural networks, rule induction, and decision trees can be a useful addition to the fuzzy logic model. They can deliver answers to business questions that traditionally were too time-consuming to resolve, such as "Which are the most important e-banking phishing website characteristic indicators and why?", by analyzing massive databases and historical data for training purposes.
Fuzzy Data Mining Algorithms & Techniques
The approach described here is to apply fuzzy logic and data mining algorithms
to assess e-banking phishing website risk on the 27 characteristics and factors
which stamp the forged website. The essential advantage offered by fuzzy logic
techniques is the use of linguistic variables to represent Key Phishing
characteristic indicators and relating e-banking phishing website probability.
1) Fuzzification
In this step, linguistic descriptors such as High, Low, and Medium are assigned to a range of values for each key phishing characteristic indicator. Valid ranges of the inputs are considered and divided into classes, or fuzzy sets. For example, the length of a URL address can range from low to high, with other values in between; we cannot specify clear boundaries between classes. The degree of belongingness of the values of the variables to any selected class is called the degree of membership. A membership function is designed for each phishing characteristic indicator; it is a curve that defines how each point in the input space is mapped to a membership value between 0 and 1. Linguistic values are assigned for each phishing indicator as Low, Moderate, and High, while the e-banking phishing website risk rate is described as Very Legitimate, Legitimate, Suspicious, Phishy, and Very Phishy. Each input ranges from 0 to 10, while the output ranges from 0 to 100.
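As a hedged illustration of this fuzzification step, the Java sketch below maps one indicator value on the 0-10 scale to membership degrees for Low, Moderate, and High; the triangular shapes and the breakpoints at 0, 5, and 10 are assumptions made for the example, not values taken from the study.

// Fuzzification of one indicator value on the assumed 0-10 input scale into membership
// degrees for the linguistic values Low, Moderate, and High.
public class Fuzzifier {

    // Low: 1 at 0, falling linearly to 0 at 5.
    static double low(double x)      { return clamp(1.0 - x / 5.0); }
    // Moderate: triangle peaking at 5, zero at 0 and 10.
    static double moderate(double x) { return clamp(1.0 - Math.abs(x - 5.0) / 5.0); }
    // High: 0 at 5, rising linearly to 1 at 10.
    static double high(double x)     { return clamp((x - 5.0) / 5.0); }

    static double clamp(double v) { return Math.max(0.0, Math.min(1.0, v)); }

    public static void main(String[] args) {
        double urlLength = 7.0;   // example indicator value on the 0-10 scale
        System.out.printf("low=%.2f moderate=%.2f high=%.2f%n",
                low(urlLength), moderate(urlLength), high(urlLength));
    }
}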
2) Rule Generation using Classification Algorithms
Having specified the risk of an e-banking phishing website and its key phishing characteristic indicators, the next step is to specify how the e-banking phishing website probability varies. Experts provide fuzzy rules in the form of if-then statements that relate the e-banking phishing website probability to various levels of the key phishing characteristic indicators based on their knowledge and experience. On that matter, instead of employing an expert system, we utilized data mining classification and association rule approaches in our new e-banking phishing website risk assessment model to automatically find significant patterns of phishing characteristics or factors in the e-banking phishing website archive data. In particular, we used a number of different existing data mining classification techniques implemented within the WEKA and CBA packages. JRip (WEKA's implementation of RIPPER), PART, and Prism algorithms were selected to learn the relationships among the selected phishing features.

3) Aggregation of the rule outputs


This is the process of unifying the outputs of all discovered rules by combining the membership functions of all the rules' consequents, previously scaled, into a single fuzzy set.
4) De-fuzzification
This is the process of transforming the fuzzy output of a fuzzy inference system into a crisp output. Fuzziness helps to evaluate the rules, but the final output has to be a crisp number. The input to the de-fuzzification process is the aggregated output fuzzy set, and the output is a single number. This step was done using the centroid technique, since it is a commonly used method.
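A minimal sketch of centroid de-fuzzification, assuming a discretized 0-100 output scale and a hypothetical aggregated membership function, might look as follows.

import java.util.function.DoubleUnaryOperator;

// Centroid de-fuzzification: the crisp output is the membership-weighted average of the
// output domain (0-100 for the phishing risk rate), computed over a discretized grid.
public class Defuzzifier {

    static double centroid(DoubleUnaryOperator aggregated, double min, double max, double step) {
        double numerator = 0.0, denominator = 0.0;
        for (double x = min; x <= max; x += step) {
            double mu = aggregated.applyAsDouble(x);   // membership of the aggregated output set at x
            numerator += x * mu;
            denominator += mu;
        }
        return denominator == 0.0 ? (min + max) / 2.0 : numerator / denominator;
    }

    public static void main(String[] args) {
        // Example: a single triangular output set peaking at 70 on the 0-100 risk scale.
        double risk = centroid(x -> Math.max(0.0, 1.0 - Math.abs(x - 70.0) / 20.0), 0, 100, 1);
        System.out.println("crisp phishing risk = " + risk);   // close to 70
    }
}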

There are a number of challenges posed by doing post-hoc classification of e-banking phishing websites. Most of these challenges apply only to the e-banking phishing website data and materialize as a form of information loss, which has the net effect of increasing the false negative rate. The age of the dataset is the most significant problem, which is particularly relevant for the phishing corpus. E-banking phishing websites are short-lived, often lasting only on the order of 48 hours. Some of our features can therefore not be extracted from older websites, making our tests difficult. The average phishing site stays live for approximately 2.25 days. Furthermore, the process of transforming the original e-banking phishing website archives into record feature datasets is not without error; it requires the use of heuristics at several steps. Thus, high accuracy from the data mining algorithms cannot be expected. However, the evidence supporting the golden nuggets comes from a number of different algorithms and feature sets, and we believe it is compelling.

CANTINA: A content-based approach to detecting phishing web sites


Phishing is a significant problem involving fraudulent email and web sites that
trick unsuspecting users into revealing private information. In this paper, we
present the design, implementation, and evaluation of CANTINA, a novel,
content-based approach to detecting phishing web sites, based on the TF-IDF
information retrieval algorithm. We also discuss the design and evaluation of
several heuristics we developed to reduce false positives. Our experiments show
that CANTINA is good at detecting phishing sites, correctly labeling
approximately 95% of phishing sites.
Recently, there has been a dramatic increase in phishing, a kind of attack in
which victims are tricked by spoofed emails and fraudulent web sites into
giving up personal information. Phishing is a rapidly growing problem, with
9,255 unique phishing sites reported in June of 2006 alone. It is unknown
precisely how much phishing costs each year since impacted industries are
reluctant to release figures; estimates range from $1 billion to $2.8 billion per
year. To respond to this threat, software vendors and companies have released a
variety of anti-phishing toolbars.
For example, eBay offers a free toolbar that can positively identify eBay-owned
sites, and Google offers a free toolbar aimed at identifying any fraudulent site.
As of September 2006, the free software download site download.com listed 84 anti-phishing toolbars. However, when we conducted an evaluation of ten anti-phishing tools for a previous study, we found that only one tool could
consistently detect more than 60% of phishing web sites without a high rate of
false positives. Thus, we argue that there is a strong need for better automated
detection algorithms.

In this system, we present the design, implementation, and evaluation of CANTINA, a novel content-based approach for detecting phishing web sites.
CANTINA examines the content of a web page to determine whether it is
legitimate or not, in contrast to other approaches that look at surface
characteristics of a web page, for example the URL and its domain name.
CANTINA makes use of the well-known TF-IDF algorithm used in information
retrieval, and more specifically, the Robust Hyperlinks algorithm previously
developed by Phelps and Wilensky for overcoming broken hyperlinks. Our
results show that CANTINA is quite good at detecting phishing sites, detecting
94-97% of phishing sites.
We also show that we can use CANTINA in conjunction with heuristics used
by other tools to reduce false positives, while lowering phish detection rates
only slightly. We present a summary evaluation, comparing CANTINA to two
popular anti-phishing toolbars that are representative of the most effective tools
for detecting phishing sites currently available. Our experiments show that
CANTINA has comparable or better performance to Spoof-Guard with far
fewer false positives, and does about as well as Netcraft. Finally, we show that
CANTINA combined with heuristics is effective at detecting phishing URLs in
users' actual email, and that its most frequent mistake is labeling spam-related
URLs as phishing.
TF-IDF is an algorithm often used in information retrieval and text mining. TF-IDF yields a weight that measures how important a word is to a document in a
corpus. The importance increases proportionally to the number of times a word
appears in the document, but is offset by the frequency of the word in the
corpus. The term frequency (TF) is simply the number of times a given term
appears in a specific document. This count is usually normalized to prevent a

bias towards longer documents to give a measure of the importance of the term
within the particular document. The inverse document frequency (IDF) is a
measure of the general importance of the term. Roughly speaking, the IDF
measures how common a term is across an entire collection of documents.
Thus, a term has a high TF-IDF weight when it has a high term frequency in a given document and a low document frequency across the whole collection of documents.
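A small Java sketch of this weighting is shown below; the whitespace tokenization and the +1 term in the IDF denominator (to avoid division by zero) are our own choices rather than details taken from CANTINA.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of TF-IDF: TF is the normalized count of a term in one document, IDF the log of
// the collection size divided by the number of documents containing the term.
public class TfIdf {

    static Map<String, Integer> termCounts(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : document.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    static double tfIdf(String term, String document, List<String> corpus) {
        Map<String, Integer> counts = termCounts(document);
        int totalTerms = counts.values().stream().mapToInt(Integer::intValue).sum();
        double tf = totalTerms == 0 ? 0.0
                : counts.getOrDefault(term.toLowerCase(), 0) / (double) totalTerms;

        long docsWithTerm = corpus.stream()
                .filter(d -> termCounts(d).containsKey(term.toLowerCase()))
                .count();
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));   // +1 avoids division by zero
        return tf * idf;
    }
}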
CANTINA works as follows:

Given a web page, calculate the TF-IDF scores of each term on that web
page. Generate a lexical signature by taking the five terms with highest
TF-IDF weights.

Feed this lexical signature to a search engine, which in our case is Google.

If the domain name of the current web page matches the domain name of
the N top search results, we consider it to be a legitimate web site.
Otherwise, we consider it a phishing site.

Our technique makes the assumption that Google indexes the vast majority of legitimate web sites, and that legitimate sites will be ranked higher than phishing sites. Combined, these assumptions suggest that a phishing scam will rarely, if ever, be highly ranked. At the end of this paper, however, we discuss some ways of possibly subverting CANTINA.
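The decision procedure above can be sketched as follows. The SearchEngine interface and its searchTopDomains method are hypothetical stand-ins for a real query to a search engine such as Google; only the five-term signature and the domain-matching logic follow the description above.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of the CANTINA-style decision: take the five highest-weighted terms as the
// lexical signature, search for them, and check whether the page's domain is among the
// domains of the top N results.
public class CantinaCheck {

    // Hypothetical hook: domains of the top N results for the given query terms.
    interface SearchEngine { List<String> searchTopDomains(List<String> queryTerms, int n); }

    static boolean isLegitimate(String pageDomain, Map<String, Double> tfIdfByTerm,
                                SearchEngine engine, int topN) {
        // Lexical signature: the five terms with the highest TF-IDF weights.
        List<String> signature = tfIdfByTerm.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(5)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());

        // Legitimate if the page's domain appears among the top N search results.
        return engine.searchTopDomains(signature, topN).contains(pageDomain);
    }
}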
Age of Domain
This heuristic checks the age of the domain name. Many phishing sites have
domains that are registered only a few days before phishing emails are sent
out. We use a WHOIS search to implement this heuristic. This heuristic

measures the number of months from when the domain name was first
registered. If the domain has been registered for longer than 12 months, the heuristic returns +1, deeming the page legitimate; otherwise it returns -1, deeming it phishing. If the WHOIS server cannot find the domain, the heuristic simply returns -1, deeming it a phishing page. The Netcraft and Spoof-Guard toolbars use a similar heuristic based on the time since a
domain name was registered. Note that this heuristic does not account for
phishing sites based on existing web sites where criminals have broken into
the web server, nor does it account for phishing sites hosted on otherwise
legitimate domains, for example in space provided by an ISP for personal
homepages.
Known Images
This heuristic checks whether a page contains inconsistent well-known
logos. For example, if a page contains eBay logos but is not on an eBay domain, then this heuristic labels the site as a probable
phishing page. Currently we store nine popular logos locally, including
eBay, PayPal, Citibank, Bank of America, Fifth Third Bank, Barclays Bank,
ANZ Bank, Chase Bank, and Wells Fargo Bank. Eight of these nine
legitimate sites are included in the PhishTank.com list of Top 10 Identified
Targets. A similar heuristic is used by the Spoof-Guard toolbar.
Suspicious URL
This heuristic checks if a page's URL contains an at sign (@) or a dash (-) in the domain name. An @ symbol in a URL causes the string to the left to be
disregarded, with the string on the right treated as the actual URL for

retrieving the page. Combined with the limited size of the browser address
bar, this makes it possible to write URLs that appear legitimate within the
address bar, but actually cause the browser to retrieve a different page. This
heuristic is used by Mozilla Firefox. Dashes are also rarely used by
legitimate sites, so we use this as another heuristic. Spoof-Guard checks for
both at symbols and dashes in URLs.
Suspicious Links
This heuristic applies the URL check above to all the links on the page. If
any link on a page fails this URL check, then the page is labeled as a
possible phishing scam. This heuristic is also used by Spoof-Guard.
IP Address
This heuristic checks if a page's domain name is an IP address. This
heuristic is also used in PILFER.
Dots in URL
This heuristic checks the number of dots in a page's URL. We found that
phishing pages tend to use many dots in their URLs but legitimate sites
usually do not. Currently, this heuristic labels a page as phish if there are 5
or more dots. This heuristic is also used in PILFER.
Forms
This heuristic checks if a page contains any HTML text entry forms asking
for personal data from people, such as password and credit card number. We
scan the HTML for <input> tags that accept text and are accompanied by labels such as "credit card" and "password".
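The URL-based heuristics above (Suspicious URL, IP Address, and Dots in URL) can be sketched in Java as follows; the +1/-1 scoring convention mirrors the Age of Domain description, and the example URL in main is invented for illustration.

import java.net.URI;
import java.util.regex.Pattern;

// Sketch of the Suspicious URL, IP Address, and Dots in URL heuristics, returning +1 for
// a likely legitimate page and -1 for a likely phishing page.
public class UrlHeuristics {

    static final Pattern IP_HOST = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    // '@' hides the real target to its right; '-' in the domain is rare on legitimate sites.
    static int suspiciousUrl(String url, String host) {
        return (url.contains("@") || host.contains("-")) ? -1 : +1;
    }

    static int ipAddress(String host) {
        return IP_HOST.matcher(host).matches() ? -1 : +1;
    }

    static int dotsInUrl(String url) {
        long dots = url.chars().filter(c -> c == '.').count();
        return dots >= 5 ? -1 : +1;   // 5 or more dots is treated as phishing
    }

    public static void main(String[] args) throws Exception {
        URI uri = new URI("http://192.168.0.1/update.account.login.example.com/index.html");
        String host = uri.getHost();
        System.out.println(suspiciousUrl(uri.toString(), host) + " "
                + ipAddress(host) + " " + dotsInUrl(uri.toString()));
    }
}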

Software Description
Java
Java is a programming language originally developed by James Gosling at Sun
Microsystems (now a subsidiary of Oracle Corporation) and released in 1995 as a
core component of Sun Microsystems' Java platform. The language derives much
of its syntax from C and C++ but has a simpler object model and fewer low-level
facilities. Java applications are typically compiled to byte code (class file) that can
run on any Java Virtual Machine (JVM) regardless of computer architecture. Java
is a general-purpose, concurrent, class-based, object-oriented language that is
specifically designed to have as few implementation dependencies as possible. It is
intended to let application developers "write once, run anywhere." Java is currently
one of the most popular programming languages in use, particularly for client-server web applications.
The original and reference implementation Java compilers, virtual machines, and
class libraries were developed by Sun from 1995. As of May 2007, in compliance
with the specifications of the Java Community Process, Sun relicensed most of its
Java technologies under the GNU General Public License. Others have also
developed alternative implementations of these Sun technologies, such as the GNU
Compiler for Java and GNU Classpath.
Java Platform:
One characteristic of Java is portability, which means that computer programs
written in the Java language must run similarly on any hardware/operating-system
platform. This is achieved by compiling the Java language code to an intermediate
representation called Java byte code, instead of directly to platform-specific

machine code. Java byte code instructions are analogous to machine code, but are
intended to be interpreted by a virtual machine (VM) written specifically for the
host hardware. End-users commonly use a Java Runtime Environment (JRE)
installed on their own machine for standalone Java applications, or in a Web
browser for Java applets.
Standardized libraries provide a generic way to access host-specific features such
as graphics, threading, and networking.
A major benefit of using byte code is portability. However, the overhead of interpretation means that interpreted programs almost always run more slowly than programs compiled to native executables would. Just-in-time compilers, which compile byte code to machine code at runtime, were therefore introduced at an early stage.
Just as application servers such as GlassFish provide lifecycle services to web applications, the NetBeans runtime container provides them to Swing applications. Application servers understand how to compose web modules, EJB modules, and so on into a single web application, just as the NetBeans runtime container understands how to compose NetBeans modules into a single Swing application.
Modularity offers a solution to "JAR hell" by letting developers organize their code
into strictly separated and versioned modules. Only those that have explicitly
declared dependencies on each other are able to use code from each other's
exposed packages. This strict organization is of particular relevance to large
applications developed by engineers in distributed environments, during the
development as well as the maintenance of their shared codebase.
End users of the application benefit too because they are able to install modules
into their running applications, since modularity makes them pluggable. In short,

the NetBeans runtime container is an execution environment that understands what a module is, handles its lifecycle, and enables it to interact with other modules in the same application.
Registration of various objects, files, and hints into the layer is central to the way NetBeans-based applications handle communication between modules. This page summarizes the list of such extension points defined by modules with an API.
Context menu actions are read from the layer folder Loaders/text/x-ant+xml/Actions.
The Keymaps folder contains subfolders for individual keymaps (Emacs, JBuilder, NetBeans). The name of a keymap can be localized; use the "SystemFileSystem.localizingBundle" attribute of your folder for this purpose. An individual keymap folder contains shadows to actions. The shortcut is mapped to the name of the file. The Emacs shortcut format is used, and multikeys are separated by space characters ("C-X P" means Ctrl+X followed by P). The "currentKeymap" property of the "Keymaps" folder contains the original (not localized) name of the current keymap.
The Shortcuts folder contains the registration of shortcuts. It is supported for backward compatibility purposes only. All new shortcuts should be registered in the "Keymaps/NetBeans" folder. Shortcuts installed in the Shortcuts folder will be added to all keymaps if there is no conflict. This means that if the same shortcut is mapped to different actions in the Shortcuts folder and the current keymap folder (like Keymaps/NetBeans), the Shortcuts folder mapping will be ignored.
* DatabaseExplorerLayerAPI in Database Explorer
* Loaders-text-dbschema-Actions in Database Explorer
* Loaders-text-sql-Actions in Database Explorer

* PluginRegistration in Java EE Server Registry


XML layer contract for registration of server plug-ins and instances that
implement optional capabilities of server plug-ins. Plug-ins with server-specific
deployment descriptor files should declare the full list in XML layer as specified in
the document plugin-layer-file.html from the above link.
"Projects/org-netbeans-modules-java-j2seproject/Customizer" folder's content
is used to construct the project's customizer. It's content is expected to be
ProjectCustomizer.CompositeCategoryProvider instances. The lookup passed to
the

panels

contains

an

instance

of

Project

and

org.netbeans.modules.java.j2seproject.ui.customizer.J2SEProjectProperties Please
note that the latter is not part of any public APIs and you need implementation
dependency to make use of it.
"Projects/org-netbeans-modules-java-j2seproject/Nodes" folder's content is
used to construct the project's child nodes. It's content is expected to be Node
Factory instances.
"Projects/org-netbeans-modules-java-j2seproject/Lookup" folder's content is
used to construct the project's additional lookup. It's content is expected to be
Lookup Provider instances. J2SE project provides Lookup Mergers for Sources,
Privileged Templates and Recommended Templates. Implementations added by 3rd
parties will be merged into a single instance in the project's lookup.
Use the Options Dialog folder for the registration of custom top-level options panels. Register your implementation of OptionsCategory there (a *.instance file). The standard file system sorting mechanism is used.
Use the Options Dialog/Advanced folder for the registration of custom panels in the Miscellaneous panel. Register your implementation of AdvancedCategory there (a *.instance file). The standard file system sorting mechanism is used.
Use the Options Export/<My Category> folder for the registration of items for the export/import of options. Registration in layers looks as follows.
Source files must be named after the public class they contain, with the suffix .java appended, for example, HelloWorldApp.java. A source file must first be compiled into byte code by a Java compiler, producing a file named HelloWorldApp.class; only then can it be executed, or 'launched'. A Java source file may contain only one public class, but it can contain multiple classes with non-public access and any number of public inner classes.
A class that is not declared public may be stored in any .java file. The compiler will
generate a class file for each class defined in the source file. The name of the class
file is the name of the class, with .class appended. For class file generation,
anonymous classes are treated as if their name were the concatenation of the name
of their enclosing class, a $, and an integer.
The keyword public denotes that a method can be called from code in other
classes, or that a class may be used by classes outside the class hierarchy. The class
hierarchy is related to the name of the directory in which the .java file is located.
The keyword static in front of a method indicates a static method, which is
associated only with the class and not with any specific instance of that class. Only
static methods can be invoked without a reference to an object. Static methods
cannot access any class members that are not also static.

The keyword void indicates that the main method does not return any value to the
caller. If a Java program is to exit with an error code, it must call System.exit()
explicitly.
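
The points above can be illustrated with a minimal, self-contained example; the class name HelloWorldApp follows the naming discussion above, and the non-zero exit code path is included only for illustration.

// HelloWorldApp.java - the public class name must match the file name.
public class HelloWorldApp {

    // main is static: the launcher calls it without creating an instance.
    // It is declared void: an error code must be reported via System.exit().
    public static void main(String[] args) {
        System.out.println("Hello, World!"); // println appends a newline

        if (args.length == 0) {
            System.exit(1); // exit with a non-zero error code when no arguments are given
        }
    }
}

Compiling this file with javac HelloWorldApp.java produces HelloWorldApp.class, which the launcher can then run with java HelloWorldApp.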
The method name "main" is not a keyword in the Java language. It is simply the
name of the method the Java launcher calls to pass control to the program. Java
classes that run in managed environments such as applets and Enterprise
JavaBeans do not use or need a main() method. A Java program may contain
multiple classes that have main methods, which means that the VM needs to be
explicitly told which class to launch from.
The main method must accept an array of String objects. By convention, it is
referenced as args although any other legal identifier name can be used. Since Java
5, the main method can also use variable arguments, in the form of public static
void main(String... args), allowing the main method to be invoked with an arbitrary
number of String arguments. The effect of this alternate declaration is semantically
identical (the args parameter is still an array of String objects), but allows an
alternative syntax for creating and passing the array.
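
As a brief illustration of the variable-arguments form (the class name EchoArgs is made up for this example):

// EchoArgs.java - demonstrates the varargs form of main (Java 5 and later).
public class EchoArgs {

    // Semantically identical to main(String[] args): args is still a String array.
    public static void main(String... args) {
        for (String arg : args) {
            System.out.println(arg); // print each command-line argument on its own line
        }
    }
}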
The Java launcher launches Java by loading a given class (specified on the
command line or as an attribute in a JAR) and starting its public static void
main(String[]) method. Stand-alone programs must declare this method explicitly.
The String[] args parameter is an array of String objects containing any arguments
passed to the class. The parameters to main are often passed by means of a
command line.
Printing is part of the Java standard library: the System class defines a public static field called out. The out object is an instance of the PrintStream class and provides many methods for printing data to standard output, including println(String), which also appends a newline to the passed string.
Java: A High-Level Language
Java is a high-level programming language developed by Sun Microsystems. It was originally called Oak and was designed for handheld devices and set-top boxes. Oak was unsuccessful, so in 1995 Sun changed the name to Java and modified the language to take advantage of the burgeoning World Wide Web.
Java is an object-oriented language similar to C++, but simplified to eliminate
language features that cause common programming errors. Java source code files
(files with a .java extension) are compiled into a format called byte code (files with
a .class extension), which can then be executed by a Java interpreter. Compiled
Java code can run on most computers because Java interpreters and runtime
environments, known as Java Virtual Machines (VMs), exist for most operating
systems, including UNIX, the Macintosh OS, and Windows. Byte code can also be
converted directly into machine language instructions by a just-in-time compiler
(JIT).
Java is a general purpose programming language with a number of features that
make the language well suited for use on the World Wide Web. Small Java
applications are called Java applets and can be downloaded from a Web server and
run on your computer by a Java-compatible Web browser, such as Netscape
Navigator or Microsoft Internet Explorer.
Object-oriented software development matured significantly during the past several years. The convergence of object-oriented modeling techniques and notations, the development of object-oriented frameworks and design patterns, and the evolution of object-oriented programming languages have been essential in the progression of this technology.
Object-Oriented Software Development using Java: Principles, Patterns, and Frameworks has a strongly applied focus that develops skills in designing software, particularly in writing well-designed, medium-sized object-oriented programs. It provides broad and coherent coverage of object-oriented technology, including object-oriented modeling using the Unified Modeling Language (UML), object-oriented design using design patterns, and object-oriented programming using Java.
NetBeans
The NetBeans Platform is a reusable framework for simplifying the development
of Java Swing desktop applications. The NetBeans IDE bundle for Java SE
contains what is needed to start developing NetBeans plug-ins and NetBeans
Platform based applications; no additional SDK is required.
Applications can install modules dynamically. Any application can include the Update Center module to allow users of the application to download digitally signed upgrades and new features directly into the running application. Reinstalling an upgrade or a new release does not force users to download the entire application again.

The platform offers reusable services common to desktop applications, allowing developers to focus on the logic specific to their application. Among the features of the platform are:
* User interface management (e.g. menus and toolbars)
* User settings management
* Storage management (saving and loading any kind of data)
* Window management
* Wizard framework (supports step-by-step dialogs)
* NetBeans Visual Library
* Integrated Development Tools
Wamp Server
WAMPs are packages of independently-created programs installed on computers
that use a Microsoft Windows operating system. WAMP is an acronym formed
from the initials of the operating system Microsoft Windows and the principal
components of the package: Apache, MySQL, and one of PHP, Perl, or Python.
Apache is a web server. MySQL is an open-source database. PHP is a scripting
language that can manipulate information held in a database and generate web
pages dynamically each time content is requested by a browser. Other programs
may also be included in a package, such as phpMyAdmin which provides a
graphical user interface for the MySQL database manager, or the alternative
scripting languages Python or Perl. Equivalent packages are MAMP (for the Apple
Mac) and LAMP (for the Linux operating system).

System Architecture

Modules
Loading web page training set.
Textual and visual content feature extraction.
Text and image classification.
Fusing of detected results.
Comparison of detected fusion results.

Module Description
Loading web page training set
* Loading the phishing web pages into the database.
* Loading the protected web pages into the database.

Textual and visual content feature extraction
* Extraction of the textual content of a web page using extraction algorithms.
* Extraction of the visual content of a web page using extraction algorithms.
* The textual feature extraction is done using the HTML tag removal, stop-words removal, and stemming algorithms (see the sketch below).
* The visual feature extraction is done by extracting the colors, raster data, and icons of an image.
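
The following is a minimal sketch of the textual preprocessing step described above, assuming simple regex-based tag removal, a small illustrative stop-word list, and a crude suffix-stripping stemmer; the class and method names are invented for this example, and a production system would use a full Porter stemmer.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the textual feature extraction pipeline: HTML tag removal,
// stop-word removal, and rough suffix-stripping stemming.
public class TextFeatureExtractor {

    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "the", "of", "and", "to", "in", "is"));

    public static List<String> extractTerms(String html) {
        // 1. Remove HTML tags with a simple regular expression.
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase();

        // 2. Tokenize on non-letter characters and drop stop words.
        List<String> terms = new ArrayList<String>();
        for (String token : text.split("[^a-z]+")) {
            if (token.length() == 0 || STOP_WORDS.contains(token)) {
                continue;
            }
            terms.add(stem(token));
        }
        return terms;
    }

    // 3. Very rough stemming: strip a few common suffixes.
    // A real implementation would use the Porter stemmer cited in the references.
    private static String stem(String word) {
        String[] suffixes = {"ing", "ed", "es", "s"};
        for (String suffix : suffixes) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(extractTerms("<p>Please verify your banking account details</p>"));
    }
}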

Text and image classification
* Classification of the extracted text using the SVM algorithm.
* Each word is classified by the SVM algorithm, and the average weight of the words is computed.
* Classification of the extracted images using the SVM algorithm.
* An image is classified by computing its visual similarity with the SVM-based image classification algorithm (a sketch of the decision function is given below).
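
As an illustration of the classification step, a linear SVM decision function over an extracted feature vector can be sketched as follows; this is a generic sketch rather than the exact classifier of the paper, and the weight vector and bias are assumed to come from an already trained model.

// Sketch of a linear SVM decision function: score(x) = w . x + b.
// The page is labelled phishing when the score exceeds the matching threshold.
public class LinearSvmClassifier {

    private final double[] weights; // learned weight vector w
    private final double bias;      // learned bias term b

    public LinearSvmClassifier(double[] weights, double bias) {
        this.weights = weights;
        this.bias = bias;
    }

    // Returns the raw SVM score for a feature vector x.
    public double score(double[] x) {
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    // Classifies the page: true = phishing, false = legitimate.
    public boolean isPhishing(double[] x, double threshold) {
        return score(x) > threshold;
    }
}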

Fusing of detected results
* The classified text and image results are merged.
* The fusion algorithm combines the outputs of the text classifier and the image classifier into a single result (an illustrative sketch is given below).
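
One simple way to realize such a fusion, shown here only as an illustration of combining the two classifier outputs (the actual system derives the combination from the SVM-based probabilistic model), is a weighted average of the two probabilities.

// Illustrative fusion of the text and image classifier outputs.
// pText and pImage are the probabilities that the page is phishing
// according to the text classifier and the image classifier, respectively.
public class ScoreFusion {

    private final double textWeight; // relative trust in the text classifier (0..1)

    public ScoreFusion(double textWeight) {
        this.textWeight = textWeight;
    }

    // Returns the fused probability that the page is phishing.
    public double fuse(double pText, double pImage) {
        return textWeight * pText + (1.0 - textWeight) * pImage;
    }

    public static void main(String[] args) {
        ScoreFusion fusion = new ScoreFusion(0.6);
        System.out.println("Fused score: " + fusion.fuse(0.9, 0.7)); // approximately 0.82
    }
}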

Comparison of detected fusion results
* The detected fusion result is compared with the original (protected) web page.
* The posterior probability is computed from the similarity.
* Using this probability, the fusion results for false (phishing) and true (legitimate) web pages are compared (see the decision sketch below).
* The detected result is shown to the user.
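
The final decision step can be sketched as a comparison of the fused posterior probability against the matching threshold; the threshold value 0.5 below is only a placeholder, since in the actual system the threshold is estimated by the SVM-based probabilistic model.

// Sketch of the final decision step: compare the fused posterior
// probability with the matching threshold and report the result.
public class PhishingDecision {

    public static String decide(double fusedPosterior, double threshold) {
        return fusedPosterior >= threshold
                ? "Phishing page (suspicious copy of the protected page)"
                : "Legitimate page";
    }

    public static void main(String[] args) {
        double threshold = 0.5; // placeholder; estimated by the probabilistic model in practice
        System.out.println(decide(0.82, threshold));
    }
}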

System Requirements

Software Requirements
Operating System : Windows XP
Language         : Core Java
Version          : JDK 1.5
IDE              : NetBeans 6.2
Database         : MySQL

Hardware Requirements
Processor        : Pentium IV
Clock Speed      : 2.7 GHz
RAM Capacity     : 1 GB
Hard Disk Drive  : 200 GB

Conclusion
A new content-based anti-phishing system has been thoroughly developed. In this
system, we presented a new framework to solve the anti-phishing problem. The
new features of this framework can be represented by a text classifier, an image
classifier, and a fusion algorithm. Based on the textual content, the text classifier is
able to classify a given web page into corresponding categories as phishing or
normal. This text classifier was modeled using the SVM rule. Based on the visual content,
the image classifier, which relies on SVM, is able to calculate the visual similarity
between the given web page and the protected web page efficiently. The matching
threshold used in both text classifier and image classifier is effectively estimated
by using a probabilistic model derived from the SVM theory. A novel data fusion
model using the SVM theory was developed and the corresponding fusion
algorithm presented. This data fusion framework enables us to directly incorporate
the multiple results produced by different classifiers. This fusion method provides
insights for other data fusion applications. More importantly, it is worth noting that
our content-based model can be easily embedded into current industrial anti-phishing systems.

Future Enhancement
Our future work will include:
Adding more features to the content representation in our current model.
Investigating incremental learning models to solve the knowledge-updating problem in the current probabilistic model.
Adding more data sets with the textual and visual content of both true and false web pages.

References

A. Emigh. (2005, Oct.). Online Identity Theft: Phishing Technology, Chokepoints and Countermeasures. Radix Laboratories Inc., Eau Claire, WI [Online]. Available: http://www.antiphishing.org/phisgingdhs-report.pdf
L. James, Phishing Exposed. Rockland, MA: Syngress, 2005.
A. Y. Fu, W. Liu, and X. Deng, "Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD)," IEEE Trans. Depend. Secure Comput., vol. 3, no. 4, pp. 301–311, Oct.–Dec. 2006.
Global Phishing Survey: Domain Name Use and Trends in 1H2009. Anti-Phishing Working Group, Cambridge, MA [Online]. Available: http://www.antiphishing.org
N. Chou, R. Ledesma, Y. Teraguchi, and D. Boneh, "Client-side defense against web-based identity theft," in Proc. 11th Annu. Netw. Distribut. Syst. Secur. Symp., San Diego, CA, Feb. 2005, pp. 119–128.
M. Wu, R. C. Miller, and S. L. Garfinkel, "Do security toolbars actually prevent phishing attacks?" in Proc. SIGCHI Conf. Human Factors Comput. Syst., Montreal, QC, Canada, Apr. 2006, pp. 601–610.
Y. Zhang, S. Egelman, L. Cranor, and J. Hong, "Phinding phish: Evaluating anti-phishing tools," in Proc. 14th Annu. Netw. Distribut. Syst. Secur. Symp., San Diego, CA, Feb. 2007, pp. 1–16.
L. Li and M. Helenius, "Usability evaluation of anti-phishing toolbars," J. Comput. Virol., vol. 3, no. 2, pp. 163–184, 2007.
M. Aburrous, M. Hossain, F. Thabatah, and K. Dahal, "Intelligent phishing website detection system using fuzzy techniques," in Proc. 3rd Int. Conf. Inf. Commun. Technol., Damascus, VA, Apr. 2008, pp. 1–6.
R. Dhamija and J. D. Tygar, "The battle against phishing: Dynamic security skins," in Proc. Symp. Usable Privacy Secur., Pittsburgh, PA, Jul. 2005, pp. 77–88.
M. Wu, R. C. Miller, and G. Little, "Web wallet: Preventing phishing attacks by revealing user intentions," in Proc. 2nd Symp. Usable Privacy Secur., Pittsburgh, PA, Jul. 2006, pp. 102–113.
E. Gabber, P. B. Gibbons, Y. Matias, and A. J. Mayer, "How to make personalized web browsing simple, secure, and anonymous," in Proc. 1st Int. Conf. Finan. Cryptograp., Anguilla, British West Indies, Feb. 1997, pp. 17–32.
J. A. Halderman, B. Waters, and E. W. Felten, "A convenient method for securely managing passwords," in Proc. 14th Int. Conf. World Wide Web, Chiba, Japan, May 2005, pp. 471–479.
W. Liu, N. Fang, X. Quan, B. Qiu, and G. Liu, "Discovering phishing target based on semantic link network," Future Generat. Comput. Syst., vol. 26, no. 3, pp. 381–388, Mar. 2010.
Y. Zhang, J. Hong, and L. Cranor, "CANTINA: A content-based approach to detecting phishing web sites," in Proc. 16th Int. Conf. World Wide Web, Banff, AB, Canada, May 2007, pp. 639–648.
T. A. Phelps and R. Wilensky, "Robust hyperlinks and locations," D-Lib Mag., vol. 6, nos. 7–8, Jul.–Aug. 2000.
P. Likarish, E. Jung, D. Dunbar, T. E. Hansen, and J. P. Hourcade, "B-APT: Bayesian anti-phishing toolbar," in Proc. IEEE Int. Conf. Commun., Beijing, China, May 2008, pp. 1745–1749.
W. Liu, X. Deng, G. Huang, and A. Y. Fu, "An antiphishing strategy based on visual similarity assessment," IEEE Internet Comput., vol. 10, no. 2, pp. 58–65, Mar.–Apr. 2006.
W. Liu, G. Huang, X. Liu, M. Zhang, and X. Deng, "Detection of phishing web pages based on visual similarity," in Proc. 14th Int. Conf. World Wide Web, Chiba, Japan, May 2005, pp. 1060–1061.
W. Liu, G. Huang, X. Liu, M. Zhang, and X. Deng, "Phishing web page detection," in Proc. 8th Int. Conf. Documents Anal. Recognit., Seoul, Korea, Aug. 2005, pp. 560–564.
V. Apparao, S. Byrne, M. Champion, S. Isaacs, I. Jacobs, A. Le Hors, G. Nicol, J. Robie, R. Sutor, C. Wilson, and L. Wood. (1998, Oct.). Document Object Model Level 1 Specification [Online]. Available: http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001
Y. Rubner, C. Tomasi, and L. J. Guibas, "The earth mover's distance as a metric for image retrieval," Int. J. Comput. Vis., vol. 40, no. 2, pp. 99–121, 2000.
M. Chandrasekaran, K. Narayanan, and S. Upadhyaya, "Phishing email detection based on structural properties," in Proc. 9th Annu. NYS Cyber Secur. Conf., New York, Jun. 2006, pp. 2–8.
I. Fette, N. Sadeh, and A. Tomasic, "Learning to detect phishing emails," in Proc. 16th Int. Conf. World Wide Web, Banff, AB, Canada, May 2007, pp. 649–656.
S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair, "A comparison of machine learning techniques for phishing detection," in Proc. Anti-Phish. Work. Group's 2nd Annu. eCrime Res. Summit, Pittsburgh, PA, Oct. 2007, pp. 60–69.
R. Basnet, S. Mukkamala, and A. H. Sung, "Detection of phishing attacks: A machine learning approach," in Soft Computing Applications in Industry, P. Bhanu, Ed. Berlin, Germany: Springer-Verlag, 2008.
A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in Proc. AAAI Workshop Learn. Text Categor., Madison, WI, Jul. 1998, pp. 41–48.
W. Hu, O. Wu, Z. Chen, and S. Maybank, "Recognition of pornographic web pages by classifying texts and images," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1019–1034, Jun. 2007.
S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," in Proc. 7th Int. Conf. World Wide Web, Brisbane, QLD, Australia, Apr. 1998, pp. 107–117.
M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.
C. R. John, The Image Processing Handbook. Boca Raton, FL: CRC Press, 1995.
F. Nah, "A study on tolerable waiting time: How long are web users willing to wait?" in Proc. 9th Amer. Conf. Inf. Syst., Tampa, FL, Aug. 2003, p. 285.
T. S. Chua, K. L. Tan, and B. C. Ooi, "Fast signature-based color-spatial image retrieval," in Proc. IEEE Int. Conf. Multimedia Comput. Syst., Ottawa, ON, Canada, Jun. 1997, pp. 362–369.
T. W. S. Chow, M. K. M. Rahman, and S. Wu, "Content based image retrieval by using tree-structured regional features," Neurocomputing, vol. 70, nos. 4–6, pp. 1040–1050, 2007.
Y. Liu, Y. Liu, and K. C. C. Chan, "Tensor distance based multilinear locality-preserved maximum information embedding," IEEE Trans. Neural Netw., vol. 21, no. 11, pp. 1848–1854, Nov. 2010.
