Abstract
A novel framework using an SVM (Support Vector Machine) approach for content-based phishing web page detection is presented. Our model takes into account
textual and visual contents to measure the similarity between the protected web
page and suspicious web pages. A text classifier, an image classifier, and an
algorithm fusing the results from classifiers are introduced. An outstanding feature
of this paper is the exploration of an SVM model to estimate the matching threshold.
This threshold is required in the classifier to determine the class of a web page,
i.e., whether the web page is phishing or not. In the text classifier, the
SVM decision rule is used to calculate the probability that a web page is phishing. In the
image classifier, the earth mover's distance is employed to measure the visual
similarity, and our SVM model is designed to determine the threshold. In the data
fusion algorithm, the SVM theory is used to synthesize the classification results
from textual and visual content. The effectiveness of our proposed approach was
examined in a large-scale dataset collected from real phishing cases. Experimental
results demonstrated that the text classifier and the image classifier we designed
deliver promising results, the fusion algorithm outperforms either of the individual
classifiers, and our model can be adapted to different phishing cases.
Introduction
Malicious people, also known as phishers, create phishing web pages, i.e.,
forgeries of real web pages, to steal individuals' personal information such as bank
account numbers, passwords, credit card numbers, and other financial data. Unwary online
users can be easily deceived by these phishing web pages because of their high
similarities to the real ones. The Anti-Phishing Working Group reported that there
were at least 55,698 phishing attacks between January 1, 2009, and June 30, 2009.
The latest statistics show that phishing remains a major criminal activity involving
great losses of money and personal data.
Automatically detecting phishing web pages has attracted much attention, ranging
from security and software providers and financial institutions to academic researchers.
Methods for detecting phishing web pages can be classified into industrial toolbar
based anti-phishing, user-interface-based anti-phishing, and web page content-based
anti-phishing. To date, techniques for phishing detection used by the industry
mainly include authentication, filtering, attack tracing and analyzing, phishing
report generating, and network law enforcement. These anti-phishing internet
services are built into e-mail servers and web browsers and available as web
browser toolbars.
These industrial services, however, do not effectively detect all phishing attacks.
Wu et al. conducted a thorough study and analysis of the effectiveness of anti-phishing
toolbars, covering three security toolbars and other commonly used
browser security indicators. The study indicates that all examined toolbars were
ineffective at preventing phishing attacks. Reports show that 20 out
of 30 subjects were spoofed by at least one phishing attack, 85% of the spoofed
subjects indicated that the websites looked legitimate or exactly the same as ones
they had visited before, and 40% of the spoofed subjects were tricked by poorly designed web
sites. Cranor et al. performed another study on an evaluation of 10 anti-phishing
tools. They indicated that only one tool could consistently detect more than 60% of
phishing web sites without a high rate of false positives, whilst four tools were not
able to recognize 50% of the tested web sites. Apart from these studies on the
effectiveness of anti-phishing toolbars, Li and Helenius investigated usability of
five typical anti-phishing toolbars.
They found that the main user interface of the toolbar, warnings, and help system
are the three basic components that should be well designed. They also found that
it is beneficial to apply whitelist and blacklist methods together. Also, because
the quality of online traffic varies, client-side anti-phishing
applications should not rely solely on the Internet. Recently, Aburrous et al. developed a
resilient model by using fuzzy logic to quantify and qualify the website phishing
characteristics with a layered structure and to study the influence of the phishing
characteristics at different layers on the final phishing website rate.
Content-based anti-phishing, which refers to detection using the features of web
pages themselves, draws on surface-level characteristics, textual content, and visual content.
We clarify that the content of a web page discussed here includes all the
information of a web page, such as its domain name, URL, hyperlinks, terms,
images, and embedded forms. Surface-level characteristics have
been commonly used by industrial toolbars to detect phishing. For example, the
Spoof-Guard toolbar inspects the age of the domain, well-known logos, URLs,
and links to acquire the characteristics of phishing web pages. Liu et al. proposed
the use of semantic link network (SLN) to automatically identify the phishing
target of a given webpage.
The method works by first finding the associated web pages of the given webpage
and then constructing a SLN from all those web pages. A mechanism of reasoning
on the SLN is exploited to identify the phishing target. Zhang et al. developed a
content-based approach, i.e., Carnegie Mellon Anti-phishing and Network Analysis
Tool, for anti-phishing by employing the idea of robust hyperlinks [16]. Given a
web page, this method first calculates the TF-IDF score of each term, a weighting
usually used in information retrieval, generates a lexical signature by selecting a
few terms, supplies this signature to a search engine, and then matches the domain
name of the current web page against several top search results to evaluate whether
the current web page is legitimate. Another content-based technique, BAPT, is designed
to identify phishing websites by using an open-source Bayesian filter on the basis
of tokens which are extracted by a document object module (DOM) analyzer.
The concept of visual approach to phishing detection was first introduced by Liu et
al. This approach, which is oriented by the DOM-based visual similarity of web
pages, first decomposes the web pages into salient block regions. The visual
similarity between two web pages is then evaluated by three metrics, namely, block
level similarity, layout similarity, and overall style similarity, which are based on
the matching of the salient block regions. Fu et al. followed the overall strategy,
but proposed another method to calculate the visual similarity of web pages. They
first converted HTML web pages into images and then employed the earth mover's
distance method to calculate the similarity of the images. This approach only
investigates phishing detection at the pixel level of web pages without considering
the text level. Apart from these approaches to detect phishing web pages, content-based
methods for detecting phishing emails have also been widely studied,
especially using machine learning techniques.
Objective
Scope Of Project
The main scope of the project is as follows:
To detect whether a website is a phishing website.
To detect whether a website has been hacked by an attacker.
To compare the genuine and attacked websites by examining the fusion results.
Project Description
Existing System
Methods for detecting phishing web pages can be classified into industrial
toolbar-based anti-phishing, user-interface-based anti-phishing, and web page
content-based anti-phishing. Techniques for phishing detection used by the
industry mainly include authentication, filtering, attack tracing and analyzing,
phishing report generating, and network law enforcement. Toolbar-based
anti-phishing guides the user to interact with trusted websites; security
toolbars and browser security toolbars are used in the existing system. These
anti-phishing internet services are built into e-mail servers and web browsers
and available as web browser toolbars.
Disadvantages
Not all phishing attacks are detected by the existing detection techniques.
The toolbar technique is an ineffective way to prevent phishing attacks on
web pages.
Heavy online traffic degrades the quality of web pages and their applications.
The existing approach investigates phishing detection only at the pixel level
of web pages, without considering the text level.
Existing systems such as CANTINA and the toolbar-based technique are
difficult to implement.
Not all phishing web pages are detected by the CANTINA and toolbar-based
systems.
Proposed System
The proposed system first extracts words from the protected web page and stems
them to obtain the root forms of the original words. We store the stemmed words
to construct the vocabulary. Given a web page, we then form a histogram vector,
where each of the n components represents the frequency of one vocabulary term.
We explain three points here.
1) We do not extract words from all the web pages in a dataset to construct the
vocabulary, because phishers usually only use the words from a targeted web page
to scam unwary users.
2) For the sake of simplicity, we do not use any feature extraction algorithms in
the process of vocabulary construction.
3) We do not take the semantic associations of web pages into account, because
the sizes of most phishing web pages are small.
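The vocabulary construction and histogram step above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the class and method names are hypothetical, and the trivial suffix-stripping stemmer stands in for a real stemming algorithm.

```java
import java.util.*;

// Sketch of vocabulary construction and term-frequency histograms.
// All names are illustrative, not taken from the original system.
public class TermHistogram {

    // A trivial suffix-stripping stemmer standing in for a real one.
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Build the vocabulary from the protected page only, as point 1) requires.
    public static List<String> buildVocabulary(String protectedPageText) {
        Set<String> vocab = new LinkedHashSet<>();
        for (String token : protectedPageText.split("\\W+")) {
            if (!token.isEmpty()) vocab.add(stem(token));
        }
        return new ArrayList<>(vocab);
    }

    // Histogram vector: component i holds the frequency of vocabulary term i.
    public static int[] histogram(List<String> vocab, String pageText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : pageText.split("\\W+")) {
            if (!token.isEmpty()) counts.merge(stem(token), 1, Integer::sum);
        }
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            vector[i] = counts.getOrDefault(vocab.get(i), 0);
        }
        return vector;
    }
}
```

Note that, in line with point 1), only the protected page contributes terms; a suspicious page is then projected onto that fixed vocabulary.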
In reality, using only text content is insufficient to detect phishing web pages. This
method will usually result in high false positives, because phishing web pages are
highly similar to the targeted web pages not only in textual content but also in
visual content such as famous logos, layout, and overall style. In this system, we
therefore also use the SVM-based approach to measure the visual similarity
between an incoming web page and a protected web page.
First, we retrieve the suspected web pages and protected web pages from the web.
Second, we generate their signatures, which are used for the calculation of the
visual similarity between them. To this end, all the web page images are normalized into fixed-size
square images. We use these normalized images to generate the signature of each
web page.
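The normalization and signature steps can be sketched as below. The fixed square size and the coarse RGB-bin signature are assumptions made for illustration; the real system's signature features are not fully specified here.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Illustrative sketch: scale every page screenshot to a fixed-size square,
// then extract a simple signature of dominant colors (quantized RGB bins).
public class ImageSignature {
    public static final int SIZE = 100; // fixed-size square; value assumed

    public static BufferedImage normalize(BufferedImage page) {
        BufferedImage out = new BufferedImage(SIZE, SIZE, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(page, 0, 0, SIZE, SIZE, null); // scale to SIZE x SIZE
        g.dispose();
        return out;
    }

    // Signature: normalized histogram over 4x4x4 = 64 coarse RGB color bins.
    public static double[] signature(BufferedImage normalized) {
        double[] bins = new double[64];
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                int rgb = normalized.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, gc = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                bins[(r / 64) * 16 + (gc / 64) * 4 + (b / 64)] += 1.0 / (SIZE * SIZE);
            }
        }
        return bins;
    }
}
```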
The image classifier is implemented by setting a threshold, which is
estimated in a subsequent section. If the visual similarity between a suspected
web page and the protected web page exceeds the threshold, the web page is
classified as phishing; otherwise, it is classified as legitimate.
Step 1: Input the training set, train a text classifier and an image classifier, and
then collect similarity measurements from different classifiers.
Step 2: Partition the interval of similarity measurements into sub-intervals.
Step 3: Estimate the posterior probabilities conditioning on all the sub-intervals
for the text classifier.
Step 4: Estimate the posterior probabilities conditioning on all the sub-intervals
for the image classifier.
Step 5: For a new testing web page, classify it into the corresponding category by
using the text classifier and the image classifier.
Step 6: Display the results whether the given web page is phishing or not.
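Steps 2 through 5 can be sketched with a simple histogram-style posterior estimator, shown below for one classifier's similarity measurements. The bin count and the neutral 0.5 default for empty bins are assumptions for illustration, not the system's actual parameters.

```java
// Sketch of Steps 2-4: the similarity interval [0,1] is split into equal-width
// sub-intervals; on training data we count, per sub-interval, how often a page
// was actually phishing, giving an estimate of P(phishing | similarity bin).
public class PosteriorEstimator {
    public final int bins;
    public final int[] phishing; // phishing training pages per bin
    public final int[] total;    // all training pages per bin

    public PosteriorEstimator(int bins) {
        this.bins = bins;
        this.phishing = new int[bins];
        this.total = new int[bins];
    }

    public int bin(double similarity) {
        int b = (int) (similarity * bins);
        return Math.min(b, bins - 1); // similarity == 1.0 falls in the last bin
    }

    // Accumulate one labeled similarity measurement from a classifier.
    public void add(double similarity, boolean isPhishing) {
        int b = bin(similarity);
        total[b]++;
        if (isPhishing) phishing[b]++;
    }

    // Estimated posterior for a new measurement (used in Step 5).
    public double posterior(double similarity) {
        int b = bin(similarity);
        if (total[b] == 0) return 0.5; // no evidence in this bin: stay neutral
        return (double) phishing[b] / total[b];
    }
}
```

One estimator would be trained per classifier; the fusion step then combines the two posteriors for the final decision.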
Advantages
The data fusion framework enables us to directly incorporate the multiple
results produced by different classifiers.
The SVM algorithm is used for classifying both the textual and visual
content.
All phishing websites will be detected by using this approach.
Literature Survey
Phishing has drawn high attention from both industry and the academic research domain
since it is a severe security and privacy problem and has caused huge negative
impacts on the Internet world. It is threatening people's confidence in using the
Web to conduct online finance-related activities.
In this system, we propose an effective approach for detecting phishing Web
pages, which employs the Earth Mover's Distance (EMD) to calculate the
visual similarity of Web pages. The most important reason that Internet users
could become phishing victims is that phishing Web pages always have high
visual similarity with the real Web pages, such as visually similar block layouts,
dominant colors, images, and fonts. We follow the anti-phishing strategy
to obtain suspected Web pages, which are supposed to be collected from URLs
in those e-mails containing keywords associated with protected Web pages. We
first convert them into normalized images and then represent their image
signatures with features composed of dominant color category and its
corresponding centroid coordinate to calculate the visual similarity of two Web
pages.
The linear programming algorithm for EMD is applied to visual similarity
computation of the two signatures. An anti-phishing system may be requested to
protect many Web pages. A threshold is calculated for each protected Web page
using supervised training. If the EMD-based visual similarity of a Web page
exceeds the threshold of a protected Web page, we classify the Web page as a
phishing one.
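The full EMD is obtained by solving a linear program over the two multi-dimensional signatures. As a hedged illustration only: for one-dimensional normalized histograms, EMD reduces to the sum of absolute differences of the cumulative distributions, which the sketch below uses together with the thresholding rule described above. The similarity normalization is an assumption.

```java
// Simplified 1-D EMD illustration (not the paper's exact multi-dimensional
// linear-programming formulation): for two normalized histograms, EMD equals
// the accumulated absolute difference of their cumulative distributions.
public class Emd1D {
    public static double emd(double[] p, double[] q) {
        double cumulative = 0.0, distance = 0.0;
        for (int i = 0; i < p.length; i++) {
            cumulative += p[i] - q[i];      // running surplus of mass to move
            distance += Math.abs(cumulative);
        }
        return distance;
    }

    // Map distance to a similarity in [0,1] (normalization assumed), then
    // apply the trained per-page threshold as described above.
    public static boolean isPhishing(double[] p, double[] q, double maxDist, double threshold) {
        double similarity = 1.0 - Math.min(emd(p, q) / maxDist, 1.0);
        return similarity > threshold;
    }
}
```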
Evolving with the anti-phishing techniques, various phishing techniques and
more complicated and hard-to-detect methods are used by phishers. The most
straightforward way for a phisher to scam people is to make the phishing Web
pages similar to their targets.
A phishing strategy includes both Web link obfuscation and Web page
obfuscation. Web link obfuscation can be carried out in four basic ways: adding
a suffix to a domain name of the URL, using an actual link different from the
visible link, utilizing system bugs in real Web sites to redirect the link to the
phishing Web pages. Previous research works on duplicated document detection
approaches focus on plain text documents and use pure text features in
similarity measure, such as collection statistics, syntactic analysis, displaying
structure, visual-based understanding, vector space model, etc. Hoad and Zobel
have surveyed various methods on plagiarized document detection. However, as
Liu et al. demonstrated, pure text features are not sufficient for phishing Web
page detection since phishing Web pages mainly employ visual similarity to
scam users.
There are many anti-phishing techniques popularly used by the industry. The
most popular and major anti-phishing methods include authentication, which
includes e-mail authentication, and Web page authentication, filtering, which
includes e-mail filtering and Web page filtering, attack analyzing and tracing,
immune-system-like phishing report and detection, and network law
enforcement. Many user interface based anti-phishing approaches, including
Web browser toolbars, e-mail client agent toolbars, and distributed server
applications, are also commonly used.
Phishing is considered one of the most serious threats to the Internet and
e-commerce. Phishing attacks abuse trust with the help of deceptive e-mails,
fraudulent web sites, and malware. In order to prevent phishing attacks, some
organizations have implemented Internet browser toolbars for identifying
deceptive activities. However, the levels of usability and user interfaces
vary. Some of the toolbars have obvious usability problems, which can
ultimately affect the performance of these toolbars. For the sake of future
improvement, usability evaluation is indispensable. We will discuss usability of
five typical anti-phishing toolbars: built-in phishing prevention in Internet
Explorer 7.0, the Google toolbar, the Netcraft Anti-phishing toolbar, and Spoof-Guard.
In addition, we included an Internet Explorer plug-in we have developed,
Anti-phishing IEPlug. Our hypothesis was that usability of anti-phishing toolbars,
and as a consequence also security of the toolbars, could be improved. Indeed,
according to the heuristic usability evaluation, a number of usability issues
were found. In this article, we will describe the anti-phishing toolbars, we will
discuss the anti-phishing toolbar usability evaluation approach, and we will present
our findings.
Phishing has become one of the most serious network threats. Similar to
other malicious attacks, phishing can cause loss for both financial institutions
and consumers. However, unlike most crackers, phishers gain benefits by
accessing credential information, instead of causing system or network damage.
Moreover, phishing damages the trust of e-commerce. A devastating attack does
not require any emerging techniques. According to the March 2006 report of the
Anti-Phishing Working Group, the most frequently used artifices are deceptive
e-mails or web pages, Trojan horses, and key loggers. Moreover, more than 80%
of fraudulent web domains contain ambiguous names, for example, some form
of the target name in the URL or only an IP address without a host name. These
ambiguous domain names are hazardous for careless consumers.
Because of careless usability security design, phishers can easily take
advantage of poor usability design. In order to offer more reliable security,
anti-phishing toolbars should be easier to use. Moreover, as end-users must be able
to use the toolbars and make correct choices, usability evaluation of these
toolbars is important. Our research objective was to find out general usability
design principles for anti-phishing client-side applications. Such information
may result in valuable information for improving usability and security of
anti-phishing applications. Based on this motivation, we conducted a heuristic
usability evaluation of five toolbars. However, we must advise the reader that
we are not making a comparison of the toolbars in this system. An objective
comparison would require a different approach and should concentrate on
assessing phishing prevention capabilities.
In this system, we will present our evaluation and discuss the issues found
during the evaluation. In the following parts of this paper, we will first
introduce the features and characteristics of these five toolbars in order to make
readers aware of their basic functionalities from a technical perspective. After that,
we will present the heuristic evaluation methodology and the evaluation results
we found. Based on the results, we will give advice for improving the toolbars'
usability design. In conclusion, the usability evaluation is summarized, and the
impact of weak usability performance of the toolbars is discussed.
1. Toolbars based on client-server architecture, with anti-phishing prevention
combined with other functionalities. These types of toolbars need to
communicate with their servers in order to protect users from being
spoofed. However, these kinds of toolbars are not tailored just for phishing
prevention. Instead, there are other functionalities that are not related to
anti-phishing. For example, Google's Safe Browsing functionality is only one of
the toolbar's features. The other features include the Enhanced Search
Box, AutoFill, etc.
2. Toolbars based on client-server architecture and designed only for
phishing prevention. These are also based on the client-server structure, but
the functionality is only phishing prevention. Therefore, users can only find
phishing-related functionalities in their interfaces. For example, the Netcraft
toolbar is designed only for phishing prevention. Even though some of its
functionalities are not directly associated with anti-phishing, these are
designed to support identification of fraudulent web pages.
3. Toolbars installed on the local computer that detect fraudulent websites from
users' browsing information. Because of the lack of a server side, these
kinds of toolbars have to use the browsing information or the browsing
history for detection. This kind of data cannot be managed by the toolbars
themselves, but by web browsers. Therefore, users are required to
configure the browsing records carefully.
4. Toolbars installed on the local computer that detect fraudulent websites.
Different from the previous type, these toolbars must use some other
methods to identify spoofing websites, like a whitelist or general detection.
Compared with the third type of toolbar, users may more freely customize
their own preferences, e.g. authentic web sites.
When a user intends to submit sensitive data, she presses a dedicated security key on the keyboard to open the
Web Wallet. Using the Web Wallet, she may type her data or retrieve her stored
data. The data is then filled into the web form. But before the fill-in, the Web
Wallet checks if the current site is good enough to receive the sensitive data. If
the current site is not qualified, the Web Wallet requires the user to explicitly
indicate where she wants the data to go. If the user's intended site is not the
current site, the Web Wallet shows a warning to the user about this discrepancy,
and gives her a safe path to her intended site. There is one simple rule to
correctly use the Web Wallet: Always use the Web Wallet to submit sensitive
information by pressing the security key first. Equivalently, never submit
sensitive information directly through a web form because it is not a secure
practice.
We have run a user study to test the Web Wallet interface. The results are
promising:
The Web Wallet significantly decreased the spoof rate of normal phishing
attacks from 63% to 7%.
All the simulated phishing attacks in the study were effectively prevented by
the Web Wallet as long as it was used.
By disabling direct input into web forms and thus making itself the only way
to input sensitive information, the Web Wallet successfully trained a majority of
the subjects to use it to protect their sensitive information submission.
But there are also negative results which we plan to deal with in future research:
The subjects totally failed to differentiate the authentic Web Wallet interface
from a fake Web Wallet presented by a phishing site. This is a new type of
phishing attack that targets the Web Wallet interface itself.
De-fuzzification
This is the process of transforming a fuzzy output of a fuzzy inference
system into a crisp output. Fuzziness helps to evaluate the rules, but the final
output has to be a crisp number. The input for the de-fuzzification process is
the aggregate output fuzzy set and the output is a single number. This step was
done using the Centroid technique, since it is a commonly used method.
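The Centroid technique described above can be sketched directly: the crisp output is the center of gravity of the aggregate membership function, computed here on a discrete sample grid (the grid and membership values in the example are illustrative).

```java
// Centroid de-fuzzification: crisp output = sum(x_i * mu_i) / sum(mu_i),
// i.e. the center of gravity of the aggregate output fuzzy set, sampled
// at discrete points x_i with membership degrees mu_i.
public class CentroidDefuzzifier {
    public static double centroid(double[] x, double[] mu) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < x.length; i++) {
            num += x[i] * mu[i]; // weight each sample point by its membership
            den += mu[i];
        }
        return num / den; // crisp output
    }
}
```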
There are a number of challenges posed by doing post-hoc classification of
e-banking phishing websites. Most of these challenges apply only to the e-banking
phishing website data and materialize as a form of information loss, which has the net
effect of increasing the false negative rate. The age of the dataset is the most
significant problem, which is particularly relevant with the phishing corpus.
E-banking phishing websites are short-lived, often lasting only in the order of 48
hours. Some of our features can therefore not be extracted from older websites,
making our tests difficult. The average phishing site stays live for approximately
2.25 days. Furthermore, the process of transforming the original e-banking
phishing website archives into record feature datasets is not without error. It
requires the use of heuristics at several steps. Thus high accuracy from the data
mining algorithms cannot be expected. However, the evidence supporting the
golden nuggets comes from a number of different algorithms and feature sets, and we
believe it is compelling.
The term frequency (TF) is normalized to prevent a bias towards longer
documents and to give a measure of the importance of the term
within the particular document. The inverse document frequency (IDF) is a
measure of the general importance of the term. Roughly speaking, the IDF
measures how common a term is across an entire collection of documents.
Thus, a term has a high TF-IDF weight by having a high term frequency in a
given document and a low document frequency across the whole collection.
CANTINA works as follows:
Given a web page, calculate the TF-IDF scores of each term on that web
page. Generate a lexical signature by taking the five terms with the highest
TF-IDF weights, and feed this signature to a search engine.
If the domain name of the current web page matches the domain name of
the N top search results, we consider it to be a legitimate web site.
Otherwise, we consider it a phishing site.
Our technique makes the assumption that Google indexes the vast majority
of legitimate web sites, and that legitimate sites will be ranked higher than
phishing sites. Combined, these assumptions suggest that a phishing scam will rarely, if ever, be
highly ranked. At the end of this paper, however, we discuss some ways of
possibly subverting CANTINA.
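The TF-IDF scoring and lexical-signature step can be sketched as below. This is a simplified illustration under stated assumptions: tokenization is naive, the smoothed IDF formula is one common variant (not necessarily CANTINA's exact one), and the class and method names are hypothetical.

```java
import java.util.*;

// Sketch of the lexical-signature step: score each unique term of a document
// by TF-IDF against a small corpus, then take the five highest-weighted terms.
public class LexicalSignature {

    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        // Smoothed IDF variant; common terms get low (possibly negative) weight.
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
        return tf * idf;
    }

    public static List<String> signature(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : new HashSet<>(doc)) {
            scores.put(term, tfIdf(term, doc, corpus));
        }
        return scores.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(5) // the five terms with the highest TF-IDF weights
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toList());
    }
}
```

The resulting five-term signature would then be supplied to a search engine, and the page's domain compared against the top results, as the steps above describe.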
Age of Domain
This heuristic checks the age of the domain name. Many phishing sites have
domains that are registered only a few days before phishing emails are sent
out. We use a WHOIS search to implement this heuristic. This heuristic
measures the number of months from when the domain name was first
registered. If the page has been registered longer than 12 months, the
heuristic will return +1, deeming it as legitimate and otherwise returns -1,
deeming it as phishing. If the WHOIS server cannot find the domain, the
heuristic will simply return -1, deeming it a phishing page. The Netcraft
and Spoof-Guard toolbars use a similar heuristic based on the time since a
domain name was registered. Note that this heuristic does not account for
phishing sites based on existing web sites where criminals have broken into
the web server, nor does it account for phishing sites hosted on otherwise
legitimate domains, for example in space provided by an ISP for personal
homepages.
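The rule above can be sketched as follows. The WHOIS query itself is out of scope here; the sketch assumes the creation date has already been looked up, with a missing date (null) modeling a failed WHOIS lookup.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Sketch of the age-of-domain heuristic: +1 (legitimate) when the domain was
// registered more than 12 months ago, -1 (phishing) otherwise or when the
// WHOIS lookup fails. The class name is illustrative.
public class DomainAgeHeuristic {
    public static int score(LocalDate registered, LocalDate today) {
        if (registered == null) return -1; // WHOIS server cannot find the domain
        long months = ChronoUnit.MONTHS.between(registered, today);
        return months > 12 ? 1 : -1;
    }
}
```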
Known Images
This heuristic checks whether a page contains inconsistent well-known
logos. For example, if a page contains eBay logos
but is not on an eBay domain, then this heuristic labels the site as a probable
phishing page. Currently we store nine popular logos locally, including
eBay, PayPal, Citibank, Bank of America, Fifth Third Bank, Barclays Bank,
ANZ Bank, Chase Bank, and Wells Fargo Bank. Eight of these nine
legitimate sites are included in the PhishTank.com list of Top 10 Identified
Targets. A similar heuristic is used by the Spoof-Guard toolbar.
Suspicious URL
This heuristic checks if a page's URL contains an at (@) or a dash (-) in
the domain name. An @ symbol in a URL causes the string to the left to be
disregarded, with the string on the right treated as the actual URL for
retrieving the page. Combined with the limited size of the browser address
bar, this makes it possible to write URLs that appear legitimate within the
address bar, but actually cause the browser to retrieve a different page. This
heuristic is used by Mozilla Firefox. Dashes are also rarely used by
legitimate sites, so we use this as another heuristic. Spoof-Guard checks for
both at symbols and dashes in URLs.
Suspicious Links
This heuristic applies the URL check above to all the links on the page. If
any link on a page fails this URL check, then the page is labeled as a
possible phishing scam. This heuristic is also used by Spoof-Guard.
IP Address
This heuristic checks if a page's domain name is an IP address. This
heuristic is also used in PILFER.
Dots in URL
This heuristic checks the number of dots in a page's URL. We found that
phishing pages tend to use many dots in their URLs but legitimate sites
usually do not. Currently, this heuristic labels a page as phishing if there are 5
or more dots. This heuristic is also used in PILFER.
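The URL-based heuristics above (at-symbol and dash in the domain, IP-address host, and five or more dots) can be sketched together as follows. The regular expression and the combination rule are simplified illustrations, not the exact checks used by the cited tools.

```java
import java.util.regex.Pattern;

// Combined sketch of the URL heuristics: '@' or '-' in the domain, a bare
// IP address as host, or five or more dots in the URL each count against
// the page. Names and the simple OR-combination are illustrative.
public class UrlHeuristics {
    static final Pattern IP_HOST = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    public static boolean suspiciousDomain(String host) {
        return host.contains("@") || host.contains("-");
    }

    public static boolean isIpHost(String host) {
        return IP_HOST.matcher(host).matches();
    }

    public static boolean tooManyDots(String url) {
        return url.chars().filter(c -> c == '.').count() >= 5;
    }

    public static boolean looksPhishy(String url, String host) {
        return suspiciousDomain(host) || isIpHost(host) || tooManyDots(url);
    }
}
```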
Forms
This heuristic checks if a page contains any HTML text entry forms asking
for personal data from people, such as password and credit card number. We
scan the HTML for <input> tags that accept text and are accompanied by
labels such as "credit card" and "password".
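A regex-based version of this scan is sketched below. Real implementations would parse the DOM instead; the patterns here are simplified assumptions for illustration.

```java
import java.util.regex.Pattern;

// Sketch of the forms heuristic: flag pages whose HTML contains text or
// password <input> fields alongside labels such as "password" or
// "credit card". Regex scanning stands in for proper DOM parsing.
public class FormHeuristic {
    static final Pattern SENSITIVE_INPUT = Pattern.compile(
        "<input[^>]*type\\s*=\\s*[\"'](text|password)[\"']",
        Pattern.CASE_INSENSITIVE);
    static final Pattern SENSITIVE_LABEL = Pattern.compile(
        "password|credit\\s*card", Pattern.CASE_INSENSITIVE);

    public static boolean asksForSensitiveData(String html) {
        return SENSITIVE_INPUT.matcher(html).find()
            && SENSITIVE_LABEL.matcher(html).find();
    }
}
```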
Software Description
Java
Java is a programming language originally developed by James Gosling at Sun
Microsystems (now a subsidiary of Oracle Corporation) and released in 1995 as a
core component of Sun Microsystems' Java platform. The language derives much
of its syntax from C and C++ but has a simpler object model and fewer low-level
facilities. Java applications are typically compiled to byte code (class file) that can
run on any Java Virtual Machine (JVM) regardless of computer architecture. Java
is a general-purpose, concurrent, class-based, object-oriented language that is
specifically designed to have as few implementation dependencies as possible. It is
intended to let application developers "write once, run anywhere." Java is currently
one of the most popular programming languages in use, particularly for client-server
web applications.
The original and reference implementation Java compilers, virtual machines, and
class libraries were developed by Sun from 1995. As of May 2007, in compliance
with the specifications of the Java Community Process, Sun relicensed most of its
Java technologies under the GNU General Public License. Others have also
developed alternative implementations of these Sun technologies, such as the GNU
Compiler for Java and GNU Classpath.
Java Platform:
One characteristic of Java is portability, which means that computer programs
written in the Java language must run similarly on any hardware/operating-system
platform. This is achieved by compiling the Java language code to an intermediate
representation called Java byte code, instead of directly to platform-specific
machine code. Java byte code instructions are analogous to machine code, but are
intended to be interpreted by a virtual machine (VM) written specifically for the
host hardware. End-users commonly use a Java Runtime Environment (JRE)
installed on their own machine for standalone Java applications, or in a Web
browser for Java applets.
Standardized libraries provide a generic way to access host-specific features such
as graphics, threading, and networking.
A major benefit of using byte code is portability. However, the overhead of
interpretation means that interpreted programs almost always run more slowly than
programs compiled to native executables would. Just-in-Time (JIT) compilers, which
compile byte code to machine code at runtime, were introduced from an early stage
to address this.
Just as application servers such as GlassFish provide lifecycle services to web
applications, the NetBeans runtime container provides them to Swing applications.
Application servers understand how to compose web modules, EJB modules, and
so on, into a single web application, just as the NetBeans runtime container
understands how to compose NetBeans modules into a single Swing application.
Modularity offers a solution to "JAR hell" by letting developers organize their code
into strictly separated and versioned modules. Only those that have explicitly
declared dependencies on each other are able to use code from each other's
exposed packages. This strict organization is of particular relevance to large
applications developed by engineers in distributed environments, during the
development as well as the maintenance of their shared codebase.
End users of the application benefit too because they are able to install modules
into their running applications, since modularity makes them pluggable.
The name of a key map can be localized. Customizer panels contain an
instance of Project and
org.netbeans.modules.java.j2seproject.ui.customizer.J2SEProjectProperties. Please
note that the latter is not part of any public APIs, and you need an implementation
dependency to make use of it.
The "Projects/org-netbeans-modules-java-j2seproject/Nodes" folder's content is
used to construct the project's child nodes. Its content is expected to be
NodeFactory instances.
The "Projects/org-netbeans-modules-java-j2seproject/Lookup" folder's content is
used to construct the project's additional lookup. Its content is expected to be
LookupProvider instances. The J2SE project provides lookup mergers for Sources,
Privileged Templates, and Recommended Templates. Implementations added by
third parties will be merged into a single instance in the project's lookup.
Use the OptionsDialog folder for registration of custom top-level options panels.
Register your implementation of OptionsCategory there (as a *.instance file). The
standard file-system sorting mechanism is used.
The keyword void indicates that the main method does not return any value to the
caller. If a Java program is to exit with an error code, it must call System.exit()
explicitly.
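A minimal sketch of that rule (the class name ExitDemo and the status helper are ours, for illustration): returning normally from main always yields exit status 0, so an error code must go through System.exit.

```java
public class ExitDemo {
    // Maps a success flag to a process exit status: 0 for success, 1 for failure.
    static int status(boolean ok) {
        return ok ? 0 : 1;
    }

    public static void main(String[] args) {
        // System.exit must be called explicitly to report a non-zero status;
        // simply falling off the end of main exits with status 0.
        System.exit(status(args.length > 0));
    }
}
```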
The method name "main" is not a keyword in the Java language. It is simply the
name of the method the Java launcher calls to pass control to the program. Java
classes that run in managed environments such as applets and Enterprise
JavaBeans do not use or need a main() method. A Java program may contain
multiple classes that have main methods, which means that the VM needs to be
explicitly told which class to launch from.
The main method must accept an array of String objects. By convention, it is
referenced as args although any other legal identifier name can be used. Since Java
5, the main method can also use variable arguments, in the form of public static
void main(String... args), allowing the main method to be invoked with an arbitrary
number of String arguments. The effect of this alternate declaration is semantically
identical (the args parameter is still an array of String objects), but allows an
alternative syntax for creating and passing the array.
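The varargs declaration can be sketched in a short example (the class name ArgsDemo and the count helper are ours, for illustration); note that a caller may pass either separate strings or an explicit String[], since the two forms are semantically identical:

```java
public class ArgsDemo {
    // Varargs form of the entry point: identical at run time to main(String[] args).
    public static void main(String... args) {
        System.out.println("received " + count(args) + " argument(s)");
    }

    // Factored out so the counting logic can be exercised directly.
    static int count(String... args) {
        return args.length;   // args is still an ordinary String array
    }
}
```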
The Java launcher launches Java by loading a given class (specified on the
command line or as an attribute in a JAR) and starting its public static void
main(String[]) method. Stand-alone programs must declare this method explicitly.
The String[] args parameter is an array of String objects containing any arguments
passed to the class. The parameters to main are often passed by means of a
command line.
Printing is part of the Java standard library: the System class defines a public
static field called out. The out object is an instance of the PrintStream class and
provides many methods for printing data to standard out, including
println(String), which also appends a new line to the passed string.
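Put together, the canonical form looks like this (the class name HelloDemo is ours, for illustration):

```java
public class HelloDemo {
    public static void main(String[] args) {
        // System.out is a PrintStream; println writes the text to standard out
        // and appends a newline.
        System.out.println("Hello, World");
    }
}
```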
Java: A High-Level Language
A high-level programming language developed by Sun Microsystems. Java was
originally called OAK, and was designed for handheld devices and set-top boxes.
Oak was unsuccessful, so in 1995 Sun changed the name to Java and modified the
language to take advantage of the burgeoning World Wide Web.
Java is an object-oriented language similar to C++, but simplified to eliminate
language features that cause common programming errors. Java source code files
(files with a .java extension) are compiled into a format called byte code (files with
a .class extension), which can then be executed by a Java interpreter. Compiled
Java code can run on most computers because Java interpreters and runtime
environments, known as Java Virtual Machines (VMs), exist for most operating
systems, including UNIX, the Macintosh OS, and Windows. Byte code can also be
converted directly into machine language instructions by a just-in-time compiler
(JIT).
Java is a general purpose programming language with a number of features that
make the language well suited for use on the World Wide Web. Small Java
applications are called Java applets and can be downloaded from a Web server and
run on your computer by a Java-compatible Web browser, such as Netscape
Navigator or Microsoft Internet Explorer.
Object-oriented software development has matured significantly during the past
several years through the convergence of object-oriented modeling techniques and
notations and the development of object-oriented frameworks and design patterns.
WAMP Server
WAMPs are packages of independently created programs installed on computers
that use a Microsoft Windows operating system. WAMP is an acronym formed
from the initials of the operating system (Microsoft Windows) and the principal
components of the package: Apache, MySQL, and one of PHP, Perl, or Python.
Apache is a web server. MySQL is an open-source database. PHP is a scripting
language that can manipulate information held in a database and generate web
pages dynamically each time content is requested by a browser. Other programs
may also be included in a package, such as phpMyAdmin which provides a
graphical user interface for the MySQL database manager, or the alternative
scripting languages Python or Perl. Equivalent packages are MAMP (for the Apple
Mac) and LAMP (for the Linux operating system).
System Architecture
Modules
Loading web page training set.
Textual and visual content feature extraction.
Text and image classification.
Fusing of detected results.
Comparison of detected fusion results.
Module Description
Loading web page training set
The fusion algorithm merges the textual and visual classification results.
The fused results are compared with the original web page.
The posterior probability is computed from this similarity.
Using this probability, the fused results for false and true web pages are
compared.
The false web page is compared with the true web page.
The detected results are shown to the user.
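The fusion step above can be sketched as follows. This is a hypothetical illustration only: the class name FusionDemo, the linear weighting, and the concrete weights and threshold are ours, not taken from the system's actual SVM-based fusion model.

```java
public class FusionDemo {
    // Illustrative weights and matching threshold; the real system estimates
    // its threshold with an SVM-derived probabilistic model.
    static final double TEXT_WEIGHT = 0.6;
    static final double IMAGE_WEIGHT = 0.4;
    static final double THRESHOLD = 0.5;

    // Fuses the text classifier's phishing probability with the image
    // classifier's visual-similarity score; true means "flag as phishing".
    static boolean isPhishing(double textProb, double visualSimilarity) {
        double fused = TEXT_WEIGHT * textProb + IMAGE_WEIGHT * visualSimilarity;
        return fused >= THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(isPhishing(0.9, 0.8)); // both classifiers agree: phishing
        System.out.println(isPhishing(0.1, 0.2)); // both scores low: legitimate
    }
}
```

The point of the sketch is the structure, not the numbers: each classifier contributes one score, and the fused score is compared against a single matching threshold.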
System Requirements
Software Requirements
Operating System : Windows XP
Language : Core Java
Version : JDK 1.5
IDE
Database : MySQL
Hardware Requirements
Processor : Pentium IV
Clock Speed : 2.7 GHz
RAM Capacity : 1 GB
Hard Disk : 200 GB
Conclusion
A new content-based anti-phishing system has been thoroughly developed. In this
system, we presented a new framework to solve the anti-phishing problem. The
new features of this framework can be represented by a text classifier, an image
classifier, and a fusion algorithm. Based on the textual content, the text classifier is
able to classify a given web page into corresponding categories as phishing or
normal. This text classifier was modeled by SVM rule. Based on the visual content,
the image classifier, which relies on SVM, is able to calculate the visual similarity
between the given web page and the protected web page efficiently. The matching
threshold used in both text classifier and image classifier is effectively estimated
by using a probabilistic model derived from the SVM theory. A novel data fusion
model using the SVM theory was developed and the corresponding fusion
algorithm was presented. This data fusion framework enables us to directly
incorporate the multiple results produced by different classifiers and provides
insights for other data fusion applications. More importantly, our content-based
model can be easily embedded into current industrial anti-phishing systems.
Future Enhancement
Our future work will include:
Adding more features to the content representations in our current model.
Investigating incremental learning models to solve the knowledge-updating
problem in the current probabilistic model.
Adding more data sets with textual and visual content of web pages, for both
true and false web pages.
References
A. Y. Fu, W. Liu, and X. Deng, "Detecting phishing web pages with visual
similarity assessment based on earth mover's distance (EMD)," IEEE Trans.
Depend. Secure Comput., vol. 3, no. 4, pp. 301-311, Oct.-Dec. 2006.
Global Phishing Survey: Domain Name Use and Trends in 1H2009, Anti-Phishing
Working Group, Cambridge, MA [Online]. Available: http://www.antiphishing.org
Document Object Model (DOM) Level 1 Specification [Online]. Available:
http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001
C. R. John, The Image Processing Handbook. Boca Raton, FL: CRC Press, 1995.
F. Nah, "A study on tolerable waiting time: How long are web users willing to
wait?" in Proc. 9th Amer. Conf. Inf. Syst., Tampa, FL, Aug. 2003, p. 285.