6 February 2015
Prasant Kumar
Aman Pandey
Jadavpur University
EXECUTIVE SUMMARY
Objective
Uniform Resource Locators (URLs) are the primary means by which users locate resources
on the Internet. A web threat is any threat that uses the Internet to facilitate cyber
crime. Web vulnerabilities are on the rise with the growing use of smartphones and other
mobile devices for professional and personal purposes. Detecting malicious URLs is
therefore an essential task in network security. We wish to develop a classification
system for URLs that does not analyse their page contents.
Goals
Our goal is to identify malicious websites using the relationship between URLs and the
lexical features (hostname, domain name, path tokens, delimiters, etc.) and host-based
features (WHOIS information, external links, membership in blacklists, etc.) that
characterise them, together with each domain's traffic and popularity information.
Solution
Using the a priori information available about a vast number of malicious and benign
domains, we design a trainable classification module for binary classification of new or
unclassified URLs as malicious or benign.
Project Outline
The whole process comprises six distinct steps, listed below.
Data Collection
Feature Extraction
Data Filtering
Formulation of Training and Testing Sets
Predictor Training
Evaluation
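The last three steps (forming the training and testing sets, training the predictor, and evaluating it) can be sketched as follows. This is only a minimal illustration: the report does not state which classifier was used, so a nearest-centroid rule stands in here as an assumption, and the function names, the 25% test fraction, and the toy feature vectors are all hypothetical.

```python
import random


def split_train_test(samples, test_fraction=0.25, seed=0):
    """Step 4: shuffle labelled samples and split them into training and testing sets."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_centroids(training_set):
    """Step 5: compute one mean feature vector (centroid) per class label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, features):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, testing_set):
    """Step 6: fraction of test samples whose predicted label equals the true label."""
    correct = sum(predict(centroids, f) == y for f, y in testing_set)
    return correct / len(testing_set)


# Hypothetical hand-made vectors (URL length, delimiter count); not real data.
toy = [([20, 3], "benign"), ([25, 4], "benign"),
       ([22, 3], "benign"), ([24, 5], "benign"),
       ([90, 14], "malicious"), ([80, 12], "malicious"),
       ([95, 15], "malicious"), ([85, 13], "malicious")]
train, test = split_train_test(toy)
model = train_centroids(train)
print("accuracy on held-out toy samples:", accuracy(model, test))
```

Any binary classifier could replace the centroid rule; the split and the accuracy computation stay the same.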
Results
Based on the data collected and the classification mechanism employed, we obtained an
accuracy of 84.62%.
ABSTRACT
Malicious web sites are a cornerstone of Internet criminal activity. The dangers of these
sites have created a demand for safeguards that protect end users from visiting them.
Detecting malicious URLs and identifying threat types are critical to thwarting these
attacks. Knowing the type of a threat enables estimation of the severity of the attack
and helps in adopting effective countermeasures. Here we propose a machine learning
approach to classify web sites into two classes: benign and malicious. Our mechanism
analyses only the Uniform Resource Locator (URL) itself, without accessing the content of
the web site. It thus eliminates run-time latency and the possibility of exposing users to
browser-based vulnerabilities. Our method uses a variety of discriminative features,
including lexical features and host-based features. Our experimental studies with 100
benign URLs and 100 malicious URLs obtained from real-life Internet sources show that our
method performs well, with an accuracy of 84.62% in detecting malicious web sites.
In the next section we provide an overview of the problem, background on URL resolution,
and discussion of the features we use for our application.
• Lexical Features
• Host-Based Features
• Other Features
The lexical, host-based, and other features that we used in training the predictor are
listed below.
Lexical Features :
• Length of Hostname
• Length of URL
• Number of Dots (or delimiters) in URL
• Domain Token Count
• Path Token Count
• Average Domain Token Length
• Average Path Token Length
• Longest Domain Token Length
• Longest Path Token Length
• Presence of IP Address
Host-Based Features :
• Resolved IP count
• Name-Server count
• Number of Links on home page
• Number of images on home page
• ASN
Other Features :
• Global Rank
• Daily Pageviews per Visitor
• Daily Time on Site
• Total sites Linking in
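The lexical features listed above can be computed from the URL string alone, with no network access. The sketch below is one possible way to do so using only the Python standard library. The dictionary keys, the choice to tokenise the path together with the query string, and the IPv4-only address check are assumptions made for illustration; the delimiter set is the one given in this report's notes (. , / ? = % _).

```python
import re
from urllib.parse import urlparse

# Delimiter characters as listed in the report's notes.
DELIMITERS = ".,/?=%_"
_SPLIT_PATTERN = re.compile("[" + re.escape(DELIMITERS) + "]")
# Assumption: only dotted-quad IPv4 hostnames count as "presence of IP address".
_IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def _token_stats(tokens):
    """Return (count, average length, longest length) for a list of tokens."""
    if not tokens:
        return 0, 0.0, 0
    lengths = [len(t) for t in tokens]
    return len(tokens), sum(lengths) / len(lengths), max(lengths)


def lexical_features(url):
    """Compute lexical features from the URL text only."""
    # urlparse only fills in the hostname when a scheme or "//" is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    host = parsed.hostname or ""
    domain_tokens = [t for t in host.split(".") if t]
    # Assumption: path tokens cover the path and the query string.
    rest = parsed.path + "?" + parsed.query
    path_tokens = [t for t in _SPLIT_PATTERN.split(rest) if t]
    d_count, d_avg, d_max = _token_stats(domain_tokens)
    p_count, p_avg, p_max = _token_stats(path_tokens)
    return {
        "hostname_length": len(host),
        "url_length": len(url),
        "delimiter_count": sum(url.count(c) for c in DELIMITERS),
        "domain_token_count": d_count,
        "path_token_count": p_count,
        "avg_domain_token_length": d_avg,
        "avg_path_token_length": p_avg,
        "longest_domain_token_length": d_max,
        "longest_path_token_length": p_max,
        "has_ip_address": int(bool(_IPV4_PATTERN.match(host))),
    }


features = lexical_features("http://www.example.com/a/b_c.html?id=1")
for name, value in features.items():
    print(name, ":", value)
```

The averages are returned as floating-point values; they can be rounded if integer features are preferred.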
Length of Hostname : 14
Length of URL : 60
Number of dots(or delimiters) in URL : 6*
Domain Token Count : 2
Path Token Count : 4
Avg Domain Token Length : 7
Avg Path Token Length : 9
Longest Domain token Length : 10
Longest path Token Length : 11
Presence of IP address : 0
Other Features :
Link Popularity (Sites linking in) : 50^^^
Real traffic (Traffic Rank) : 70747^^^
* Delimiters include :
.,/?=%_
^ Resolved IP address :
199.58.208.116
^^ The resolved name servers are as follows :
PDNS196.ULTRADNS.BIZ
PDNS196.ULTRADNS.CO.UK
PDNS196.ULTRADNS.COM
PDNS196.ULTRADNS.INFO
PDNS196.ULTRADNS.NET
PDNS196.ULTRADNS.ORG
^^^ Obtained from alexa.com
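The resolved IP address and name servers listed in the notes above feed the two host-based counts (resolved IP count and name-server count). A minimal sketch of how these counts might be derived is given below. The function names are hypothetical, the resolver is made injectable so the code can run without network access, and the name-server list is assumed to have been obtained separately (for example from a WHOIS or DNS query), since the Python standard library has no NS lookup.

```python
import socket


def resolved_ip_count(hostname, resolver=socket.gethostbyname_ex):
    """Count the distinct IPv4 addresses a hostname resolves to.

    By default this performs a live DNS lookup; pass a stub `resolver`
    returning (name, aliases, addresses) to avoid network access.
    """
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Assumption: an unresolvable hostname contributes a count of zero.
        return 0
    return len(set(addresses))


def name_server_count(ns_records):
    """Count distinct name servers, ignoring case and any trailing dot."""
    cleaned = {ns.strip().rstrip(".").lower() for ns in ns_records if ns.strip()}
    return len(cleaned)


# Values taken from the notes above; the stub stands in for a real DNS lookup.
def stub_resolver(hostname):
    return hostname, [], ["199.58.208.116"]


name_servers = [
    "PDNS196.ULTRADNS.BIZ", "PDNS196.ULTRADNS.CO.UK", "PDNS196.ULTRADNS.COM",
    "PDNS196.ULTRADNS.INFO", "PDNS196.ULTRADNS.NET", "PDNS196.ULTRADNS.ORG",
]
print("resolved IP count:", resolved_ip_count("example.invalid", stub_resolver))
print("name-server count:", name_server_count(name_servers))
```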
OVERVIEW CONTINUED
For our approach we adopt the steps as shown in the flowchart below :