
Jadavpur University

FINAL YEAR PROJECT REPORT

Reducing Web Vulnerabilities by Detecting Malicious URLs
Prepared by: Prasant Kumar & Aman Pandey, BCSE IV, Jadavpur University

6 February 2015


EXECUTIVE SUMMARY
Objective
Uniform Resource Locators (URLs) are the primary means by which users locate resources
on the Internet. A web threat is any threat that uses the Internet to facilitate cybercrime.
Web vulnerabilities are on the rise with the growing use of smartphones and other mobile
devices for both professional and personal purposes. Detecting malicious URLs is therefore
an essential task in network security. We wish to develop a classification system for URLs
that does not analyse their page contents.

Goals
Our goal is to identify malicious websites using the relationship between URLs and the
lexical (hostname, domain name, path tokens, delimiters, etc.) and host-based (WHOIS
information, external links, blacklist membership, etc.) features that characterise them,
together with the domain's traffic and popularity information.

Solution
Using the a priori information available about a vast number of known malicious and benign
domains, we design a trainable classification module for binary classification of new or
unclassified URLs as malicious or benign.

Project Outline
The whole process comprises six distinct steps:
1. Data Collection
2. Feature Extraction
3. Data Filtering
4. Formulation of Training and Testing Sets
5. Predictor Training
6. Evaluation

Results
Based on the data collected and the classification mechanism employed, we obtained an
accuracy of 84.62%.


ABSTRACT

Malicious websites are a cornerstone of criminal activity on the Internet. The dangers of
these sites have created a demand for safeguards that protect end users from visiting them.
Detection of malicious URLs and identification of threat types are critical to thwarting these
attacks. Knowing the type of threat enables estimation of the severity of an attack and
helps in adopting effective countermeasures. Here we propose a machine learning approach
to classify websites into two classes: benign and malicious. Our mechanism analyses only
the Uniform Resource Locator (URL) itself, without accessing the content of the website; it
thus eliminates run-time latency and the possibility of exposing users to browser-based
vulnerabilities. Our method uses a variety of discriminative features, including lexical and
host-based features. Our experimental study with 100 benign URLs and 100 malicious URLs
obtained from real-life Internet sources shows that our method delivers good performance,
with an accuracy of 84.62% in detecting malicious websites.
In the next section we provide an overview of the problem, background on URL resolution,
and a discussion of the features we use for our application.


AN OVERVIEW OF THE APPROACH


Our method uses the following set of discriminative features, which can be broadly
classified into three groups:

- Lexical features
- Host-based features
- Other features

The lexical, host-based, and other features used in training the predictor are listed below.

Lexical Features:
- Length of hostname
- Length of URL
- Number of dots (or delimiters) in URL
- Domain token count
- Path token count
- Average domain token length
- Average path token length
- Longest domain token length
- Longest path token length
- Presence of IP address

Host-Based Features:
- Resolved IP count
- Name-server count

Other Features:
- Number of links on home page
- Number of images on home page
- ASN
- Global Rank
- Daily pageviews per visitor
- Daily time on site
- Total sites linking in

AN ILLUSTRATION OF THE FEATURES


Let us now take a URL and extract its lexical and host-based features:
URL : www.ultratools.com/whois/serverStats?selectedTab=serverStats
Lexical Features:
- Length of hostname: 14
- Length of URL: 60
- Number of dots (or delimiters) in URL: 6*
- Domain token count: 2
- Path token count: 4
- Average domain token length: 7
- Average path token length: 9
- Longest domain token length: 10
- Longest path token length: 11
- Presence of IP address: 0

Host-Based Features:
- Resolved IP count: 1^
- Name-server count: 6^^

Other Features:
- Link popularity (sites linking in): 50^^^
- Real traffic (traffic rank): 70747^^^


* Delimiters include: . , / ? = % _
^ Resolved IP address: 199.58.208.116
^^ The resolved name servers are as follows:
PDNS196.ULTRADNS.BIZ
PDNS196.ULTRADNS.CO.UK
PDNS196.ULTRADNS.COM
PDNS196.ULTRADNS.INFO
PDNS196.ULTRADNS.NET
PDNS196.ULTRADNS.ORG
^^^ Obtained from alexa.com
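The lexical values above can be reproduced with plain string manipulation. Below is a minimal sketch; the function and key names are our own illustrative choices, not part of the report's tooling:

    import re
    from urllib.parse import urlparse

    DELIMITERS = ".,/?=%_"  # the delimiter set defined in the footnote above

    def lexical_features(url: str) -> dict:
        # urlparse only finds the hostname when a scheme is present
        parsed = urlparse(url if "://" in url else "http://" + url)
        host = parsed.hostname or ""
        if host.startswith("www."):
            host = host[4:]  # the report measures the hostname without "www."
        domain_tokens = [t for t in host.split(".") if t]
        path_query = parsed.path + ("?" + parsed.query if parsed.query else "")
        path_tokens = [t for t in re.split(r"[.,/?=%_]", path_query) if t]
        return {
            "hostname_length": len(host),
            "url_length": len(url),
            "delimiter_count": sum(url.count(c) for c in DELIMITERS),
            "domain_token_count": len(domain_tokens),
            "path_token_count": len(path_tokens),
            # the report quotes the two averages as rounded whole numbers (7 and 9 here)
            "avg_domain_token_length": sum(map(len, domain_tokens)) / len(domain_tokens),
            "avg_path_token_length": sum(map(len, path_tokens)) / len(path_tokens),
            "longest_domain_token_length": max(map(len, domain_tokens)),
            "longest_path_token_length": max(map(len, path_tokens)),
            # simplified check: hostname made up solely of digits and dots
            "has_ip_address": int(host.replace(".", "").isdigit()),
        }

    print(lexical_features("www.ultratools.com/whois/serverStats?selectedTab=serverStats"))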


OVERVIEW CONTINUED
For our approach we adopt the steps shown in the flowchart below:

[Flowchart: Data Collection → Feature Extraction → Data Filtering → Formulation of Training and Testing Sets → Predictor Training → Evaluation]

A brief overview of the various stages:
Data Collection:
We collected a list of 200 URLs (100 malicious and 100 benign) from the websites
mentioned below:
- www.phishtank.com for phishing websites
- www.maliciousdomainlist.com and www.malcode.com/database/ for URLs hosting malicious content
- www.dmoz.org for benign URLs


Feature Extraction:
The lexical features can be extracted by simple string manipulation of the URLs, as in the
sketch shown earlier. For the host-based features we make use of the online tools available
at www.ultratools.com, and for other features such as link popularity and traffic we take
the help of www.alexa.com. A programmatic alternative for the host-based features is
sketched below.
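We gathered these values via online tools; for completeness, here is a minimal sketch of how they could be fetched in code, assuming the standard socket module and the third-party dnspython package (our assumption, not a tool named in this report):

    import socket
    import dns.resolver  # third-party "dnspython" package (version 2.x API)

    def host_features(domain: str) -> dict:
        # gethostbyname_ex returns (canonical name, aliases, list of resolved IPs)
        _, _, ips = socket.gethostbyname_ex(domain)
        # an NS query returns the authoritative name servers for the domain
        name_servers = dns.resolver.resolve(domain, "NS")
        return {
            "resolved_ip_count": len(ips),                 # e.g. 1 for ultratools.com above
            "name_server_count": len(list(name_servers)),  # e.g. 6 for ultratools.com above
        }

    print(host_features("ultratools.com"))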

Data Filtering:
After all the features have been extracted, some data filtering may be required if one or
more values are missing for a particular URL or if there are duplicate rows, as in the
sketch below.
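A minimal filtering sketch, assuming the extracted features are held in a pandas DataFrame (pandas is our assumption, and "url_features.csv" is a hypothetical file name):

    import pandas as pd

    # Assumed layout: one row per URL, one feature per column, plus a class label.
    df = pd.read_csv("url_features.csv")

    df = df.drop_duplicates()  # remove duplicate rows
    df = df.dropna()           # drop URLs with one or more missing feature values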

Training and Testing Set:
The data set will be split into two parts: 70% of the malicious and benign URLs will go into
the training set, which will be used for training the predictor, and the remaining 30% will
form the testing set used to evaluate it.
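A sketch of such a split, assuming scikit-learn (the report does not name a specific library) and the filtered DataFrame from the previous step:

    from sklearn.model_selection import train_test_split

    # X holds the feature columns, y the labels (1 = malicious, 0 = benign);
    # the "label" column name is our assumption.
    X = df.drop(columns=["label"]).values
    y = df["label"].values

    # 70/30 split as described above; stratify=y keeps the class balance in both sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42)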

Predictor Training:
An appropriate model, or a few candidate models, will be defined and tested; likely
candidates are Support Vector Machines and a multivariate regression model. This
constitutes the major chunk of the work involved. We used a Support Vector Machine with
a linear kernel for our training phase.
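Continuing the same hypothetical scikit-learn setup, the scaling and training steps might look like:

    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Scale the features, then fit a linear-kernel SVM on the training set,
    # mirroring the setup described above.
    scaler = StandardScaler().fit(X_train)
    clf = SVC(kernel="linear")
    clf.fit(scaler.transform(X_train), y_train)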

Evaluation:
Once the predictor has been trained, i.e., once it has learned a pattern for classifying
URLs, the testing set will be used to evaluate its accuracy.
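And the corresponding evaluation step in the same sketch:

    from sklearn.metrics import accuracy_score

    y_pred = clf.predict(scaler.transform(X_test))
    print("accuracy:", accuracy_score(y_test, y_pred))  # we obtained 0.8462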


Let us assume that we have a list of m pre-classified URLs consisting of p benign and q
malicious URLs. Say we extract a total of n features (lexical, host-based and others) from
each URL. Let the features be represented by f1, f2, ..., fn, so that the feature vector is
<f1, f2, f3, ..., fn>. These features represent a point in n-dimensional space, and the total
number of such points is m (= p + q), the total number of URLs. Since we already know the
classification of these URLs, we can fit a curve in n dimensions to separate (if not all, then
the maximum number of) benign URLs from the malicious URLs.
Our dataset can be represented as a matrix:

f(1,1) f(1,2) f(1,3) ... f(1,n)
f(2,1) f(2,2) f(2,3) ... f(2,n)
  ...
f(m,1) f(m,2) f(m,3) ... f(m,n)

Let our decision function (or hypothesis) be of the form
h(x) = w0 + w1x1 + w2x2 + ... + wnxn, where xi = fi are the coordinates of a point in the
n-dimensional feature space. Using a cost function (say the mean squared loss) and the
gradient descent algorithm, we can compute the values of w0, w1, ..., wn that best fit a
decision curve separating the two classes of URLs.
Say h(x) >= 1 for benign URLs and h(x) < 1 for malicious URLs.

Then for any new URL, its features need only be extracted and fed to the decision function
for classification, as in the sketch below.
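As a concrete illustration of this formulation, here is a minimal gradient-descent sketch in NumPy. The learning rate, epoch count, and target encoding (2 for benign, 0 for malicious, so the classes fall on either side of h(x) = 1) are our assumptions:

    import numpy as np

    # X: (m, n) feature matrix; t: targets, e.g. 2 for benign and 0 for malicious.
    def train_linear(X, t, lr=0.01, epochs=1000):
        m, n = X.shape
        Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of 1s for w0
        w = np.zeros(n + 1)
        for _ in range(epochs):
            h = Xb @ w                          # h(x) = w0 + w1*x1 + ... + wn*xn
            grad = (2.0 / m) * Xb.T @ (h - t)   # gradient of the mean squared loss
            w -= lr * grad
        return w

    def classify(w, x):
        # the decision rule above: h(x) >= 1 -> benign, h(x) < 1 -> malicious
        return "benign" if w[0] + w[1:] @ x >= 1 else "malicious"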


RESULTS & CONCLUSION


Having collected 200 URLs (100 malicious and 100 benign) and extracted the features, we
formulated the training and testing sets, comprising 135 and 65 URLs respectively, selected
at random. The features were scaled as needed. The SVM predictor was trained on the
training file using a linear kernel model. The predictor was then assessed on the testing
file and gave an accuracy of 84.62%.
The accuracy obtained, though not the best attained in this field, suggests that more can
be done to improve it. At this juncture, three possible next steps can be seen; they are
reviewed below.

Increasing the number of URLs (sample space)


We feel that the more labelled examples the predictor receives in its training file, covering
a wide range of URLs, the better the training model that can be formed, resulting in
improved accuracy.

Revisiting the features employed (feature space)


Currently we use a total of 19 features to evaluate a URL, but more features could be
added to help improve accuracy. We could include the number of backlinks the domain has,
the number of its pages indexed by Google or Yahoo, and the output of McAfee SiteAdvisor,
to name a few. We feel that a larger and more varied set of features would provide a wider
range of information about the domain and, in turn, better accuracy.

Using a different kernel/learning algorithm

Alternative kernels (such as polynomial or RBF) or altogether different learning algorithms
could be evaluated in place of the linear-kernel SVM used here.
