ABSTRACT
Many recent computer attacks are launched when a user visits a malicious webpage. A malicious URL is
a link pointing to a malware or phishing site. A user can be tricked into voluntarily giving away
confidential information on a phishing page, or fall victim to a drive-by download that results in a
malware infection, which may then propagate through the victim's contact list. Moreover, attackers
sometimes use social engineering tricks that make malicious URLs hard to identify. In this paper we
aim to construct a real-time system that uses machine learning techniques to detect malicious URLs
(spam, phishing, exploits, and so on). To that end, we consider techniques that classify URLs based
on their lexical and host-based features, together with online learning to process large numbers of
examples and adapt quickly as URLs evolve over time. However, in a real-world malicious URL
detection task, the numbers of malicious and legitimate URLs are highly imbalanced, so simply
optimizing prediction accuracy is inappropriate. In addition, previous work assumes that a large
amount of labeled training data is available, which is impractical because human labeling can be
expensive. To address these issues, we present a novel framework of Cost-Sensitive Online Active
Learning (CSOAL).
Index Terms: Malicious URL Detection, Cost-Sensitive Learning, Online Learning, Active Learning.
I. INTRODUCTION
The World Wide Web (WWW) gives people access to information of all kinds on the internet, but it also
carries harmful content such as counterfeit drugs and malware. It supports criminal enterprises such as
spam-advertised commerce (e.g., counterfeit watches or pharmaceuticals) and financial fraud (e.g., via
phishing), and it serves as a vector for propagating malware (e.g., so-called drive-by downloads) [1][2].
A user reaches information on the Web, whether trusted or suspicious, by clicking on a URL (Uniform
Resource Locator) that links to a particular website. It is thus important for internet users to assess
the risk of clicking a URL in order to avoid accessing malicious web sites.
Although the exact adversary mechanisms behind web criminal activities may vary, they all try to lure users
to visit malicious websites by clicking a corresponding URL [3]. The motivations behind these schemes may
differ, but what they have in common is the requirement that unsuspecting users visit their sites.
12 | 2015, IJAFRC All Rights Reserved
www.ijafrc.org
REFERENCES
[1] J. Wang, P. Zhao, and S. C. H. Hoi, Cost-sensitive online classification, IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, Oct. 2014.
[2] P. Zhao and S. C. H. Hoi, Cost-sensitive online active learning with application to malicious URL detection, School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Aug. 11-14, 2013.
[3] R. Akbani, S. Kwek, and N. Japkowicz, Applying support vector machines to imbalanced datasets, in Proc. 15th ECML, Pisa, Italy, 2004, pp. 39-50.
[4] B. R. Bocka, Methods for multidimensional event classification: A case study using images from a Cherenkov gamma-ray telescope, Nucl. Instrum. Meth., vol. 516, no. 2-3, pp. 511-528, 2004.
[5] G. Blanchard, G. Lee, and C. Scott, Semi-supervised novelty detection, J. Mach. Learn. Res., vol. 11, pp. 2973-3009, Nov. 2010.
[6] V. Chandola, A. Banerjee, and V. Kumar, Anomaly detection: A survey, ACM CSUR, vol. 41, no. 3, Article 15, 2009.
[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic minority oversampling technique, J. Artif. Intell. Res., vol. 16, no. 1, pp. 321-357, 2002.
[8] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer, Online passive-aggressive algorithms, J. Mach. Learn. Res., vol. 7, pp. 551-585, Mar. 2006.
[9] K. Crammer, M. Dredze, and F. Pereira, Exact convex confidence weighted learning, in Proc. NIPS, 2008, pp. 345-352.
[10] P. Domingos, MetaCost: A general method for making classifiers cost-sensitive, in Proc. 5th ACM SIGKDD Int. Conf. KDD, San Diego, CA, USA, 1999, pp. 155-164.
[11] M. Dredze, K. Crammer, and F. Pereira, Confidence-weighted linear classification, in Proc. 25th ICML, Helsinki, Finland, 2008, pp. 264-271.
[12] C. Elkan, The foundations of cost-sensitive learning, in Proc. 17th IJCAI, San Francisco, CA, USA, 2001, pp. 973-978.
[13] Y. Freund and R. E. Schapire, Large margin classification using the perceptron algorithm, Mach. Learn., vol. 37, no. 3, pp. 277-296, 1999.
[14] C. Gentile, A new approximate maximal margin classification algorithm, J. Mach. Learn. Res., vol. 2, pp. 213-242, Dec. 2001.
[15] S. C. H. Hoi, R. Jin, P. Zhao, and T. Yang, Online multiple kernel classification, Mach. Learn., vol. 90, no. 2, pp. 289-316, 2013.
AUTHORS PROFILE