INTRODUCTION
1.1 OVERVIEW
Finding compromised machines and spammers on the Internet is a challenging and complicated task. Compromised machines, which are known as zombies, are increasingly used to spread various security attacks such as distributing malware and spamming. Attackers can recruit a large number of compromised machines in a network through spamming activity. Email spam is unsolicited, anonymous email sent to a large group of users. The main focus of the proposed system is the detection of users who are involved in spamming activity and the detection of email attachments carrying viruses. A spam zombie is a compromised machine that is involved in spamming activity. Spammers perform various security attacks such as capturing users' secret data, click fraud, phishing, etc. It is therefore necessary to identify and block such spammers in a network. The proposed system detects and blocks spammers in a network. Existing spammer detection methods detect spammers in a social network, but our aim is to help system administrators detect the spammers in their own networks. The system deletes emails with attachments containing virus files. To reactivate the email account, the user needs to pass a test. A SPOT detection algorithm is used by the system to detect spammers. The proposed system assists system administrators in automatically identifying spammers in an online manner; in addition, it helps in identifying emails carrying viruses. The SPOT detection algorithm is based on a statistical tool known as the Sequential Probability Ratio Test (SPRT), designed by Wald in his seminal work.
1.5 THESIS CONTRIBUTION
The proposed system aims to delete emails with attached viruses. It detects and blocks spammers using the SPOT detection algorithm, and it provides an account reactivation test. When the system receives an email message, it first checks the attachment for a virus and deletes any email whose attachment contains virus files. If no virus is found, the message is checked for spam, and the system applies the SPOT detection algorithm to detect spammers. The system performs three tasks:
1) Virus check
2) Spam check and spam filter
3) Blocking of spammers using SPOT, and recovery
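The three steps above can be sketched as a simple message-handling routine; the function and field names below are illustrative assumptions, not the actual implementation:

```python
def handle_message(message, virus_scan, spam_check, spot):
    """Process one incoming email through the proposed pipeline.

    virus_scan(attachment) -> bool, spam_check(body) -> bool, and
    spot.observe(sender, is_spam) -> bool (True once the sender is
    flagged as a spammer) are assumed interfaces.
    """
    # Step 1: virus check - delete mail whose attachments carry a virus.
    if any(virus_scan(a) for a in message.get("attachments", [])):
        return "deleted"
    # Step 2: spam check on the message content.
    is_spam = spam_check(message["body"])
    # Step 3: feed the observation to the SPOT detector for this sender;
    # the sender is blocked once SPOT reaches a decision.
    if spot.observe(message["sender"], is_spam):
        return "sender blocked"
    return "filtered as spam" if is_spam else "delivered"
```

A clean message thus passes all three checks and is delivered unchanged, while infected mail never reaches the spam-filtering stage.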
This section first describes the various kinds of data that can be analyzed from email traffic and the levels of privacy involved. Secondly, it gives a brief overview of link analysis techniques that can be applied to network security. Further, our approaches are explained in detail, and results of their experimental evaluation are presented. The issue addressed is identifying the machines that are sending spam, or machines that have been compromised and are being used as spam relays. Note that our focus is not on identifying individual users who send spam, nor on filtering an email as spam based on its content; work in such areas is not directly related to ours. Recent work on detection of spam Trojans suggests the use of signature- and behavior-based techniques. The authors propose SpamMail, a new approach to ranking and classifying emails according to the addresses of email senders. The central procedure is to collect data about trusted email addresses from different sources and to create a graph of the social network derived from each user's communication circle.
There are two SpamMail variants, both of which apply a power-iteration algorithm to the email network graph: basic SpamMail yields a global reputation for each known email address, and personalized SpamMail computes a personalized trust value. SpamMail makes it possible to classify email addresses into 'spammer address' and 'non-spammer address' and, additionally, to determine the relative rank of an email address with respect to other email addresses. The authors also analyze the performance of SpamMail under several scenarios, including sparse networks, and show its resilience against spammer attacks. They investigated the feasibility of SpamMail, a new email ranking and classification scheme, which intelligently exploits the social communication network created via email interactions. On the resulting email network graph, a power-iteration algorithm is used to rank trustworthy senders and to detect spammers. SpamMail performs well even in the presence of very sparse networks: even with a low participation rate, it can effectively distinguish between spammer email addresses and non-spammer ones, including for users not participating actively. SpamMail is also very resistant against spammer attacks and, in fact, has the property that as more spammer email addresses are introduced into the system, the performance of SpamMail increases.
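The power-iteration step at the core of such reputation schemes can be sketched as follows; this is a generic PageRank-style iteration under assumed graph and damping choices, not the actual SpamMail implementation:

```python
def rank_addresses(graph, iterations=50, damping=0.85):
    """PageRank-style power iteration over an email network graph.

    graph maps each address to the list of addresses it links to
    (every address is assumed to appear as a key). Returns a
    reputation score per address; low scores suggest spammers.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v] or nodes  # dangling nodes spread uniformly
            share = damping * rank[v] / len(targets)
            for u in targets:
                new[u] += share
        rank = new
    return rank
```

Addresses that accumulate trust edges from well-ranked senders end up with higher scores, which is the intuition behind separating spammer from non-spammer addresses.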
This thesis focuses on the subset of compromised machines that are used for sending spam messages, which are commonly referred to as spam zombies. Given that
spamming provides a critical economic incentive for the controllers of the compromised
machines to recruit these machines, it has been widely observed that many compromised
machines are involved in spamming. A number of recent research efforts have studied the
aggregate global characteristics of spamming botnets (networks of compromised machines
involved in spamming) such as the size of botnets and the spamming patterns of botnets,
based on the sampled spam messages received at a large email service provider.
This chapter focuses on the studies that utilize spamming activities to detect bots. Based on email messages received at a large email service provider, two recent studies
[2, 3] investigated the aggregate global characteristics of spamming botnets including the size
of botnets and the spamming patterns of botnets. These studies provided important insights
into the aggregate global characteristics of spamming botnets by clustering spam messages
received at the provider into spam campaigns using embedded URLs and near-duplicate
content clustering, respectively. However, their approaches are better suited for large email
service providers to understand the aggregate global characteristics of spamming botnets
instead of being deployed by individual networks to detect internal compromised machines.
Moreover, their approaches cannot support the online detection requirement in the network
environment considered in this thesis. We aim to develop a tool to assist system
administrators in automatically detecting compromised machines in their networks in an
online manner.
Xie et al. developed an effective tool, DBSpam, to detect proxy-based spamming activities
in a network relying on the packet symmetry property of such activities [5]. We intend to
identify all types of compromised machines involved in spamming, not only the spam proxies
that translate and forward upstream non-SMTP packets (for example, HTTP) into SMTP
commands to downstream mail servers as in [5].
BotHunter [6], developed by Gu et al., detects compromised machines by correlating the
IDS dialog trace in a network. It was developed based on the observation that a complete
malware infection process has a number of well-defined stages including inbound scanning,
exploit usage, egg downloading, outbound bot coordination dialog, and outbound attack
propagation. By correlating inbound intrusion alarms with outbound communications
patterns, BotHunter can detect the potential infected machines in a network. Unlike
BotHunter which relies on the specifics of the malware infection process, SPOT focuses on
the economic incentive behind many compromised machines and their involvement in
spamming. Compared to BotHunter, SPOT is a light-weight spam zombie detection system; it
does not need the support from the network intrusion detection system as required by
BotHunter.
As a simple and powerful statistical method, Sequential Probability Ratio Test (SPRT)
has been successfully applied in many areas [7]. In the area of networking security, SPRT has
been used to detect portscan activities [8], proxy-based spamming activities [5], and MAC
protocol misbehavior in wireless networks [9].
2.1.4 PROTECTION
Email is a convenient medium for sharing files as attachments with other users in a group. Malicious attachments propagating viruses or worms are creating havoc in the email system and wasting email and IT resources. Current email service providers utilize one or more integrated anti-virus products to check for and identify malicious attachments. Most current anti-virus products work on the basis of signatures. A strong shortcoming of this approach is that, because it relies on known virus signatures, it cannot detect unknown or new malicious attachments, i.e. it does not solve the zero-day virus problem. Substantial research has explored using heuristics and machine learning algorithms to learn virus patterns (Kolter and Maloof, 2014).
Threading Message
Threading messages has been used for years by newsgroup readers as a way of organizing message topics. Threads are usually based on linking subject lines or on the message 'reply-to' id in the email header field. Recent work on visualizing conversation threads (Venolia and Neustaedter, 2013; Kerr, 2003) offers excellent propositions once an important or relevant email has been located. However, if the user has a few hundred messages sitting in the INBOX without priority reorganization, picking out the start or middle of an interesting thread is not an easy task.
2.2 EMAIL CLASSIFICATION
One way to help the user organize email is to have the email client automatically either discard messages or move them into specific folders for the user's convenience. One of the earliest systems, ISCREEN (Pollock, 1988), had a rich set of rules and policies that allowed the user to create rule sets to handle incoming emails. Ishmail (Helfman and Isbell, 1995) helped organize messages by also providing summaries to the user of what groups of new messages were being moved and where.
False Positive Rate - the percentage of non-target examples which our model has misidentified as the target concept. Generally our goal is to minimize this measurement while not increasing the error rate; the cost associated with false positives is generally higher than that of false negatives. The false positive rate is computed as:

              # misidentified as target
    FP rate = --------------------------------
              Total # of non-target examples

False Negative Rate - the proportion of target instances that were erroneously reported as non-target. When tuning the detection algorithm we must find a balance between false negatives and false positives. A threshold is applied over all examples: the higher this threshold, the more false negatives and the fewer false positives. The false negative rate is computed as:

              # misidentified as non-target
    FN rate = --------------------------------
              Total # of target examples
Sample Error Rate - the number of training examples that the model has misclassified divided by the total number of examples seen. This is one measure to estimate
how well the classifier has learned the target function.
True Error Rate - the probability that the model will misclassify an example, given a specific training sample and sample error rate. This quantity is hard to measure accurately, but can be approximated if the training set closely resembles the true distribution of
future examples. In other words, if we train on half spam and half non-spam examples, but in
reality 90% of examples will be spam, the sample error will not be an accurate measurement
of the model's error rate.
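Under the assumption that the target class (spam) is labeled 1 and non-target 0, the rates defined above can be computed as, for example:

```python
def error_rates(y_true, y_pred):
    """Sample error, false positive and false negative rates.

    y_true and y_pred are parallel lists of 0/1 labels, where 1
    denotes the target (spam) class.
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    pos = sum(y_true)                # number of target examples
    neg = len(y_true) - pos          # number of non-target examples
    return {
        "sample_error": (fp + fn) / len(y_true),
        "fp_rate": fp / neg if neg else 0.0,
        "fn_rate": fn / pos if pos else 0.0,
    }
```

Note that the sample error is measured over the training distribution; as the text warns, it only approximates the true error when that distribution resembles future data.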
Year: 2017
Description
Online product reviews have become an important source of user opinions.
Due to profit or fame, imposters have been writing deceptive or fake reviews to
promote and/or to demote some target products or services. Such imposters are called
review spammers.
Reviewers and reviews appearing in a burst are often related, in the sense that spammers tend to work with other spammers and genuine reviewers tend to appear together with other genuine reviewers.
A novel evaluation method evaluates the detected spammers automatically using supervised classification of their reviews.
Advantages
To identify spam and spammers as well as different type of analysis on this topic.
Using a strong prior such as RAVP and a local observation OSI will help the belief
propagation to converge to a more accurate solution in less time.
Disadvantages
It is hard for anyone to know a large number of signals without extensive experience
in opinion spam detection.
It is more difficult for people to make well-informed buying decisions without being
deceived by fake reviews.
Technique
Kernel Density Estimation (KDE) technique
Algorithm
Loopy Belief Propagation algorithm
Title 4: Spreading Processes in Multilayer Networks
Author: Mostafa Salehi, Rajesh Sharma, Moreno Marzolla, Matteo Magnani, Payam Siyari, and Danilo Montesi
Year:2015
Description
Several systems can be modeled as sets of interconnected networks or networks with
multiple types of connections, here generally called multilayer networks.
The study of spreading processes in multilayer networks is an active and not yet consolidated research field and offers many unsolved problems to address.
Collecting real datasets related to a multilayer network is nontrivial; this issue is even more challenging when one tries to gather data on both the spreading process and the structure of the underlying multilayer network.
Network sampling strategies can be used to address this issue by decreasing the
expense of processing large real networks.
Advantages
It is convenient to 'flatten' adjacency tensors into matrices, called 'supra-adjacency matrices', for computations.
Disadvantages
When both layers have the same average degree, the epidemic threshold increases for larger differences between intra- and interlayer infection rates, as it becomes more difficult to spread to other layers.
Technique
Generating function technique
Outbreak detection technique
Algorithm
Epidemic routing algorithm
Title 5 : Trust-Aware Review Spam Detection
Author: Hao Xue, Fengjun Li, Hyunjin Seo and Roseann Pluretti
Year: 2015
Description
Online review systems play an important role in affecting consumers’ behaviors and
decision making, attracting many spammers to insert fake reviews to manipulate
review content and ratings.
To increase utility and improve user experience, some online review systems allow
users to form social relationships between each other and encourage their interactions.
A trust-based prediction achieves a higher accuracy than standard CF method.
There exists a strong correlation between social relationships and the overall
trustworthiness scores.
Advantages
The crucial goal of opinion spam detection in the review framework is to identify
every fake review and fake reviewer.
For fast and effective manipulation, spammers may control a large number of
accounts or work in groups to insert bogus reviews in a short period of time.
Disadvantages
It is difficult for the CF model to achieve the expected accuracy.
The application of text classification in semantic extraction and feature selection is
limited because of the low training speed.
Algorithm
Collaborative filtering algorithm
Title 7 : Detecting Product Review Spammers using Rating Behaviors
Author: Ee-Peng Lim, Viet-An Nguyen, Nitin Jindal, Bing Liu and Hady W. Lauw
Year: 2010
Description
It detects users generating spam reviews, or review spammers; it identifies several characteristic behaviors of review spammers and models these behaviors so as to detect the spammers.
Spammers may target specific products or product groups in order to maximize their
impact.
Detecting review spam is a challenging task as no one knows exactly the amount of
spam in existence.
The state-of-the-art approach to review spam detection is to treat the reviews as the
target of detection.
Advantages
It focuses on review-centric spam identification, which provides greater focus on feedback content.
Review spam is harder to detect.
Disadvantages
Spam reviews concentrate on the information that is provided on the product page, and they are more difficult to read than truthful reviews.
Technique
Classification technique
CHAPTER 3
PROBLEM FORMULATION
Consider the logical view of the network model. We assume that messages originating from machines inside the network will pass through the deployed spam zombie detection system. This assumption can be satisfied in a few different scenarios. First, in order to alleviate the ever-increasing
spam volume on the Internet, many ISPs and networks have adopted the policy that all the
outgoing messages originated from the network must be relayed by a few designated mail
servers in the network. Outgoing email traffic (with destination port number of 25) from all
other machines in the network is blocked by edge routers of the network. In this situation, the
detection system can be co-located with the designated mail servers in order to examine the
outgoing messages. Second, in a network where the aforementioned blocking policy is not
adopted, the outgoing email traffic can be replicated and redirected to the spam zombie
detection system. We note that the detection system does not need to be on the regular email
traffic forwarding path; the system only needs a replicated stream of the outgoing email
traffic. Moreover, as we will show in Section 6, the proposed SPOT system works well even
if it cannot observe all outgoing messages. SPOT only requires a reasonably sufficient view
of the outgoing messages originated from the network in which it is deployed.
A machine in the network is assumed to be either compromised or normal (that is, not compromised). In this thesis we focus only on the compromised machines that are involved in spamming; therefore, we use the term compromised machine to denote a spam zombie, and use the two terms interchangeably. Let Xi for i = 1, 2, ..., n denote the successive observations of a random variable X corresponding to the sequence of messages originating from the machine.
As a simple and powerful statistical tool, SPRT has a number of compelling and desirable
features that lead to the wide-spread applications of the technique in many areas. First, both
the actual false positive and false negative probabilities of SPRT can be bounded by the user-
specified error rates. This means that users of SPRT can pre-specify the desired error rates. A
smaller error rate tends to require a larger number of observations before SPRT terminates.
Thus users can balance the performance and cost of an SPRT test. Second, it has been proved that SPRT minimizes the average number of required observations for reaching a decision for a given error rate, among all sequential and non-sequential statistical tests. This means
that SPRT can quickly reach a conclusion to reduce the cost of the corresponding experiment,
without incurring a higher error rate. In the following we present the formal definition and a
number of important properties of SPRT.
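A minimal sketch of SPRT for a Bernoulli observation stream, as used to decide between a 'normal' hypothesis H0 and a 'compromised' hypothesis H1; the parameter values are illustrative assumptions:

```python
import math

def sprt(observations, theta0, theta1, alpha=0.01, beta=0.01):
    """Sequential Probability Ratio Test on a stream of 0/1 observations.

    H0: P(X = 1) = theta0 (machine is normal)
    H1: P(X = 1) = theta1 (machine is compromised), theta1 > theta0
    alpha / beta are the user-specified false positive / false
    negative rates that bound the actual error probabilities.
    """
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    llr = 0.0                              # log-likelihood ratio so far
    for n, x in enumerate(observations, start=1):
        if x:
            llr += math.log(theta1 / theta0)
        else:
            llr += math.log((1 - theta1) / (1 - theta0))
        if llr >= upper:
            return "H1", n                 # flagged as compromised
        if llr <= lower:
            return "H0", n                 # declared normal
    return "undecided", len(observations)
```

A stream of mostly spam observations drives the ratio to the upper boundary in only a few steps, which reflects the minimal-observation property described above.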
An ANN (Artificial Neural Network) is based on a collection of connected units or nodes called artificial neurons which
loosely model the neurons in a biological brain. Each connection, like the synapses in a
biological brain, can transmit a signal from one artificial neuron to another. An artificial
neuron that receives a signal can process it and then signal additional artificial neurons
connected to it.
With the explosive growth of information on the web, search engines have become an important tool to help people find their desired information in daily life. Given a
certain query, search engines can generally return thousands of pages, but most users read
only the first few ones. Therefore, the page ranking is highly important in search engines. So
many people employ some means to deceive the ranking algorithm of search engines to
enable some web pages to achieve undeserved high ranking values, which can attract the
attention of users and help obtain some benefits. All the deceptive actions that try to increase
the ranking of a page in search engines are generally referred to as Web spam. Web spam seriously deteriorates search engine ranking results, creates great obstacles in users' information acquisition, and produces a poor user experience. From the point of view of a search engine, even if spam pages are not ranked sufficiently high to annoy users, there is still a cost to crawl, index and store spam pages. Detecting web spam has become one of the top challenges in the research of web search engines. According to the characteristics of the Web Spam dataset, this thesis focuses on constructing classifiers based on the features of web pages in order to improve Web Spam detection performance. The work contains the following three parts: (1) a method to learn a discriminating function that detects Web Spam by Genetic Programming, where an individual is defined as a discriminating function to detect Web Spam.
It also studies the effect of the depth of the binary trees representing the individuals in the Genetic Programming evolution process and the efficiency of the combination. We perform experiments on WEBSPAM-UK2006. The experimental results show that: (1) the multi-population Genetic Programming with two combinations can improve spam classification recall by 5.6%, F-measure by 2.25% and accuracy by 2.83% compared with single-population Genetic Programming; (2) the approach can improve spam classification recall by 26%, F-measure by 11% and accuracy by 4% compared with SVM. (2) A method to detect web spam by an ensemble learning algorithm based on Genetic Programming. At present,
most Web Spam detection methods based on classification only employ one classification
algorithm to create base classifiers, and ignore the imbalance between spam and normal
samples, i.e. normal samples are much more than spam ones. Since there are many types of
Web Spam techniques and new types of spam are being developed continually, it is
impossible to expect that we are able to find an omnipotent classifier to detect any kinds of
Web Spam. Integrating the detection results of multi-classifiers is a way to find an enhanced
classifier for Web Spam detection, and ensemble learning is also one of effective methods for
the classification problem on the imbalanced dataset.
Two key issues in ensemble learning are how to generate diverse base classifiers and
how to integrate their results. This paper proposes to detect Web Spam by ensemble learning
algorithm based on Genetic Programming. This new method first generates multiple diverse
base classifiers, which use different classification algorithms and are trained on different
instances and features. Then Genetic Programming is utilized to learn a novel classifier,
which gives the final detection result based on the detection results of base classifiers. This
method generates diverse base classifiers with different data sets and classification algorithms
according to the characteristics of the Web Spam dataset. Ensembling the results of base classifiers by Genetic Programming not only makes it easy to integrate the classification results of heterogeneous base classifiers to improve classification performance, but also allows selecting a subset of base classifiers for integration to reduce prediction time. This approach also
combines the under-sampling technique with ensemble learning to improve the
classification performance on imbalanced datasets. In order to verify the effectiveness of the
Genetic Programming-based ensemble learning, we perform experiments on balanced and
imbalanced data sets respectively. The experiments on the balanced dataset first analyze the
effect of classification algorithms and feature sets on the ensemble. Then the experimental
results are compared with those of some known ensemble learning algorithms and the results
show that the new approach performs better than some known ensemble learning algorithms
in terms of precision, recall, F-measure, accuracy, Error Rate and AUC.
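The two-stage scheme described above can be sketched generically as follows; the majority-vote combiner stands in for the Genetic-Programming-evolved function and is an assumption for illustration:

```python
def ensemble_predict(base_classifiers, combiner, sample):
    """Stage 1: each base classifier labels the sample (1 = spam).
    Stage 2: a learned combiner merges the individual decisions."""
    votes = [clf(sample) for clf in base_classifiers]
    return combiner(votes)

def majority(votes):
    # Simple stand-in combiner: label spam when most classifiers agree.
    return 1 if sum(votes) > len(votes) / 2 else 0
```

In the thesis setting, the combiner itself is evolved by Genetic Programming over the base classifiers' outputs rather than fixed to a majority vote.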
The experiments on the imbalanced dataset show that this method can improve the
classification performance whether the base classifiers belong to the same type or not, and in
most cases the heterogeneous classifier ensembles work better than the homogeneous ones.
The F-measure of this new ensemble method is higher than those of AdaBoost, Bagging, Random Forest, Vote, the EDKC algorithm and the method based on Prediction Spamicity. (3) A method to generate new features by Genetic Programming to detect Web Spam. For classification problems, features play an important role. The publicly available WEBSPAM-UK2006 dataset contains 96 content-based features, 41 link-based features and transformed link-based features.
Rule-based filtering requires building the desired set of rules, and not all users can build such a set. In addition, it is a time-consuming process, since the generated set of rules should be changed or refined periodically as the nature of spam changes too. Because of the problems associated with the manual construction of rules, another approach was proposed to automatically adapt to the changing nature of spam over time and to provide a system that can learn directly from data already stored in the web server databases.
One family of techniques is term spamming, in which terms are included in the document body; an example is to include specific terms such as "Free grant money", "free installation", "Promise you ...!", "free preview", etc. Another way of grouping term spamming techniques is based on the type of terms that are added to the text fields: repeating one or a few specific terms, including a large number of unrelated terms, or phrase stitching, wherein sentences or phrases, possibly from different sources, are glued together.
CHAPTER 4
SPAM DETECTION MESSAGE ANALYSIS
Most commonly, spam is sent through email by writing text, adding unsolicited attachments or embedding links to propaganda and malware. The unsolicited electronic content is often sent in bulk to multiple recipients. There are different techniques involved
in email-spamming such as image attachment, blank email, backscattering, etc. A good
number of methodologies have also been established to classify the email-spam. The other
type of store and forward spam is called Short Message Service (SMS)-spam that refers to the
propagation of unsolicited texts usually containing advertisements through short message
service.
The distinguishing characteristics of these spams include their small size and frequent use of non-dictionary words. SMS-spam is not as widespread as its email counterpart. Strict rules, restrictions, and monetary charges have been imposed worldwide on user connections to limit such spam propagation. Both of these store-and-forward spams travel through intermediate servers that also keep a copy of the messages. Consequently, these messages (including spam) can be delivered to active, as well as currently inactive, users after their next activation, and this type of spam can be classified through reputation-based and content-based techniques applied to the stores before or after delivery. In contrast, instant messaging spam is generated through ubiquitous spamming techniques with high potential danger. SPIM can be delivered only to a registered 'online' recipient through instant messaging applications.
The spam message comes through a chat window and therefore bypasses almost all
security settings through a pre-installed messaging application in a user device. The window
pops up advertisements, links to viruses and spyware, etc. It is also able to deliver applications, such as Trojan horses, that are capable of installing themselves inside the user device. Recently, security experts in governments, corporations and ISPs have been continuously warning against SPIM because of its highly intrusive character, owing to its design to bypass local security settings. SPIT refers to similar unsolicited 'spam' calls using Voice
over Internet Protocol (VoIP). Spammers use automated calling application (bots) for the
purpose of telemarketing, prank calls and other abuses. As VoIP is continuously replacing conventional telephony with low-cost communication, spammers are increasingly targeting this platform to reach out to large groups of callers.
SPIT is delivered by exploiting pitfalls in the underlying protocol, namely the Session Initiation Protocol (SIP). However, SPIT detection is not easy, due to the real-time nature of the communication and the associated legal challenges concerning call privacy.
The impact of spamming is threefold. First, it affects the privacy and security of the
spam recipients. Secondly, it creates vulnerabilities in the whole network that is hosting these
user devices. Thirdly, it has a larger effect on the corporate resources and infrastructure since
a significant amount of corporate resource gets wasted to serve these unsolicited messages.
More recently, spam sending bots are seen in attempting social engineering, gathering
intelligence, mounting phishing attacks, spreading malware and thereby threatening the
usability and security of collaborative communication platforms. In the 3rd Generation Partnership Project (3GPP), the technical specification group quantified that approximately 250 GB of SPIT traffic per month can be generated from only one SPIT bot. In the absence of effective filtering policies, SPIM is also seen to generate significant potential revenue. Steve Roche reported in his book that roughly 5% of overall IM traffic is SPIM; of these messages, 70% carry links to pornographic websites, 12% contain "get rich" schemes and 9% promote product sales. Therefore, high academic and industrial interest is required to bridge this gap.
1) Account authentication
2) Sending mails
3) SPOT detection (capture IP)
4) CT detection
5) PT detection
1. Account authentication
2. Sending mails
In this module, a single user can send one or more mails to other users. Spam means that multiple copies of a single message are sent.
3. SPOT detection
4. CT detection
For each machine, the mail counts are compared against threshold values. Two parameters are used: 1) Cs specifies the minimum number of mails that a machine must send; 2) P specifies the maximum spam mail percentage of a normal machine. This algorithm computes the count of total mails and the count of spam mails of a machine, and checks whether the count of total mails is greater than or equal to Cs and the spam mail percentage is greater than or equal to P.
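The count-threshold check can be sketched as follows; the default threshold values are illustrative assumptions, not the thesis's tuned parameters:

```python
def ct_detect(total_mails, spam_mails, cs=10, p=0.5):
    """Count-threshold (CT) detection: flag a machine as a spam zombie
    once it has sent at least cs messages and its spam fraction is at
    least p (the maximum spam percentage of a normal machine)."""
    if total_mails < cs:
        return False                      # too few observations so far
    return (spam_mails / total_mails) >= p
```

Requiring at least cs messages before deciding prevents a machine from being flagged on the basis of one or two early spam observations.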
Parser - A parser which serves to import the email data. This tool is responsible for taking
any email data format and importing it into the EMT database.
Database - An underlying database to store the email messages. It describes the schema and
rationale in detail. This component is where the actual email data resides for analysis by the
models.
GUI - A front-end Graphical User Interface (GUI), which allows the data and models to be
manipulated. It also allows the user to test a range of parameters for each model in the offline
system, so they can accurately judge what are the ideal parameters for a specific set of email
data.
Figure 4.2: The EMT architecture, composed of an email parser, a database back-end, a set of models, and a GUI front-end.
Message Window
The Message window offers a particular view of the data in the database. A view is
composed of a set of constraints defined over the data. For example it can be all messages
associated with a particular folder or user.
The following features can all be used alone or in combination to constrain the data
view.
2. User, Direction - We can choose a specific user to view all their email, and also define
which direction (inbound, outbound, or both) we would like to view.
4.3.1 SETUP
Each of the component classifiers presented in this thesis and embedded in EMT
produces a classification output as a score in a fixed range, with a high number indicating confidence in the prediction that an email is unwanted or spam. We refer to these outputs as the 'raw scores', which we combine through the various correlation functions. The training
regime requires some explanation. A set of emails are first marked and labeled by the user
indicating whether they are spam, or normal. This information can also be gleaned by
observing user behavior (whether they delete a message prior to opening it, or move it to a
"garbage" or "spam" folder). Although sometime the entire message will be contained in the
subject line (example, Meeting canceled!), most users will on average also click to make sure
if there is further details in the message body. For our experimental results, users provided
their email files with those messages considered spam placed in a special folder. Those we
labeled as spam, while all other messages we labeled as normal. These were all messages
received, deleted emails were moved to a deleted folder, but not actually deleted.
This data set of real emails was also used to study the model combination methods.
Our data set consists of emails collected from five users at Columbia University spanning
from 1997 to 2005, a user with a Hotmail account, and a user with a Verizon.net email
account. In total we collected 320,000 emails taking up about 2.5 gigabytes of space. Users
indicated which emails were spam by moving them to specific folders.
Table 4.1: Emails per calendar year for the main data set in the experiments.
Because current spam levels on the Internet are estimated at 60%, we sampled the set
of emails so that we would have a 60% ratio of spam to normal over all our emails. We were
left with a corpus of 278,274 emails time-ordered as received by each user.
We tested the models using the 80/20 rule with 80% being the ratio of training to
testing. Hence, the first 80% of the ordered email are used to train the component classifiers
and the correlation functions, while the following 20% serve as the test data used to plot our
results. This set up mimics how such an automatic classification system would be used in
practice. As time marches on, newly received emails become training data used to update the
classifiers applied to new incoming data; those new data in turn serve as training for another
round of learning. Earlier tests used 5-fold cross-validation with no statistically significant
difference from the single-fold results, so we opted to keep it simple with the single-fold
tests.
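The time-ordered 80/20 split described above can be sketched as follows; the email records here are hypothetical stand-ins for the real trace.

```python
# Time-ordered 80/20 train/test split: no shuffling, mimicking how the
# classifier would see data in practice (train on the past, test on the
# future).
def split_80_20(emails):
    emails = sorted(emails, key=lambda e: e["time"])  # order by arrival time
    cut = int(len(emails) * 0.8)
    return emails[:cut], emails[cut:]  # first 80% train, last 20% test

# Hypothetical labeled emails: (arrival time, spam/normal label).
emails = [{"time": t, "label": "spam" if t % 3 == 0 else "normal"}
          for t in range(10)]
train, test = split_80_20(emails)
print(len(train), len(test))  # 8 2
```

Note that every test email arrives strictly after every training email, which is the property that makes this setup realistic.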
The data used was pristine and unaltered. No preprocessing was done to the bodies of
the emails, with the exception that all text was evaluated in lower case. Headers of the emails
were ignored, except for subject lines, which are used in some of the non-content-based
classifiers. While adding header data would have improved individual classification, there is
much variability in what appears in the header, and we felt it might over-train and learn some
subtle features of tokens only present in the header data of the Columbia data set.
For some of the individual classifiers (Ngram, TF-IDF, PGram, and Text Classifier), we
truncated the email parts so that we used only the first 800 bytes of each part of the email,
including attachments. This was done for efficiency and computational reasons, as there
were many large executable attachments in our dataset. In addition, detection increased by
about 10% at the same false positive rates compared with using full email bodies, because of
the noise introduced by the number of tokens seen in very large spam messages.
Table 4.2: Emails used in the spam framework data set analysis
System testing is the stage of implementation aimed at ensuring that the system
works accurately and efficiently before live operation commences. Testing is the process
of executing a program with the intent of finding an error. A good test case is one that has a
high probability of finding an error; a successful test is one that uncovers an as-yet-undiscovered
error.
Testing is vital to the success of the system. System testing makes the logical
assumption that if all parts of the system are correct, the goal will be successfully
achieved. The candidate system is subjected to a variety of tests: on-line response, volume,
stress, recovery, security, and usability tests. A series of tests is performed before the system
is ready for user acceptance testing. Any engineered product can be tested in one of the
following ways. Knowing the specified functions that a product has been designed to
perform, tests can be conducted to demonstrate that each function is fully operational.
Knowing the internal workings of a product, tests can be conducted to ensure that "all gears
mesh," that is, the internal operation of the product performs according to the specification
and all internal components have been adequately exercised.
When we first undertook the project, we also considered two alternative designs for detecting
spam zombies, one based on the number of spam messages and the other based on the
percentage of spam messages sent from a machine. For simplicity, we refer to them as the
count-threshold (CT) detection algorithm and the percentage-threshold (PT) detection
algorithm, respectively.
In CT, time is partitioned into windows of fixed length T. A user-defined threshold
parameter C specifies the maximum number of spam messages that may originate from a
normal machine in any time window. The system monitors the number of spam messages n
originating from a machine in each window. If n > C, the algorithm declares that the
machine has been compromised.
PT works in a similar fashion, except that it operates on the spam percentage. Formally, let
N and n denote the total messages and spam messages originating from a machine m within a
window T; then PT declares machine m as being compromised if n/N > P, where P is the
user-defined maximum spam percentage of a normal machine.
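A minimal sketch of the two threshold checks, assuming the per-window counts have already been aggregated (the real algorithms also maintain the window bookkeeping):

```python
# Sketches of the two alternative detectors. Counts are assumed to be
# aggregated per machine per fixed-length window T.

def ct_compromised(spam_count, C):
    """Count-threshold (CT): flag the machine if more than C spam
    messages originate from it in one time window."""
    return spam_count > C

def pt_compromised(spam_count, total_count, P):
    """Percentage-threshold (PT): flag the machine if its spam
    fraction n/N in the window exceeds P."""
    return total_count > 0 and spam_count / total_count > P

print(ct_compromised(31, 30))      # True: 31 > C = 30
print(pt_compromised(6, 10, 0.5))  # True: 6/10 > P = 0.5
print(pt_compromised(4, 10, 0.5))  # False: 4/10 <= 0.5
```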
In the following we briefly compare them with the SPOT system. The three algorithms
have similar time and space complexities: they all need to maintain a record for each
observed machine and update the record as messages arrive from that machine. However,
unlike SPOT, which provides a bounded false positive rate and false negative rate, and hence
a measure of confidence in how well it works, the error rates of CT and PT cannot be
specified a priori. SPOT requires four user-defined parameters: α, β, θ1, and θ0. As discussed
in the previous sections, selecting values for these four parameters is relatively
straightforward. In contrast, selecting the "right" values for the parameters of CT and PT is
much more challenging and tricky. It requires a thorough understanding of the different
behaviors of compromised and normal machines in the concerned network, and training
based on the history of the two behaviors, in order for the algorithms to work reasonably well
in the network. Our preliminary studies of the two alternative designs confirm that, unlike
SPOT, the performance of the two alternative algorithms is sensitive to the parameters used,
and they may have either higher false positive or higher false negative rates.
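For concreteness, the SPRT core that SPOT builds on can be sketched as follows. This is a minimal illustration, not the actual SPOT implementation (which also re-sets records and handles other details); the parameter values are the ones discussed in the text.

```python
import math

# Sequential Probability Ratio Test with SPOT's four user-defined
# parameters: error bounds alpha, beta and the per-message spam
# probabilities theta0 (normal machine) and theta1 (compromised machine).
def sprt(observations, alpha=0.01, beta=0.01, theta0=0.2, theta1=0.9):
    """Classify a stream of per-message spam indicators
    (1 = spam, 0 = non-spam) for one machine."""
    A = math.log(beta / (1 - alpha))      # boundary: accept 'normal'
    B = math.log((1 - beta) / alpha)      # boundary: accept 'compromised'
    llr = 0.0                             # log-likelihood ratio so far
    for x in observations:
        if x:
            llr += math.log(theta1 / theta0)            # spam observed
        else:
            llr += math.log((1 - theta1) / (1 - theta0))  # non-spam observed
        if llr >= B:
            return "compromised"
        if llr <= A:
            return "normal"
    return "pending"  # not enough evidence yet; keep monitoring

print(sprt([1, 1, 1, 1]))  # 'compromised' after a few spam messages
print(sprt([0, 0, 0]))     # 'normal'
```

Because each spam observation moves the log-likelihood ratio a fixed step toward the upper boundary, only a handful of observations is needed to cross it, which is why SPOT reaches decisions quickly.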
In the above discussion of the SPOT algorithm we have, for simplicity, ignored the
potential impact of dynamic IP addresses and assumed that an observed IP address
corresponds to a unique machine. This need not be the case for the algorithm to work
correctly; SPOT can work extremely well even in environments with dynamic IP addresses.
To understand why, note that SPOT can reach a decision with a small number of
observations, as reflected in the average number of observations required for SPRT to
terminate. In practice, we have noted that 3 or 4 observations are sufficient for SPRT to reach
a decision in the vast majority of cases. If a machine is compromised, it is likely that more
than 3 or 4 spam messages will be sent before the (unwitting) user shuts down the machine.
Therefore, dynamic IP addresses will not have any significant impact on SPOT.
By contrast, CT and PT need to deal with dynamic IP addresses very carefully. As
introduced above, both CT and PT need two parameters: for CT, a time window and a
maximum number of spam messages; for PT, a time window and a maximum percentage of
spam messages. Consider the time window first. Ideally, the length of the time window
would equal the duration of one machine's lifetime, but it is impossible for a fixed-length
time window to fit the different lifetimes of different machines at the same time. If the time
window is shorter than one machine's lifetime, that machine's lifetime will be split across
multiple windows.
There are then two cases. In the first case, the last window is occupied only by the
final part of the machine's lifetime, which gives CT or PT a chance to count correctly. In
the second case, the last window may be shared by this machine and one or more other
machines, which leads to an incorrect result. If the time window is longer than one machine's
lifetime, the situation is similar to the second case above, so an incorrect result is again
possible. For CT, we also need to set the maximum number of spam messages C, the
counting threshold: if CT counts more than C spam messages in a time window, it declares a
zombie. But if more than one machine shares a time window, CT might mistakenly count
spam messages from different machines together. The same mistake can happen when PT
counts messages. Another factor affecting the performance of CT and PT is that, when they
group messages into fixed-length time windows, they might not see enough spam messages
in each interval even when the total number of spam messages is clearly large enough.
CHAPTER 5
EMAIL TRACE AND METHODOLOGY
The mail relay server ran Spam to detect spam messages. The email trace contains the
following information for each incoming message: the local arrival time, the IP address of the
sending machine, and whether or not the message is spam. In addition, if a message has a
known virus/worm attachment, it was so indicated in the trace by anti-virus software. The
anti-virus software and Spam were two independent components deployed on the mail relay
server. Due to privacy issues, we do not have access to the content of the messages in the
trace.
Ideally, Spam should have collected all the outgoing messages in order to evaluate the
performance of SPOT. However, due to logistical constraints, we were not able to collect all
such messages. Instead, we identified the messages in the email trace that were forwarded or
originated by the SPAM internal machines, that is, the messages forwarded or originated by a
SPAM internal machine and destined to a SPAM account. We refer to this set of messages as
the SPAM emails and perform our evaluation of SPOT based on them. We note that the set of
SPAM emails does not contain all the outgoing messages originated from inside SPAM.
IP addresses. First, only 9.9% of the total IP addresses sending viruses are non-spam
IP addresses, i.e., addresses that never send spam messages. A similar percentage (9.3%) is
observed for SPAM IP addresses. Second, the largest fraction of the total IP addresses are
spam-only IP addresses, which indicates a correlation between virus-sending and spam-
sending IP addresses. Third, this trend does not hold when we analyze the SPAM IP
addresses: the largest fraction of SPAM IP addresses are mixed IP addresses. The reason is
that there are many mail relay servers in the SPAM network; they merge messages from
clients and hide the real senders, and we verified that most of the mixed IP addresses in
SPAM are these servers. In addition, we can observe that only a small portion of the IP
addresses that send spam messages also send viruses. Of the total IP addresses, only 4.2%
(10,385 out of 2,461,114) send viruses. By contrast, for the IP addresses in SPAM, the
percentage sending viruses, 46.4% (204 out of 440), is much higher than that of the total IP
addresses.
Step 2: Find the email header.
The header contains information about the routing of the email and the IP address. Most
email programs, such as Outlook, Hotmail, Google Mail (Gmail), Yahoo, and AOL, hide the
header information because they consider it non-essential. If you know how to open the
header, you can still find this data.
On Outlook, go to your inbox and highlight your email using your cursor, but do not
open it into its own window. If you are using a mouse, right click the message. If you
are using a Mac Operating System (OS) without a mouse, click while holding down
the "control" button. Select "Message Options" when the menu appears. Find the
headers at the bottom of the window that will appear.
On Hotmail, click on the drop down menu next to the word "Reply." Select "View
Message Source." A window will pop up with the address information.
On Gmail, click on the drop down menu next to the word "Reply" in the upper right
hand corner of your message. Select "Show Original." A window with the IP
information will pop up.
On Yahoo, right click or press "control" and click when you are on the message.
Choose "View Full Headers."
On AOL, click "Action" on your message, and then select "View Message Source."
Step 3: Identify the IP address in the information you have just uncovered. Following any of
these methods for your chosen email provider, you will see a window pop up with a lot of
code information. You will not need all of it. If the window is too small to pick out the IP
address easily, copy the information and paste it into a word processing document.
Step 4: Look for the words "X-Originating-IP." This is the easiest way to spot the IP
address; however, it may not be listed under that name in all email programs. If you cannot
find this term, look for the word "Received" and follow the line until you see a numerical
address.
Use the "Find" function on your computer to spot these terms easily. Press
"Command" and the letter "F" on Mac OS. In Internet Explorer, click the "Edit" menu,
select "Find on this Page," then type the word into the box that appears and press
"Enter."
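Steps 2-4 can also be automated with a short script; the sketch below prefers the X-Originating-IP header and falls back to the first IPv4 address in a Received header. The raw message is a made-up example.

```python
import re
from email import message_from_string

# Made-up raw message for illustration (addresses are documentation IPs).
raw = """\
Received: from mail.example.org ([203.0.113.7]) by mx.example.com;
X-Originating-IP: [198.51.100.23]
Subject: hello

body
"""

IPV4 = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")

def originating_ip(raw_message):
    """Mimic Steps 2-4: look for X-Originating-IP first, then fall back
    to the first numerical address in the Received headers."""
    msg = message_from_string(raw_message)
    xoip = msg.get("X-Originating-IP")
    if xoip:
        m = IPV4.search(xoip)
        if m:
            return m.group()
    for received in msg.get_all("Received", []):
        m = IPV4.search(received)
        if m:
            return m.group()
    return None

print(originating_ip(raw))  # 198.51.100.23
```

A production version would validate each octet and prefer the earliest trustworthy Received hop, but the search order shown matches the manual procedure above.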
Step 5: Sender. You can narrow the search to specific senders by clicking the Add
sender button next to the Sender field. In the subsequent dialog box, select one or more
senders from your company from the user picker list and then click Add. To add
senders who aren't on the list, type their email addresses and click Check names. In
this box, wildcards are supported for email addresses in the format *@contoso.com.
When specifying a wildcard, other addresses can't be used. When you're done with
your selections, click OK.
Step 6: Recipient. You can narrow the search to specific recipients by clicking the
Add recipient button next to the Recipient field. In the subsequent dialog box, select
one or more recipients from your company from the user picker list and then click
Add. To add recipients who aren't on the list, type their email addresses and click
Check names. In this box, wildcards are supported for email addresses in the format
*@contoso.com. When specifying a wildcard, other addresses can't be used. When
you're done with your selections, click OK.
CHAPTER 6
EXPERIMENTAL RESULT
6.1 MATLAB
MATLAB's standard data type is the matrix: all data are considered to be matrices of some
sort. Images, of course, are matrices whose elements are the grey values (or possibly the
RGB values) of their pixels. Single values are considered by MATLAB to be 1×1 matrices,
while a string is merely a 1×n matrix of characters, n being the string's length. In this chapter
we will look at the more generic MATLAB commands, and discuss images in further
chapters. When you start up MATLAB, you have a blank window called the Command
Window, in which you enter commands. Given the vast number of MATLAB functions and
the different parameters they can take, a command-line style interface is in fact much more
efficient than a complex sequence of pull-down menus.
You can use MATLAB in a wide range of applications, including signal and image
processing, communications, control design, test and measurement, and financial modeling
and analysis. Add-on toolboxes (collections of special-purpose MATLAB functions) extend
the MATLAB environment to solve particular classes of problems in these application areas.
MATLAB provides a number of features for documenting and sharing your work.
You can integrate your MATLAB code with other languages and applications, and distribute
your MATLAB algorithms and applications. When working with images in MATLAB, there
are many things to keep in mind, such as loading an image, using the right format, saving the
data as different data types, displaying an image, and converting between different image
formats.
Run-time errors - Run-time errors are usually apparent but difficult to track down.
We can debug the M-file using the Editor/Debugger as well as using debugging functions
from the Command Window. The debugging process consists of
Setting breakpoints
Set breakpoints to pause execution of the function, so we can examine where the problem
might be. There are three basic types of breakpoints:
JAVA PLATFORM:
The Java Virtual Machine is the base for the Java platform and is ported onto various
hardware-based platforms.
The API is a large collection of ready-made software components that provide many
useful capabilities, such as graphical user interface (GUI) widgets. It is grouped into libraries
of related classes and interfaces; these libraries are known as packages.
Development Tools:
The development tools provide everything you’ll need for compiling, running,
monitoring, debugging, and documenting your applications. As a new developer, the main
tools you’ll be using are the Java compiler (javac), the Java launcher (java), and the Java
documentation (javadoc).
The API provides the core functionality of the Java programming language. It offers a
wide array of useful classes ready for use in your own applications. It spans everything from
basic objects, to networking and security.
Deployment Technologies:
The JDK provides standard mechanisms such as Java Web Start and Java Plug-In, for
deploying your applications to end users.
The Swing and Java 2D toolkits make it possible to create sophisticated Graphical
User Interfaces (GUIs).
Drag-and-drop support:
Swing defines an abstract LookAndFeel class that represents all the information central
to a look-and-feel implementation, such as its name, its description, whether it's a native
look-and-feel, and, in particular, a hash table (known as the "Defaults Table") for storing
default values for various look-and-feel attributes, such as colors and fonts.
Each look-and-feel implementation defines a subclass of LookAndFeel (for example,
swing.plaf.motif.MotifLookAndFeel) to provide Swing with the necessary information to
manage the look-and-feel.
The UIManager is the API through which components and programs access look-and-
feel information (they should rarely, if ever, talk directly to a LookAndFeel instance).
UIManager is responsible for keeping track of which LookAndFeel classes are available,
which are installed, and which is currently the default. The UIManager also manages access
to the Defaults Table for the current look-and-feel.
6.3 PERFORMANCE EVALUATION
We evaluate the performance of SPOT based on the collected SPAM emails. In all the
studies, we set α = 0.01, β = 0.01, θ1 = 0.9, and θ0 = 0.2; that is, the deployed spam filter has
a 90% detection rate and a 20% false positive rate. (Many widely deployed spam filters have
much better performance than what we assume here.) SPOT identified 132 IP addresses as
being associated with compromised machines. In order to understand the performance of
SPOT in terms of the false positive and false negative rates, we rely on a number of ways to
verify whether a machine is indeed compromised. First, we check whether any message sent
from an IP address carries a known virus/worm attachment; if so, we say we have a
confirmation. Out of the 132 IP addresses identified by SPOT, we can confirm 110 to be
compromised in this way. For the remaining 22 IP addresses, we manually examine the
spam-sending patterns from the IP addresses and the domain names of the corresponding
machines. If the fraction of spam messages from an IP address is high (greater than 98%), we
also claim that the corresponding machine has been confirmed to be compromised. We can
confirm 16 of them to be compromised in this way. We note that the majority (62.5%) of the
IP addresses confirmed by the spam percentage are dynamic IP addresses, which further
indicates the likelihood of the machines being compromised.
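A quick arithmetic check of the confirmation numbers above (treating, in the worst case, all unconfirmed addresses as false positives):

```python
# Confirmation bookkeeping for the 132 IP addresses SPOT flagged.
flagged = 132                 # IP addresses identified by SPOT
virus_confirmed = 110         # confirmed via known virus/worm attachments
percent_confirmed = 16        # confirmed via >98% spam fraction

unconfirmed = flagged - virus_confirmed - percent_confirmed
worst_case_fp_rate = round(unconfirmed / flagged * 100, 1)

print(unconfirmed)            # 6
print(worst_case_fp_rate)     # 4.5 (percent), if all 6 were false positives
```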
For the remaining 6 IP addresses that we cannot confirm by either of the above
means, we have also manually examined their sending patterns. We note that they have a
relatively low overall percentage of spam messages over the two-month collection period;
however, they sent substantially more spam messages towards the end of the collection
period. This suggests that they may have been compromised towards the end of our
collection period, but we cannot independently confirm whether this is the case.
Evaluating the false negative rate of SPOT is a bit tricky, given that SPOT focuses
on the machines that are potentially compromised rather than the machines that are normal
(see Chapter 5). In order to gain some intuitive understanding of the false negative rate of the
SPOT system, we consider the machines that SPOT does not identify as being compromised
at the end of the email collection period, but for which SPOT has re-set the records (lines 15
to 18 in Algorithm 1). That is, such machines have been claimed as being normal by SPOT
(but have continuously been monitored). We also obtain the list of IP addresses that have sent
at least one message with a virus/worm attachment. Seven of these IP addresses were claimed
as being normal, i.e., missed, by SPOT.
The infected messages are only used to confirm whether a machine is compromised in order
to study the performance of SPOT; infected messages are not used by SPOT itself. SPOT
relies on spam messages rather than infected messages to detect whether a machine has been
compromised to produce the results in Table 6.4. We make this decision by noting that it is
against the interest of a professional spammer to send infected spam messages: such
messages are more likely to be detected by anti-virus software, and hence deleted before
reaching the intended recipients. This is confirmed by the low percentage of infected
messages in the overall email trace shown in Table 6.1. Infected messages are more likely to
be observed during the spam zombie recruitment phase than during the spamming phase.
Infected messages can be easily incorporated into the SPOT system to improve its
performance.
The actual false positive rate and the false negative rate are higher than the specified false
positive rate and false negative rate, respectively. One possible reason is that the evaluation
was based on the SPAM emails, which can only provide a partial view of the outgoing
messages originated from inside SPAM.
Consider the number of actual observations that SPOT takes to detect the compromised
machines. As we can see from the figure, the vast majority of compromised machines can be
detected with a small number of observations; for example, more than 80% of the
compromised machines are detected by SPOT with only 3 observations. All the compromised
machines are detected with no more than 11 observations. This indicates that SPOT can
quickly detect the compromised machines. We note that SPOT does not need compromised
machines to send spam messages at a high rate in order to detect them. Here, "quick"
detection does not mean a short duration, but rather a small number of observations. A
compromised machine can send spam messages at a low rate (which, though, works against
the interest of spammers), but it can still be detected once enough observations are obtained
by SPOT.
6.4 PERFORMANCE EVALUATION DESIGNS
We also evaluate the two alternative designs, one based on the number of spam messages
(CT) and the other on the percentage of spam messages sent from a machine (PT). In this
section, we evaluate the performance of CT and PT based on the user-defined parameters.
Both CT and PT need a fixed time window T and an appropriate user-defined threshold.
In this evaluation, we set T to one hour. For CT, we set the threshold to 30 messages,
meaning that a machine is considered a zombie if it sends more than 30 spam messages in
any one-hour window; for PT, we set the threshold to 50%, meaning that a machine is
considered a zombie if more than 50% of the messages it sends in any one-hour window are
spam. Since SPOT needs at least 3 spam messages (when α = 0.01, β = 0.01, θ0 = 0.2, and
θ1 = 0.9) to detect a zombie, to compare PT with SPOT we require that at least 6 messages
have been sent from a given IP address; we therefore ignore all IP addresses that send fewer
than 6 messages.
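As a sanity check on the minimum number of observations, Wald's upper boundary ln((1-β)/α) divided by the per-spam-message log-likelihood gain ln(θ1/θ0) gives the number of consecutive spam observations needed to declare a zombie. With these exact boundaries the count comes out to 4, close to the 3-4 observations quoted in the text; the precise figure depends on how the boundaries are approximated.

```python
import math

# SPOT parameter values used in the evaluation.
alpha, beta, theta0, theta1 = 0.01, 0.01, 0.2, 0.9

B = math.log((1 - beta) / alpha)   # boundary for declaring a zombie
step = math.log(theta1 / theta0)   # LLR gained by each spam observation
n_min = math.ceil(B / step)        # consecutive spam messages needed

print(n_min)
```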
It shows the CDF of the durations of the clusters. As we can see from the figure, more than
75% and 58% of the clusters last no less than 30 minutes and one hour (corresponding to the
two vertical lines in the figure), respectively. The longest cluster duration we observe in the
trace is about 3.5 hours.
Given the above observations, in particular the large number of spam messages in each
cluster, we conclude that dynamic IP addresses will not have any important impact on the
performance of SPOT: SPOT can reach a decision within the vast majority (96%) of the
clusters, provided the observed machine is not a mail relay server deployed by the network.
In practice, a network may have multiple subdomains, each with its own mail servers. A
message may be forwarded by a number of mail relay servers before leaving the network.
SPOT can work well in this kind of network environment; in the following we outline two
possible approaches. First, SPOT can be deployed at the mail servers in each subdomain to
monitor the outgoing messages so as to detect the compromised machines in that subdomain.
Second, and possibly more practically, SPOT can be deployed only at the designated mail
servers, which forward all outgoing messages (or SPOT gets a replicated stream of all
outgoing messages), as discussed in Chapter 3. SPOT relies on the Received header fields to
identify the originating machine of a message in the network. Given that the Received header
fields can be spoofed by spammers, SPOT should only use the Received header fields
inserted by the known mail servers in the network.
SPOT can determine the reliable Received header fields by backtracking from the last
known mail server in the network that forwards the message. It terminates and identifies the
originating machine when an IP address in a Received header field is not associated with a
known mail server in the network.
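The backtracking rule just described can be sketched as follows; the server addresses are made up, and a real deployment would first parse the relay IPs out of the Received header fields (as in the header-extraction sketch earlier).

```python
# Hypothetical set of the network's own mail relay servers.
KNOWN_SERVERS = {"192.0.2.10", "192.0.2.11"}

def originating_machine(received_ips):
    """Backtrack through relay IPs from the Received headers, ordered
    from the most recent hop (closest to us) to the earliest. Stop at
    the first IP that is not a known mail server in the network: that
    is taken as the originating machine."""
    origin = None
    for ip in received_ips:
        origin = ip
        if ip not in KNOWN_SERVERS:
            break  # first untrusted hop: header fields beyond it may be spoofed
    return origin

# Two trusted relays, then the sending machine itself.
print(originating_machine(["192.0.2.10", "192.0.2.11", "203.0.113.9"]))
# 203.0.113.9
```

If every hop in the chain is a known server (e.g. an internally generated message), the last hop examined is returned, which matches the "backtrack until an unknown IP" termination rule.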
By contrast, the sending rate strongly affects detection for CT and PT. If spammers
reduce the number or percentage of spam messages in a time window, the user-defined
threshold can be evaded. For example, suppose we set 30 messages as the threshold in CT.
Once spammers figure out this information, they can evade detection by sending fewer than
30 spam messages in any time window. Moreover, if they know the size of a time window,
they can send even more spam messages by straddling the boundary between two windows,
i.e., sending 29 at the end of the first window and another 29 at the beginning of the next
window. A similar scheme can be used to attack PT.
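A small numeric illustration of the window-straddling evasion described above:

```python
# A spammer who knows CT's threshold (C = 30) and the window boundaries
# can send 29 spam messages at the end of one window and 29 at the start
# of the next: 58 messages total, yet neither window trips the threshold.
C = 30
window1_spam = 29   # sent just before the window boundary
window2_spam = 29   # sent just after the window boundary

flagged = window1_spam > C or window2_spam > C
print(flagged, window1_spam + window2_spam)   # False 58
```

SPRT-based detection does not share this weakness, since its decision depends on the accumulated evidence per message rather than on per-window counts.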
CHAPTER 7
CONCLUSION
The proposed system aims to delete emails with attached viruses. It detects and blocks
spammers by using the SPOT detection algorithm, and provides an account reactivation test.
The flow of work in the proposed system is as follows: the system receives an email message
and checks for a virus in the attachment; it deletes emails having virus files as attachments. If
no virus is found, the message is checked for spam, and the system applies the SPOT
detection algorithm to detect spammers.
The system maintains its database on the machine where it runs. Some spam patterns
will be used to detect spam messages. To find emails containing virus files, a dataset of
unique patterns will be used; these unique patterns are file extensions. All incoming email
messages will be scanned against this dataset.
Compromised machines are a major security threat on the Internet. Given that
spamming provides the critical economic incentive for attackers to recruit the large number
of compromised machines, in this thesis we developed SPOT, an effective spam zombie
detection system by monitoring outgoing messages in a network. SPOT was designed based
on a simple and powerful statistical tool named Sequential Probability Ratio Test to detect the
subset of compromised machines that are involved in the spamming activities. SPOT has
bounded false positive and false negative error rates. It also minimizes the number of
required observations to detect a spam zombie. Our evaluation studies, based on a 2-month
email trace collected on the FSU campus network, showed that SPOT is an effective and
efficient system for automatically detecting compromised machines in a network. We have
also evaluated two alternative designs based on spam count and spam fraction; the results
show that SPOT outperforms them in both the number of detections and detection accuracy.
In summary, they are not as effective as SPOT.
CHAPTER 8
FUTURE WORK
REFERENCES