1.1
Overview
This project concerns network forensics, which allows the details of networking
events to be recovered after they have happened, and shows how to analyze VoIP
attack data patterns using WEKA, a data mining tool. WEKA is used to examine
captured network traffic in order to investigate network and security attacks or
application performance issues. From the data patterns, an investigation is
conducted to reveal information about network and application interactions, user
sessions, and response time and latency metrics. The investigation also aims to
establish the source of the attacks, when the attacks happened, and what types
of attack were found. Tracking down an attacker depends on keeping extensive
records of activity on the network with the help of an intrusion detection
system.
The gathered data will help to find a solution for each attack so that it can
be prevented from happening again. The data analysis also reveals who
communicated with whom, when, and how often. This information could be used as
evidence by the victims if they wish to take further action against the parties
that committed network crimes against them.
1.2
Project Objective
The main objective of this project is to analyze the pattern of attack data
in the captured data. The data indicates the condition of the network events,
so that the source of attacks or other problem incidents can be discovered.
This helps to identify unauthorized access to a computer system and to search
for evidence of other types of threat or attack.
The second objective is to convert the pcap data into an arff data file that
can be recognized by the WEKA data mining tool. The first objective cannot be
carried out if the second objective is not achieved.
1.3
Project Scope
This project focuses on VoIP attack data patterns analyzed using WEKA, a
data mining tool. The Denial-of-Service (DoS), Spam over Internet Telephony
(SPIT), and Man-in-the-Middle (MitM) attacks are the three main focuses of
this project.
1.4
Problem Statement
An emerging
application like VoIP has worsened the situation. Knowing the attack patterns
allows network administrators to protect their networks.
VoIP is one of the newest technologies being rapidly embraced by the market
as an alternative to the traditional Public Switched Telephone Network (PSTN).
Common VoIP threats include network-based DoS, eavesdropping, attacks on
signaling protocols, and spam. Through these attacks, malicious parties can
listen in on other people's conversations, and network overload, packet loss or
congestion can make conversations unintelligible or bring the network down. In
addition, the bandwidth available to each application on the network is
reduced, since it is shared among the applications.
1.5
Problem Solving
The problem can be addressed with a variety of network tools. In this
project WEKA, a data mining tool, is used to examine the network traffic
history in order to investigate the attacks and identify their source.
1.6
Chapter Organization
Chapter 2 presents the literature review for the project. The literature
review describes the research and findings that relate to this project.
Chapter 3 discusses the research methodology and the specific research
methods used to design the project. The chapter explains the methods and
specifications used in this project and also presents the budget and costing.
Chapter 5 discusses the project verification. This chapter gives the
results of the project implementation and experiments. From this chapter, the
reader will understand how the system runs and what its final output is.
This chapter consists of a discussion of several subjects that relate to this project.
The review starts with the definitions and concepts of VoIP, data mining and network
forensics. In addition, the existing VoIP protocols and VoIP issues are examined. Work
by other researchers in the area of study is then reviewed so that it can be included in
the literature review.
2.1
Background
2.1.1
the cost that the user has to pay is a monthly bill to an Internet service provider
(ISP). The other reason is that VoIP services can be used for conference calls, as
opposed to a phone line, on which only two people can speak at a time. With
VoIP, a conference can be set up with a whole team communicating in real
time.
Figure 2.1 shows a simple VoIP process. To send data over the Internet,
the voice or data is compressed into small packets to reduce the amount of
transmission capacity needed. These packets may arrive in a different order from
the one in which they were sent, and they are put back in sequence at the other
end. Packet loss can also happen during transmission. To recover from the loss,
a mechanism covers up the gap and rebuilds the data by collecting the pieces of
information that did arrive [3].
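The reordering and loss handling described above can be sketched as follows. This is a minimal illustration, not the mechanism of any particular VoIP stack: the function name `reassemble`, the integer sequence numbers, and the choice to conceal a lost packet by repeating the previous payload are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def reassemble(packets: List[Tuple[int, bytes]],
               first_seq: int,
               count: int) -> Tuple[List[bytes], List[int]]:
    """Put packets back in sequence order and report which ones are missing.

    packets   : (sequence number, payload) pairs in arrival order
    first_seq : sequence number of the first expected packet
    count     : number of packets expected
    """
    by_seq: Dict[int, bytes] = {}
    for seq, payload in packets:
        # Keep the first copy of each packet and ignore duplicates.
        by_seq.setdefault(seq, payload)

    ordered: List[bytes] = []
    missing: List[int] = []
    for seq in range(first_seq, first_seq + count):
        if seq in by_seq:
            ordered.append(by_seq[seq])
        else:
            missing.append(seq)
            # Assumed concealment strategy: repeat the previous payload
            # (or silence if nothing has been received yet).
            ordered.append(ordered[-1] if ordered else b"")
    return ordered, missing


if __name__ == "__main__":
    arrived = [(3, b"c"), (1, b"a"), (4, b"d")]  # packet 2 was lost
    print(reassemble(arrived, first_seq=1, count=4))
```

Real receivers use a jitter buffer with timing constraints; the sketch only shows the ordering and gap-filling idea.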
There are also other potential problems with VoIP, such as increased
security risks, lower Quality of Service (QoS) and Denial of Service (DoS)
[4]. In the PSTN, a circuit or dedicated channel is set up between two points
for the duration of the call. These telephony systems are based on copper wires
carrying analog voice data over the dedicated circuits [5]. A set amount of
bandwidth is reserved when a call is established between the callers for as long
as the connection is active. One of the main problems with PSTN technology is
that the 64 kbps of bandwidth is reserved even when no data is being sent and
the entire bandwidth is not needed. The actual bandwidth requirement is
usually only a small fraction of what is reserved [4].
2.1.2
VoIP Attacks
VoIP is clearly gaining ground over PSTN, but security is a major concern for
the VoIP community. Adding more security mechanisms tends to degrade VoIP
service performance. On the other hand, without security mechanisms, VoIP
services would be open to threats and attacks [2]. Man-in-the-Middle (MitM),
Denial of Service (DoS) and Spam over Internet Telephony (SPIT) are among the
VoIP attacks.
What spam is to email, SPIT is to VoIP: unwanted bulk calls or voicemails
sent over VoIP networks [10]. SPIT may be a bigger problem to deal with than
spam. SPIT can cause a bandwidth problem that increases bandwidth bills several
times over, because voice messages are much larger than emails, which are only
a few kilobytes apiece. SPIT also differs from spam in how it can be handled.
Spam can be detected before it disturbs the recipient, whereas with SPIT it is
too late for prevention once the phone rings, and the phone rings immediately
after session initiation [10]. This disturbs the user's current activity.
2.1.3
Network Forensic
network data could range from extracting files and reconstructing web sessions
to tracing data leakage and detecting advanced persistent threats [13].
2.1.4
Data mining is the process of analyzing data from different perspectives and
summarizing it into useful information [14], and data mining software is one of
the analytical tools for analyzing data. Data mining can be separated into two
kinds, directed and undirected. Directed data mining tries to predict a
particular data point, whereas undirected data mining tries to find patterns in
existing data or to create groups of data [15]. Data mining has dozens of
techniques and procedures that are used to examine and transform data. The aim
of data mining is to create a model that improves the way existing and future
data are read and interpreted [15].
2.2
Previous Work
2.2.1
In this research paper, Mohammed I. Al-Saleh and Yahya A. Forihat investigated
the evidence left by Skype calls and chats on Android devices. Smartphones have
capabilities similar to those of PCs and can store large amounts of data and
different categories of information. Android-based smartphones are becoming
more popular because of the wide variety of mobile applications (apps) developed
to extend the functionality of the phones. VoIP apps are extensively used
because of their wide availability and low prices, and Skype is one of the
popular VoIP apps.
The paper assumes that Skype may be used as a means of committing
cybercrimes. Digital forensics may be conducted on mobile devices, computers
and networks in order to detect cyber-criminal activities and prove them in
court. Fig. 2.2 shows the investigation model the researchers designed. In the
figure, the criminal starts a call conversation session with the victim. The
investigator then extracts evidence of the conversation sessions from the
criminal's device by inspecting both the RAM and the NAND flash memory [18].
After several experiments, the results showed no differences between the
call conversation patterns across experiments. Chat messages were found in both
memories, and the average number of occurrences decreased over the different
time durations. This means that chat messages remained in the flash memory for
a long time, without redundancy. The remaining messages can still be used as
evidence. The researchers concluded that Skype conversation patterns and chat
messages can be found in both the RAM and the NAND flash memory for a long
time, regardless of whether call and chat histories are deleted or the user
signs out of Skype [18].
2.2.2
Figure 2.3 Class diagram for a VoIP network forensics system [19]
collector and the network investigator. The advantages of using the forensic
pattern are: automated evidence analysis reduces the response time of forensic
investigators; the analyzer can provide information about logs for tracing
back the attackers; and it can determine the call history, when a user was using
the VoIP device, and with whom the user communicated [19].
2.2.3
Fig.2.4 shows the relation between the VoIP security patterns and related
cryptographic patterns. The double boxes represent the patterns. The Network
Segmentation pattern minimizes disruption in the event of an attack, so that
critical voice traffic is not affected. The VoIP Tunneling pattern uses
encryption to ensure data integrity and confidentiality in VoIP networks.
Tunnels secure the transport of VoIP traffic over the external network and
eliminate the risk of exposing the network. The Signed Authenticated Call
pattern provides a suitable way to authenticate messages in VoIP and is the best
countermeasure against theft-of-service attacks. In the Secure VoIP Call
pattern, VoIP calls are encrypted and decrypted to provide confidentiality.
The paper concludes that using VPNs and encrypting all voice traffic is the
best security approach for VoIP. This would ensure that critical voice traffic
is unaffected if an attack occurs on the data network [20]. To further enhance
VoIP security, filtering and firewalls can be implemented to control the
traffic between the data VPN and the voice [20].
2.2.4
This research project focuses on large sets of data that can be handled by
a data mining system. The WEKA data mining tools are studied to demonstrate the
data mining methodology and to obtain results from the data. The WEKA toolkit
is flexible and easily extended. WEKA is written in Java, which makes it easily
portable. It supports data preprocessing and modeling techniques.
WEKA is user friendly and provides a large set of functions and tools,
including attribute selection, preprocessing filters, clustering,
classification, data visualization and association discovery. WEKA is free,
open source software that is available to all users, and it can be used to run
individual experiments. WEKA supports various data formats, including ARFF,
comma-separated values (CSV) and formats accepted by decision tree induction
algorithms.
Fig. 2.5 presents the flow of data mining used in WEKA. Data is
classified based on the attribute selection, and the data is then divided into
clusters based on the type of grouping that the user selects. The output
obtained after clustering gives the accuracy of the clustered data, which can be
used for future predictions. Finally, regression analysis describes how
regression can be applied and how the results can be visualized.
This research project imported bank data into WEKA and implemented the
work in four modules that represent the stages of the data mining process. The
source file can be in either .arff or .csv format. Fig. 2.6 shows the WEKA
preprocessing window with the bank data. The data is saved to
bankdata-final.arff after the parameters are set up. Association,
classification, clustering and regression are the four stages of the data
mining process [21].
2.3
Critical Analysis
JOURNAL | RESEARCH DATA | TOOLS (SOFTWARE / HARDWARE) | VOIP ATTACKS (DoS / SPIT / MiTM) | PROTOCOL (SIP / RTP)
JOURNAL 1 [REFERENCE REQUIRED] | Skype | | |
JOURNAL 2 [REFERENCE REQUIRED] | Converged Network | | |
JOURNAL 3 [REFERENCE REQUIRED] | Converged Network | | |
JOURNAL 4 [REFERENCE REQUIRED] | Bank Employee | | |
This chapter explains in detail the methodology used to complete the project. The
methodology is chosen to achieve the objectives of the project. Section 3.1 introduces
the methodology used in this project. Section 3.2 lists the hardware and software
resources. The budget and costing of the tools are listed in Section 3.3. Section 3.4 and
Section 3.5 present the Work Breakdown Structure (WBS) and the project timeline
(Gantt chart), which consists of the activity duration estimates and the project
schedule.
3.1
Implementation phase,
3.1.1
Initiation
which can formulate a research question based on the research gaps and discuss
how these projects are likely.
3.1.2
Planning
In the next phase, the planning phase, all of the work to be done is
identified, including the hardware and software resource requirements, the
research model, and the strategy for implementing the project. A project plan is
created outlining the activities, tasks, dependencies and timeframes, and a
project budget is prepared by providing cost estimates for the equipment and
materials. The budget is used to monitor and control expenditure during project
implementation. The project plan is shown in Fig.3.3 and Fig.3.4 on pages 7
and 8.
3.1.3
Design
During the third phase, the design phase, the hardware and software are
defined and the .pcap data files are collected. The system architecture and
topology are also designed in this phase. They show the process of the project
work and the process of converting the .pcap data files into a format that
WEKA recognizes. Fig.3.2 shows the architecture of the project.
The .pcap format is the most widely available file format for logging network
traffic and can be read by almost any network analysis tool. Such tools display
huge amounts of data that have to be gone through to find problems with the
network. To be recognized by WEKA, the .pcap data files are first converted into
a temporary .csv file using tshark, the Wireshark command line tool. The .csv
file is then converted into the .arff format supported by WEKA by opening it as
a plain text (Notepad) file and saving it as an .arff file.
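The report does not give the exact tshark command that was used, so the following is only a sketch of one way to export the seven columns that appear later in the WEKA output (No., Time, Source, Destination, Protocol, Length, Info). The helper names and file paths are hypothetical, and the field names assume a reasonably recent Wireshark release; older releases may name the column fields differently.

```python
import subprocess
from typing import List

# Assumed mapping of the seven report columns to tshark display fields.
FIELDS: List[str] = [
    "frame.number",         # No.
    "frame.time_relative",  # Time
    "ip.src",               # Source
    "ip.dst",               # Destination
    "_ws.col.Protocol",     # Protocol
    "frame.len",            # Length
    "_ws.col.Info",         # Info
]


def build_tshark_command(pcap_path: str) -> List[str]:
    """Build a tshark command line that prints the chosen fields as CSV."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields",
           "-E", "header=y", "-E", "separator=,", "-E", "quote=d"]
    for field in FIELDS:
        cmd += ["-e", field]
    return cmd


def export_csv(pcap_path: str, csv_path: str) -> None:
    """Run tshark (must be installed) and write its output to csv_path."""
    with open(csv_path, "w", encoding="utf-8") as out:
        subprocess.run(build_tshark_command(pcap_path), stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file name, shown only to illustrate the command line.
    print(" ".join(build_tshark_command("capture.pcap")))
```

The header row written by `-E header=y` contains the field names rather than the column titles used in the report, so the first line may need renaming before the file is loaded.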
3.1.4
3.1.5
Verification
The fifth phase is verification. The results of the fourth phase are
verified to identify whether the data and the implemented design meet the
requirements of the project. If testing fails, the system is modified until it
runs successfully. Conclusions can then be drawn based on the correctness and
completeness of development and operation in the testing phase.
3.1.6
Documentation
The project requires the following hardware and software. Table 3.1
shows the hardware specifications and Table 3.2 shows the software
specifications. These are the minimum requirements needed to ensure the success
of the simulator.
3.2.1
Hardware Specifications
No. | Device | Quantity | Specifications
1 | Laptop | 1 | ASUS brand; Processor: Intel Core i3; RAM: 6.00 GB; OS: Microsoft Windows 7
3.2.2
Software Specifications
No. | Software | Descriptions
 | WEKA |
 | Wireshark |
3.3 Budget/Costing
The following is a review of the budget and costing of the hardware and software
requirements. Table 3.3 shows the estimated hardware budget and Table 3.4 shows
the estimated software budget.
3.3.1
No. | Equipment | Quantity | Price (RM) | Remark
1 | Laptop | | 1800 | Student's property
3.3.2
No. | Equipment | Quantity | Price (RM) | Remark
 | WEKA | | | Open source
 | Wireshark | | | Open source
The following figure is the WBS, which contains the levels of the work breakdown
structure and provides further definition and detail.
This chapter explains the project testing and implementation stages. Section 4.1,
the testing stage, discusses the conversion of the pcap files into arff files. The
testing stage is divided into two subsections. Section 4.1.1 introduces the conversion
of the pcap files into csv files, and Section 4.1.2 introduces the conversion of the csv
files into arff files. Section 4.2 discusses ethical matters and Section 4.3 discusses
ways to analyze the data.
4.1
Testing Stage
4.1.1
Open the pcap file in Wireshark and, on the File menu, choose Export
Packet Dissections. This menu item allows some of the packets in the capture
file to be exported to a file. In this case, choose CSV (Comma Separated Values
packet summary) as shown in figure 4.1 on page 25. Then save the file in csv
format as shown in figure 4.2.
4.1.2
Open the csv file by changing the Files of type option to CSV data files (*.csv), as
shown in figure 4.4.
Then choose Save as, delete ".csv" from the file name and change it to ".arff" as in
figure 4.5. The csv file has then been converted to an arff file.
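For readers who prefer to script this step, the conversion can also be sketched in code. This is not the procedure used in the project, which relies on the save dialog; it is a minimal illustration that assumes the seven-column packet summary layout, treats No., Time and Length as numeric, and declares every other column as a nominal attribute listing the values observed in the file. The function names are hypothetical.

```python
import csv
import io
from typing import List

# Assumption: these columns hold numbers; all others become nominal attributes.
NUMERIC = {"No.", "Time", "Length"}


def quote(value: str) -> str:
    """Quote a name or value for ARFF, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def csv_to_arff(csv_text: str, relation: str) -> str:
    """Convert CSV text (first row is the header) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines: List[str] = ["@relation " + quote(relation), ""]
    for col, name in enumerate(header):
        if name in NUMERIC:
            lines.append("@attribute " + quote(name) + " numeric")
        else:
            values = sorted({row[col] for row in data})
            nominal = ",".join(quote(v) for v in values)
            lines.append("@attribute " + quote(name) + " {" + nominal + "}")

    lines += ["", "@data"]
    for row in data:
        cells = []
        for name, cell in zip(header, row):
            if name in NUMERIC:
                cells.append(cell if cell else "?")  # "?" marks a missing value
            else:
                cells.append(quote(cell))
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical one-packet sample in the packet summary layout.
    sample = ('No.,Time,Source,Destination,Protocol,Length,Info\n'
              '1,0.0,10.0.0.1,10.0.0.2,ICMP,74,"Echo (ping) request"\n')
    print(csv_to_arff(sample, "icmp"))
```

Declaring Info as nominal matches the way its values appear as class labels in the confusion matrices of Chapter 5, but on large captures the value list becomes very long.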
4.2
Ethical Matters
4.3
Analysis Stage
4.3.1
This stage focused on some common types of DoS attack, namely ICMP Echo
flood, UDP flood and TCP SYN flood, together with data from a reliable source,
using the WEKA Explorer preprocessing, classification, clustering and attribute
selection functions.
4.3.1.1
4.3.1.1.1
Preprocessing
The file was loaded into WEKA in the Preprocess window, as shown in Fig.4.8,
by clicking the Open file button and choosing the .arff file from the local file
system.
Once the data is loaded, WEKA recognizes the attributes, which are shown in the
Attribute window.
The left panel of the Preprocess window shows the list of recognized attributes:
No.: a number that identifies the order of the attributes as they appear in the
data file.
During the scan of the data, WEKA computes some basic statistics on each
attribute. The following statistics are shown in the Selected attribute box on the
right panel of the Preprocess window:
Distinct: the number of different values that the data contains for
this attribute.
The attribute No. is numeric. The following frequency statistics are therefore
shown for this attribute in the Selected attribute box:
Missing: 0 means that the attribute is specified for all instances (no
missing values).
Distinct: 6 means that the attribute No. takes six different values.
Unique: 6 means that six instances have a value of No. that no other instance
has.
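The three statistics can be illustrated with a short sketch. It is not WEKA's own code; the function name `attribute_stats` and the use of `None` or "?" to mark a missing value are assumptions for the example.

```python
from collections import Counter
from typing import Dict, List, Optional


def attribute_stats(values: List[Optional[str]]) -> Dict[str, int]:
    """Compute Missing, Distinct and Unique counts for one attribute column."""
    present = [v for v in values if v is not None and v != "?"]
    counts = Counter(present)
    return {
        # Instances for which the attribute has no value.
        "missing": len(values) - len(present),
        # Number of different values that occur.
        "distinct": len(counts),
        # Number of values that occur in exactly one instance.
        "unique": sum(1 for c in counts.values() if c == 1),
    }


if __name__ == "__main__":
    # Six packets numbered 1..6: every value is different, so all are unique.
    print(attribute_stats(["1", "2", "3", "4", "5", "6"]))
    # A column with a repeated value and one missing entry.
    print(attribute_stats(["SIP", "SIP", "RTP", None]))
```

A packet number column always has Distinct equal to Unique, because each number occurs once; a column such as Protocol normally has far fewer distinct values than instances.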
4.3.1.1.2
Classification
In Fig.4.13, J48, WEKA's decision tree learner that implements the C4.5
algorithm, is used to analyze the data sample. The C4.5 algorithm was chosen
because it can handle numeric attributes.
In this data sample, the Percentage split radio button was selected and left
at its default of 66%. Percentage split evaluates the classifier on how well it
predicts a certain percentage of the data, which is held out for testing. The
amount of data held out depends on the value entered in the % field: with 66%,
the classifier is trained on 66% of the instances and tested on the remaining
34%. Once the options have been specified, the learning process is started by
clicking the Start button.
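The idea of a percentage split can be sketched as follows. This is an illustration of the technique rather than WEKA's implementation: the function name, the shuffling step and the seed value are assumptions, and WEKA's own random ordering will differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def percentage_split(instances: Sequence[T],
                     train_percent: float = 66.0,
                     seed: int = 1) -> Tuple[List[T], List[T]]:
    """Shuffle the instances, then split them into training and test sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    train_size = round(len(shuffled) * train_percent / 100)
    return shuffled[:train_size], shuffled[train_size:]


if __name__ == "__main__":
    # 4447 is the instance count reported for the reliable data set.
    train, test = percentage_split(range(4447))
    print(len(train), len(test))
```

With 4447 instances and a 66% split, this sketch trains on 2935 instances and holds out 1512 for testing.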
4.3.1.1.3
Clustering
Once the cluster scheme SimpleKMeans is selected,
When training is complete, the Cluster output area on the right panel of the
Cluster window is filled with text describing the results of training and testing.
A new entry appears in the Result list box on the left.
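To give an idea of what a k-means clusterer does with a numeric attribute such as packet length, here is a minimal one-dimensional sketch. It is a simplification and not WEKA's SimpleKMeans: the function name is hypothetical, the initial centroids are spread evenly over the sorted values instead of being chosen at random, and the sample lengths are invented for the example.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int,
              max_iter: int = 100) -> Tuple[List[float], List[int]]:
    """Cluster numbers into k groups; return the centroids and assignments."""
    ordered = sorted(values)
    # Simplified, deterministic initialisation (an assumption of this sketch).
    centroids = [float(ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)])
                 for i in range(k)]
    assignment: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        if new_assignment == assignment:
            break  # converged: nothing changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignment


if __name__ == "__main__":
    lengths = [60, 62, 64, 1500, 1510, 1490]  # hypothetical packet lengths
    print(kmeans_1d(lengths, k=2))
```

On the sample data the small packets and the large packets end up in separate clusters, which is the kind of grouping the Cluster output summarizes.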
4.3.1.1.4
Attribute selection
The implementation for the other data sets, namely UDP Flood, TCP SYN
Flood and the data from the reliable source, is not shown because it follows the
same steps as shown for the ICMP Flood data. The results for each data set are
analyzed in the next chapter, Chapter V: Result and Analysis.
5.1
Reliable Data
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     reliabledata
Instances:    4447
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info
Test mode:
Number of Leaves: 10
19
SIP is a signaling protocol used for controlling multimedia communication
sessions, such as voice or video calls over IP. The protocol can be used for
creating, modifying and terminating two-party or multiparty sessions
consisting of one or several media streams. In this capture file, SIP is used to
create and tear down VoIP sessions.
RTP defines a standardized packet format for delivering audio and video
over the Internet. RTP is usually used in conjunction with RTCP. When the two
are used together, RTP is usually sent and received on even port numbers,
whereas RTCP uses the next higher odd port number. In this capture file, RTP is
used as the media protocol to transport voice.
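The port convention described above can be written as a small helper. The function name is hypothetical, and the sketch only covers the conventional pairing; sessions that negotiate the RTCP port explicitly do not have to follow it.

```python
def rtcp_port_for(rtp_port: int) -> int:
    """Return the conventional RTCP port for an even-numbered RTP port."""
    if not 0 < rtp_port < 65535:
        raise ValueError("RTP port out of range")
    if rtp_port % 2 != 0:
        raise ValueError("RTP port is expected to be even")
    return rtp_port + 1  # RTCP uses the next higher (odd) port


if __name__ == "__main__":
    # 5004 is used here purely as an example of an even RTP port.
    print(rtcp_port_for(5004))
```

When reading a capture, a UDP stream on an even port followed by occasional traffic on the next odd port is therefore a hint that the pair is RTP and RTCP.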
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     reliabledata
Instances:    4447
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info
Test mode:
(0 bindings) | (2.0/1.0)
(fetch bindings) |
(2.0)
| | | No. > 1297: Request: SUBSCRIBE sip:555@172.25.105.40 | (2.0)
| | No. > 1302: Request: ACK sip:1000@172.25.105.40 | (3.0/1.0)
| | No. > 1: User-Agent: Asterisk PBX 1.6.0.10 | FONCORE
5.2
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     icmp
Instances:
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info
Test mode:
Number of Leaves : 2
100
Kappa statistic
%
%
0.5
0.6124
133.3333 %
141.4214 %
50
%
%
ROC Area
0.000
0.000
0.000
0.000
0.000
Echo
0.000
0.000
0.000
0.000
0.000
0.000
1.000
1.000
0.000
0.000
0.000
0.000
Echo
0.000
0.000
0.000
0.000
0.000
Echo
0.000
0.000
0.000
0.000
0.000
0.000
0.000
1.000
a b c d <-- classified as
0 0 0 0 | a = Echo (ping) request id=0x0200, seq=9472/37, ttl=32 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
0 0 2 0 | b = Destination unreachable (Host unreachable) [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
0 0 0 0 | c = Echo (ping) request id=0x0200, seq=9728/38, ttl=32 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
0 0 0 0 | d = Echo (ping) request id=0x0200, seq=9984/39, ttl=32 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
5.3
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     tcp
Instances:
Attributes:   7
              No.
              Time
              Source
              Destination
              Protocol
              Length
              Info
Test mode:
Number of Leaves:
33.3333 %
66.6667 %
Kappa statistic
0.1429
0.2667
0.483
84.6154 %
116.1347 %
33.3333 %
26.6667 %
F-Measure MCC
ROC Area
0.000
0.000
0.000
0.000
0.000
0.000
0.500
0.333
0.000
0.000
0.000
0.000
0.000
0.500
0.333
0.333
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.000
0.500
0.500
1.000
0.667
0.500
0.750
0.500
Len=648
[ETHERNET
FRAME
CHECK
SEQUENCE
INCORRECT]
Weighted Avg.
0.333
0.167
0.167
0.333
0.222
0.167
0.583
0.389
a b c d e <-- classified as
0 0 1 0 0 | a = boinc-client > neod2 [ACK] Seq=1 Ack=1 Win=8760 Len=0 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
0 0 0 0 1 | b = neod2 > boinc-client [PSH, ACK] Seq=5841 Ack=1 Win=8760 Len=648 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
0 0 0 0 0 | c = boinc-client > neod2 [ACK] Seq=1 Ack=2921 Win=8760 Len=0 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
0 0 0 0 0 | d = boinc-client > neod2 [ACK] Seq=1 Ack=5841 Win=8760 Len=0 [ETHERNET FRAME CHECK SEQUENCE INCORRECT]
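The accuracy and the Kappa statistic can be recomputed from a confusion matrix, which is a useful check on the summary figures. The sketch below uses the four rows listed above and assumes a fifth row, 0 0 0 0 1, for class e; that row does not appear in the listing and is an assumption chosen because it is consistent with the 33.3333 % and 0.1429 figures reported for this run. The function name is hypothetical.

```python
from typing import List, Tuple


def accuracy_and_kappa(matrix: List[List[int]]) -> Tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(size)]
    # Agreement expected by chance, from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


if __name__ == "__main__":
    tcp_matrix = [
        [0, 0, 1, 0, 0],  # a
        [0, 0, 0, 0, 1],  # b
        [0, 0, 0, 0, 0],  # c
        [0, 0, 0, 0, 0],  # d
        [0, 0, 0, 0, 1],  # e (assumed row, not shown in the listing)
    ]
    acc, kappa = accuracy_and_kappa(tcp_matrix)
    print(round(acc * 100, 4), round(kappa, 4))
```

Under that assumption one of the three test instances is classified correctly, and the chance-corrected agreement works out to 1/7, which rounds to 0.1429.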
From the information above, the file begins with standard TCP ACK
packets sent between 192.168.0.1 and 192.168.0.2. When TCP sends a packet to
a destination and does not get a reply, it waits a specified amount of time and
then retransmits the original packet. If a response is still not received, the
source (transmitting) computer doubles the amount of time it waits for a
response before sending another retransmission. Once all retransmission
attempts have failed, the connection is considered to have failed and the data
in transit is lost.
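The doubling of the wait time can be sketched as a simple schedule. The initial timeout and the number of retries used here are hypothetical values for illustration; real TCP stacks derive the timeout from measured round-trip times and cap both the timeout and the number of attempts.

```python
from typing import List


def retransmission_schedule(initial_timeout: float, max_retries: int) -> List[float]:
    """Return the wait (in seconds) before each successive retransmission."""
    waits: List[float] = []
    timeout = initial_timeout
    for _ in range(max_retries):
        waits.append(timeout)
        timeout *= 2  # double the wait after every unanswered attempt
    return waits


if __name__ == "__main__":
    schedule = retransmission_schedule(initial_timeout=1.0, max_retries=4)
    print(schedule, "total:", sum(schedule))
```

In a capture, this shows up as retransmissions of the same segment whose spacing roughly doubles each time, which helps to distinguish a dead path from ordinary packet loss.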
6.1
Project Accomplishment
In the early days of VoIP, there was no great concern about security issues
related to its use. People were mostly concerned with its cost, functionality and
reliability. Now that VoIP is gaining wide acceptance and becoming one of the
mainstream communication technologies, security has become a major issue. To
address this problem, network forensics provides the monitoring and analysis
of computer network traffic for the purposes of information gathering, legal
evidence, or intrusion detection.
This project started with converting pcap (Packet Capture) files into the
Attribute-Relation File Format (arff), the format that WEKA recognizes. We then
learned how to analyze the data using the WEKA Explorer preprocessing,
classification, clustering and attribute selection functions, before obtaining
data from a company that provides VoIP analysis.
We believe that the objectives set for this project have been met. The first
objective is to analyze the pattern of attack data in the captured data, where
the data indicates the condition of the network events.
The second objective has also been achieved. It is to convert the pcap data
into an arff data file so that the input is recognized by the WEKA data mining
tool. It is important to state that the first objective depends on this second
objective.
We had some difficulty in getting the right data for our analysis, since many
companies are bound by legal constraints that prevent them from sharing their
data with us. However, we were still able to obtain simulated data from a
related project conducted by another student in UniKL. With real data, our
research might have produced more interesting findings.
6.2
Future Recommendation
For future work, there are a few aspects that can be further enhanced
by expanding some features and criteria to make the analysis more
robust.
required data analyzed for the future enhancements.
Include more type of attacks that are
attacks
up the process.
In conclusion, we would like to highlight that VoIP security is one of the
concerns raised by the VoIP community. Although the problem is still under
control, system administrators are currently not equipped with the right tools to
detect VoIP attacks as early as possible. In most cases Wireshark or another network
sniffer is used to determine the condition of the network. We are trying to give
system administrators an alternative by providing pattern reports produced by a data
mining tool such as WEKA.
REFERENCES
[1]
A Brief History of VoIP Document One - The Past. Hallock, Joe. 2004.
[2]
[3]
[4]
[5]
[6]
[7]
Man-in-the-Middle Attacks. schneier.com. [Online]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
Rapid Application Development. Core Partners Inc. s.l.: www.corepartners.com.
[24]
Advantages of Rapid Application Development. buzzle.com. [Online] 2002013. [Cited: December 12, 2013.]
http://www.buzzle.com/articles/advantages-of-rapid-applicationdevelopment.html