A
Project Report
on
Analysis of Log File using Hadoop
Submitted in fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
by
L. Rama Narayana Reddy 13VD1A0532
V. Tejaswi 13VD1A0554
P. Snigda 13VD1A0547
2016-2017
This is a record of bonafide work carried out by us and the results embodied in this
project report have not been reproduced or copied from any source. The results embodied in
this project have not been submitted to any other University or Institute for the award of any
degree or diploma.
V.Tejaswi (13VD1A0554)
P.Snigda (13VD1A0547)
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD
COLLEGE OF ENGINEERING MANTHANI
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
This is to certify that the project report entitled Analysis of Log File using Hadoop, being submitted by L. Rama Narayana Reddy (13VD1A0532), V. Tejaswi (13VD1A0554) and P. Snigda (13VD1A0547) in fulfillment of the requirements for the award of the Degree of BACHELOR OF TECHNOLOGY in Computer Science and Engineering to the JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD COLLEGE OF ENGINEERING MANTHANI, is a record of bonafide work carried out by them under my guidance and supervision.
The results of the investigation enclosed in this report have been verified and found satisfactory. The results embodied in this project report have not been submitted to any other University or Institute for the award of any degree or diploma.
Dr. K. Shahu Chatrapati
COLLEGE OF ENGINEERING MANTHANI
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
This is to certify that the project report entitled Analysis of Log File using Hadoop, being submitted by L. Rama Narayana Reddy (13VD1A0532), V. Tejaswi (13VD1A0554) and P. Snigda (13VD1A0547) in fulfillment of the requirements for the award of the Degree of Bachelor of Technology in Computer Science and Engineering to the JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY HYDERABAD COLLEGE OF ENGINEERING MANTHANI, is a record of bonafide work carried out by them under my guidance and supervision. The results embodied in this project report have not been submitted to any other University or Institute for the award of any degree or diploma.
Dr. K. Shahu Chatrapati
Date:
External Examiner
ACKNOWLEDGMENT
We express our sincere gratitude to Dr. Vishnu Vardhan, Vice Principal, JNTUH
College of Engineering Manthani for his excellent guidance, advice and encouragement in
taking up this project.
We express our profound gratitude and thanks to our project guide Dr. K. Shahu
Chatrapati, HOD, CSE Department for his constant help, personal supervision, expert
guidance and consistent encouragement throughout this project which enabled us to complete
our project successfully in time.
We also take this opportunity to thank other faculty members of CSE Department for their
kind co-operation.
We wish to convey our thanks to one and all those who have extended their helping
hands directly and indirectly in completion of our project.
V.Tejaswi (13VD1A0554)
P.Snigda (13VD1A0547)
National Informatics Centre
National Informatics Centre (NIC) was established in 1976, and has since emerged as
a "prime builder" of e-Government / e-Governance applications up to the grassroots level as
well as a promoter of digital opportunities for sustainable development. NIC, through its ICT
Network, "NICNET", has institutional linkages with all the Ministries/Departments of the Central Government, 36 State Governments/Union Territories, and about 688 District
administrations of India. NIC has been instrumental in steering e-Government/e-Governance
applications in government ministries/departments at the Centre, States, Districts and Blocks,
facilitating improvement in government services, wider transparency, promoting
decentralized planning and management, resulting in better efficiency and accountability to
the people of India.
Records and Property registration, Culture & Tourism, Import & Exports facilitation, Social
Welfare Services, Micro-level Planning, etc. With increasing awareness leading to demand
and availability of ICT infrastructure with better capacities and programme framework, the
governance space in the country witnessed a new round of projects and products, covering
the entire spectrum of e-Governance including G2C, G2B, G2G, with emphasis on service
delivery.
NIC has set up state-of-the-art ICT infrastructure consisting of National and state Data
Centres to manage the information systems and websites of Central Ministries/Departments,
Disaster Recovery Centres, Network Operations facility to manage heterogeneous networks
spread across Bhawans, States and Districts, Certifying Authority, Video-Conferencing and
capacity building across the country. National Knowledge Network (NKN) has been set up to
connect institutions/organizations carrying out research and development, Higher Education
and Governance with speed of the order of multi Gigabits per second. Further, State
Government secretariats are connected to the Central Government by very high speed links
on Optical Fiber Cable (OFC). Districts are connected to respective State capitals through
leased lines.
As NIC is supporting a majority of the mission mode e-Governance projects, the chapter on
National e-Governance Projects lists the details of these projects, namely the National Land
Records Modernization Programme (NLRMP), Transport and National Registry, Treasury
Computerization, VAT, MG-NREGA, India-Portal, e-Courts, Postal Life Insurance, etc. NIC
also lays framework and designs systems for online monitoring of almost all central
government schemes like Integrated Watershed Management (IWMP), IAY, SGSY, NSAP,
BRGF, the Scheduled Tribes and Other Traditional Forest Dwellers Act, etc. ICT support is also
being provided in the States / UTs by NIC. Citizen centric services are also being rendered
electronically at the district level, such as Income Certificate, Caste Certificate, and
Residence Certificate etc. along with other services like Scholarship portals, permits, passes,
licenses, to name a few. In executing all these activities, NIC has received recognition in the form of awards and accolades at international as well as national levels, which are listed in the Awards Section. Thus NIC, a small programme started through the external stimulus of a UNDP project in the early 1970s, became fully functional in 1977 and has since grown with tremendous momentum to become one of India's major S&T organizations promoting informatics-led development.
ABSTRACT:
In today's Internet world, logs are an essential part of any computing system, supporting capabilities from audits to error management. As logs grow and the number of log sources increases (such as in cloud environments), a scalable system is necessary to process logs efficiently. Log file analysis is becoming a necessary task for analyzing customer behavior in order to improve sales, and for datasets in domains like the environment, science, social networks, medicine and banking it is important to analyze the log data to extract the required knowledge from it. Web mining is the process of discovering knowledge from web data.
Log files are generated very fast, at the rate of 1-10 MB/s per machine; a single data center can generate tens of terabytes of log data in a day. These datasets are huge. In order to analyze such large datasets, we need a parallel processing system and a reliable data storage mechanism. A virtual database system is an effective solution for integrating data, but it becomes inefficient for large datasets. The Hadoop framework provides reliable data storage through the Hadoop Distributed File System and the MapReduce programming model, which is a parallel processing system for large datasets. The Hadoop Distributed File System breaks up input data and sends fractions of the original data to several machines in the Hadoop cluster to hold blocks of data. This mechanism helps to process log data in parallel using all the machines in the Hadoop cluster and computes results efficiently. The dominant approach provided by Hadoop, store first, query later, loads the data into the Hadoop Distributed File System and then executes queries written in Pig Latin.
This approach reduces the response time as well as the load on the end system. Log files are a primary source of information for identifying system threats and problems that occur in the system at any point of time. These threats and problems can be identified by analyzing the log file and finding patterns of possible suspicious behavior. The concerned administrator can then be provided with appropriate alerts or warnings regarding these security threats and problems, which are generated after the log files are analyzed. Based upon these alerts or warnings, the administrator can take appropriate actions. Many tools and approaches are available for this purpose; some are proprietary and some are open source.
10
CONTENTS
1. INTRODUCTION
   1.5 Modules
2. LITERATURE SURVEY
3. SYSTEM ANALYSIS
   3.1 Existing System
   3.2 Proposed System
   3.3 Feasibility Study
      3.3.1 Economic Feasibility
      3.3.2 Technical Feasibility
      3.3.3 Social Feasibility
4. SYSTEM REQUIREMENTS SPECIFICATIONS
   4.1 Introduction
   4.2 Non-Functional Requirements
   4.3 System Requirements
5. SYSTEM DESIGN
   5.1 Introduction
   5.2 High-level Design
   5.3 Low-level Design
      5.3.1 UML Diagrams
6. CODING
7. TESTING
   7.1 Types of Testing
   7.2 Test Strategy and Approach
   7.3 Test Cases
8. SCREENSHOTS
9. CONCLUSION
10. BIBLIOGRAPHY
1. INTRODUCTION:
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be automatically handled by the framework. The Hadoop framework includes the following four modules:
Hadoop Common: These are the Java libraries and utilities required by other Hadoop modules. These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.
Hadoop YARN: This is a framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large datasets.
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system. HDFS stores large files, typically in the range of gigabytes to petabytes, across multiple machines.
HDFS uses a master/slave architecture where the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.
HDFS Architecture:
servers and logged on local hard disks. The proposed system uses a four-node environment where log data is manually stored on the local hard disk of a local machine. This log data is then transferred to HDFS using a Pig Latin script. The log data is processed by MapReduce to produce Comma Separated Values (CSV). We then find the areas where errors or warnings exist on the server, and also find the spammer IPs in the web application. Finally, we use Excel or similar software to produce statistical information and generate reports.
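As an illustration of the kind of per-line log processing described above, the following plain-Java sketch parses Apache-style access log lines and counts error (4xx/5xx) responses per client IP, the raw signal used for spotting error areas and spammer IPs. The class name, regex and status threshold are assumptions for this example, not the project's actual MapReduce code.

```java
import java.util.*;
import java.util.regex.*;

// Illustrative sketch only: flags IPs with 4xx/5xx responses in an
// Apache-style access log. Regex, threshold and names are assumptions.
public class LogSketch {
    // groups: 1 = client IP, 2 = HTTP status code
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" (\\d{3}) \\S+");

    public static Map<String, Integer> errorCounts(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) continue;               // skip malformed lines
            int status = Integer.parseInt(m.group(2));
            if (status >= 400)                     // error/warning responses
                counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "10.0.0.1 - - [10/Oct/2016:13:55:36 +0530] \"GET /a HTTP/1.1\" 404 512",
            "10.0.0.1 - - [10/Oct/2016:13:55:37 +0530] \"GET /b HTTP/1.1\" 500 0",
            "10.0.0.2 - - [10/Oct/2016:13:55:38 +0530] \"GET /c HTTP/1.1\" 200 99");
        System.out.println(errorCounts(sample));   // {10.0.0.1=2}
    }
}
```

An IP with an unusually high error count in this map would be a candidate spammer IP in the sense described above; the real project computes such counts in parallel with MapReduce rather than in one JVM.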
1.5 Modules:
Implementation is the stage of the project when the theoretical design is turned out
into a workingsystem. Thus it can be considered to be the most critical stage in achieving a
successful new systemand in giving the user, confidence that the new system will work and
be effective. Theimplementation stage involves careful planning, investigation of the existing
system and itsconstraints on implementation, designing of methods to achieve changeover
and evaluation ofchangeover methods.
1.5.2 Process Diagrams:
2. LITERATURE SURVEY:
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques. Big Data includes huge volume, high velocity, and extensible variety
of data. This data will be of three types.
Structured data: Relational data.
Semi Structured data: XML data.
Unstructured data: Word, PDF, Text, Media Logs.
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models, and is developed under an open-source license. It enables applications to work with thousands of nodes and petabytes of data. The Hadoop framework includes four modules: Hadoop Common, Hadoop YARN, the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The two major pieces of Hadoop are HDFS and MapReduce.
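The MapReduce model named above can be sketched in a single JVM with plain Java (hypothetical class and method names); Hadoop runs these same three steps, map, shuffle and reduce, in parallel across the cluster:

```java
import java.util.*;

// Toy single-JVM illustration of the MapReduce model: "map" emits
// (word, 1) pairs, "shuffle" groups them by key, "reduce" sums each group.
// Class and method names are assumptions for this example.
public class MiniMapReduce {
    public static Map<String, Integer> wordCount(List<String> records) {
        // map + shuffle: group emitted (word, 1) pairs by word
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String record : records)
            for (String word : record.split("\\s+"))
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        // reduce: sum the value list for each key
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            result.put(e.getKey(),
                       e.getValue().stream().mapToInt(Integer::intValue).sum());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("error warn error", "warn")));
    }
}
```

In Hadoop, each record batch would be a file split processed by a mapper on some DataNode, and the grouped sums would be computed by distributed reducers.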
Java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java Hotspot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
If the output is not as above, then install Java with the following command:
# sudo yum install java-1.7.0-openjdk
To verify whether java is installed or not we use the following command.
$ javac
1. Download the Java 8 package and save it in your home directory. Then download the Hadoop 2.7.3 package:
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
Command: tar -xvf hadoop-2.7.3.tar.gz
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Apply the changes to the current running environment:
$ source ~/.bashrc
STEP4: Now set the Java path in hadoop-env.sh using the vi editor in the etc/hadoop folder:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.95-2.6.4.0.el7_2.x86_64/jre
(b).Edit Configuration Files:
Navigate to below location
$ cd $HADOOP_HOME/etc/hadoop
Now add the following properties to these XML files:
$vi core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
$vi hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
$vi mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
$vi yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
(c). Format NameNode:
Go to the bin directory and apply the command below:
Command: bin/hdfs namenode -format
STEP5: Start Hadoop cluster
To start hadoop cluster, navigate to your hadoop sbin directory and execute scripts one
by one.
$ cd $HADOOP_HOME/sbin/
Run start-all.sh to start Hadoop:
$ start-all.sh
To stop all daemons later, run:
$ stop-all.sh
Command: cd
Command: cd hadoop-2.7.3
This formats HDFS via the NameNode. This command is executed only the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. Never format an up-and-running Hadoop filesystem: you will lose all the data stored in HDFS.
STEP7: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all the
daemons.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
Start NameNode:
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files stored in HDFS and tracks all the files stored across the cluster.
Start DataNode:
On startup, a DataNode connects to the Namenode and it responds to the requests from the
Namenode for different operations.
Start ResourceManager:
ResourceManager is the master that arbitrates all the available cluster resources and thus helps in managing the distributed applications running on the YARN system. Its work is to manage each NodeManager and each application's ApplicationMaster.
Start NodeManager:
The NodeManager in each machine framework is the agent which is responsible for
managing containers, monitoring their resource usage and reporting the same to the
ResourceManager.
Start JobHistoryServer:
JobHistoryServer is responsible for servicing all job-history-related requests from the client.
Command: ./start-all.sh
(or)
Command: ./stop-all.sh
STEP8: To check that all the Hadoop services are up and running, run the below command.
Command: jps
STEP9: Now open the browser and go to http://localhost:50070/dfshealth.html to check the NameNode status:
http://localhost:50070/
The Hadoop DataNode starts on port 50075 by default:
http://localhost:50075/
The Hadoop SecondaryNameNode starts on port 50090 by default:
http://localhost:50090/
Access port 8088 for information about the cluster and all applications:
http://localhost:8088/
3. Untar the hbase-1.1.2-bin.tar.gz tar file
a. Open command prompt
b. Type command:
>sudo tar -xzf /home/lakkireddy/edureka/hbase-1.1.2-bin.tar.gz
b. > cd /usr/lib/hbase/hbase-1.1.2/conf
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
</configuration>
e. Save and exit gedit.
7. Edit hbase-env.sh
a. On command Prompt, run following commands
b. > cd /usr/lib/hbase/hbase-1.1.2/conf
c. > sudo gedit hbase-env.sh
d. Export your java home path
e.g. export JAVA_HOME=/usr/lib/jvm/oracle_jdk8/jdk1.8.0_51
e. Save and exit geditor
f. Exit command prompt
8. Export the HBASE_HOME path in the .bashrc file by adding the following lines:
export HBASE_HOME=/usr/lib/hbase/hbase-1.1.2
export PATH=$PATH:$HBASE_HOME/bin
d. Exit vi-editor
9. Start the Hadoop services:
a. > start-dfs.sh
b. > start-yarn.sh
c. Verify that hadoop services are running, type command
> jps
10. Now start hbase services, type command
a. > start-hbase.sh
11. Verify that the hbase directory has been created on HDFS (Hadoop Distributed File System). On the command prompt, enter the following command:
a. hadoop fs -ls /tmp/hbase-hduser
Step 1: Download Pig tar file.
Step 2: Extract the tar file using the tar command. In the tar command below, x means extract an archive file, z means filter the archive through gzip, and f means the filename of the archive file.
Command: tar -xzf pig-0.16.0.tar.gz
Step 3: Edit the .bashrc file to update the environment variables of Apache Pig. We set these so that we can access Pig from any directory and need not go to the Pig directory to execute Pig commands. Also, if any other application is looking for Pig, it will get the path of Apache Pig from this file.
# Set PIG_HOME
export PIG_HOME=/home/edureka/pig-0.16.0
export PATH=$PATH:/home/edureka/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
Run the command below to apply the changes in the same terminal:
Command: source .bashrc
Step 4: Check the Pig version to verify that Apache Pig was installed correctly. If you don't get the Apache Pig version, verify that you have followed the above steps correctly.
Command: pig -version
Step 5: Check pig help to see all the Pig command options.
Command: pig -help
Step 6: Run Pig to start the grunt shell. Grunt shell is used to run Pig Latin scripts.
Command: pig
Apache Pig has two modes in which it can run; by default it chooses MapReduce mode. The other mode in which you can run Pig is local mode.
Local Mode: With access to a single machine, all files are installed and run using the local host and file system. Local mode is specified using the -x flag (pig -x local). The input and output in this mode are present on the local file system.
MapReduce Mode: This is the default mode, which requires access to a Hadoop cluster and an HDFS installation. Since this is the default mode, it is not necessary to specify the -x flag. The input and output in this mode are present on HDFS.
3. SYSTEM ANALYSIS:
3.1 Existing System:
The current processing of log files goes through ordinary sequential steps in order to perform preprocessing, session identification and user identification. The non-Hadoop approach loads the log file dataset and processes each line one after another. Each log field is then identified by splitting the data and storing it in an array list. The preprocessed log field is stored in the form of a hash table with key and value pairs, where the key is the month and the value is the integer representing the month. In the existing system, the work can run only on a single computer with a single Java virtual machine (JVM).
A JVM can handle a dataset whose size depends on the available RAM; for example, if the RAM is 2 GB, then a JVM can process a dataset of only about 1 GB. Processing log files greater than 1 GB becomes hectic. The non-Hadoop approach is performed on Java 1.6 with a single JVM. Although batch processing can be found in these single-processor programs, there are problems in processing due to limited capabilities. Therefore, it is necessary to use a parallel processing approach to work effectively on massive amounts of large datasets.
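As a small illustration of the single-JVM limitation described above, the sketch below encodes the text's rule of thumb that a JVM can comfortably process a dataset of roughly half its available RAM; the class and helper names are hypothetical.

```java
// Illustrative sketch of the single-JVM memory ceiling discussed above.
// The "half the heap" rule of thumb comes from the text's 2 GB RAM /
// 1 GB dataset example; the names here are assumptions.
public class JvmLimit {
    // returns true if a dataset of the given size plausibly fits in memory
    public static boolean fitsInHeap(long datasetBytes, long heapBytes) {
        return datasetBytes <= heapBytes / 2;
    }

    public static void main(String[] args) {
        long heap = Runtime.getRuntime().maxMemory(); // this JVM's heap limit
        System.out.println("Max heap bytes: " + heap);
        System.out.println("1 GB dataset in a 2 GB heap? "
                + fitsInHeap(1L << 30, 2L << 30)); // true, per the text
        System.out.println("3 GB dataset in a 2 GB heap? "
                + fitsInHeap(3L << 30, 2L << 30)); // false: needs Hadoop
    }
}
```

Hadoop sidesteps this ceiling by splitting the dataset into blocks and processing them on many JVMs across the cluster, as the next sections describe.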
3.3 Feasibility Study:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is to be carried out. This is to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential.
3.3.1 Economic Feasibility:
This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited. The expenditures must be justified. The developed system is well within the budget, and this was achieved because most of the technologies used are freely available. Only the customized products had to be purchased.
3.3.2 Technical Feasibility:
This study is carried out to check the technical feasibility, that is, the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would lead to high demands being placed on the client. The developed system must have modest requirements, as only minimal or no changes are required for implementing this system.
3.3.3 Social Feasibility:
This aspect of the study checks the level of acceptance of the system by the user. This includes the process of training the user to use the system efficiently. The user must not feel threatened by the system, but must accept it as a necessity. The level of acceptance by the users solely depends on the methods that are employed to educate the user about the system and to make him familiar with it. His level of confidence must be raised so that he is also able to make some constructive criticism, which is welcomed, as he is the final user of the system.
4. SYSTEM REQUIREMENTS SPECIFICATIONS:
4.1 INTRODUCTION:
Software Requirements Specification plays an important role in creating quality
software solutions. Specification is basically a representation process. Requirements are
represented in a manner that ultimately leads to successful software implementation.
Requirements may be specified in a variety of ways. However, there are some guidelines worth following:
Diagrams and other notational forms should be restricted in number and consistent in use.
Usability:
Usability is the ease of use and learnability of a human-made object. The object of
use can be a software application, website, book, tool, machine, process, or anything a human
interacts with. A usability study may be conducted as a primary job function by a usability
analyst or as a secondary job function by designers, technical writers, marketing personnel,
and others.
Reliability:
The probability that a component part, equipment, or system will satisfactorily perform its intended function under given circumstances, such as environmental conditions, limitations as to operating time, and frequency and thoroughness of maintenance, for a specified period of time.
Performance:
Accomplishment of a given task measured against preset standards of accuracy,
completeness, cost, and speed.
Supportability:
The degree to which the design characteristics of a standby or support system meet the operational requirements of an organization.
Implementation:
Implementation is the realization of an application, or execution of a plan, idea, model, design, specification, standard, algorithm, or policy.
Interface:
An interface refers to a point of interaction between components, and is applicable at the level of both hardware and software. It allows a component, whether a piece of hardware such as a graphics card or a piece of software such as an Internet browser, to function independently while using interfaces to communicate with other components via an input/output system and an associated protocol.
Legal:
Legal requirements are those established by or founded upon law, or upon official or accepted rules relating to jurisprudence.
SOFTWARE REQUIREMENTS:
Operating System : Ubuntu 14.04
Coding Language : Java
Scripting Language: Pig Latin Script
IDE : Eclipse
Web Server : Tomcat
Database : HDFS
HARDWARE REQUIREMENTS:
Processor Type : Intel (any version)
Speed : 1.1 GHz
RAM : 4GB
Hard disk : 20 GB
Keyboard : 101/102 Standard Keys
5. SYSTEM DESIGN:
5.1 INTRODUCTION:
The most creative and challenging phase of the life cycle is system design. The term
design describes a final system and the process by which it is developed. It refers to the
technical specifications that will be applied in implementations of the candidate system. The
design may be defined as the process of applying various techniques and principles for the
purpose of defining a device, a process or a system with sufficient details to permit its
physical realization.
The designer's goal is to determine how the output is to be produced and in what format. Samples of the output and input are also presented. Second, input data and database files have to be designed to meet the requirements of the proposed output.
The processing phases are handled through the program Construction and Testing. Finally,
details related to justification of the system and an estimate of the impact of the candidate
system on the user and the organization are documented and evaluated by management as a
step toward implementation.
The importance of software design can be stated in a single word: quality. Design provides us with representations of software that can be assessed for quality. Design is the only way we can accurately translate a customer's requirements into a complete software product or system. Without design we risk building an unstable system that might fail if small changes are made. It may as well be difficult to test, or could be one whose quality can't be assessed. So design is an essential phase in the development of a software product.
5.2 High-level design:
High-level design defines the complete architecture of the system being developed. In short, it is an overall representation of the design required for our target system/application. It is usually done by higher-level professionals/software architects.
The UML is a language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system.
A conceptual model of UML:
The three major elements of UML are:
1. The UML's basic building blocks.
2. The rules that dictate how those building blocks may be put together.
3. Some common mechanisms that apply throughout the UML.
Basic building blocks of the UML:
The vocabulary of UML encompasses three kinds of building blocks:
1. Things
2. Relationships
3. Diagrams
Things are the abstractions that are first-class citizens in a model;
Relationships tie these things together;
Diagrams group the interesting collection of things.
Things in UML: There are four kinds of things in the UML:
1. Structural things
2. Behavioral things.
3. Grouping things
4. Annotational things
These things are the basic object-oriented building blocks of the UML. They are used to write well-formed models.
STRUCTURAL THINGS:
Structural things are the nouns of the UML models. These are mostly static parts of
the model, representing elements that are either conceptual or physical. In all, there are seven
kinds of Structural things.
Updation and queries. They are the highest authorities within the system, and have maximum control over the entire database.
5.3.3 CLASS DIAGRAM:
Analysis of Sample Log file using Pig Latin Script:
With this command the log files are loaded into HDFS; now we can run our Pig script to analyze the log files in MapReduce mode rather than local mode.
After loading the log file into HDFS, we write the Pig script to analyze the particular log file that was loaded into HDFS.
The format of the Pig script differs depending on whether the log file is used for knowledge discovery, analysis of system threats, or analysis of user call log data.
In the Pig Latin script we can extract the log file data based on our requirements by using a particular Pig query, as shown in the snapshot below.
By using the following command we can process the log file in MapReduce mode. The MapReduce jobs run simultaneously, as shown in the snapshot below, where the MapReduce job is 80% complete and ready to display the output.
In order to display the output stored in HDFS, we can use the following query in the Pig Latin script.
Then, by clicking the part-m-00000 file, a download option becomes available to download the log analysis result, as shown below. After clicking the download option, the output file is downloaded and the result appears as shown.
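Since the snapshots of the actual Pig Latin script are not reproduced here, the following is a hypothetical sketch of the kind of script described: the input path, the space-separated field layout and the ERROR filter are assumptions, not the project's actual script.

```pig
-- Hypothetical sketch only: path, schema and filter are assumptions.
logs   = LOAD '/user/hadoop/logs/server.log' USING PigStorage(' ')
             AS (ip:chararray, ts:chararray, level:chararray, msg:chararray);
errors = FILTER logs BY level == 'ERROR';
by_ip  = GROUP errors BY ip;
counts = FOREACH by_ip GENERATE group AS ip, COUNT(errors) AS hits;
STORE counts INTO '/user/hadoop/output/error_counts' USING PigStorage(',');
```

Such a script can be run with pig script.pig (MapReduce mode, the default) or pig -x local script.pig for local testing; the result is written as comma-separated part-* files in the output directory, consistent with the part-m-00000 file and the CSV output mentioned above.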
6. CODING:
JDBC program to load data:
package net.codejava.upload;
import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;
import java.sql.*;
import java.util.Enumeration;
import java.util.Iterator;
import java.util.List;
import javax.servlet.*;
import javax.servlet.http.*;
import org.apache.commons.fileupload.FileItem;
import org.apache.commons.fileupload.FileItemFactory;
import org.apache.commons.fileupload.FileUploadException;
import org.apache.commons.fileupload.disk.DiskFileItemFactory;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
// upload settings
public UploadFile() {
super();
/**
*/
//doPost(request, response);
//throw new ServletException("GET method used with " + getClass( ).getName( )+":
POST method required.");
request.getRequestDispatcher("/WEB-INF/index.jsp").forward(request, response);
/**
*/
System.out.println("demo");
if (!ServletFileUpload.isMultipartContent(request)) {
writer.flush();
return;
// configures upload settings
factory.setRepository(new File(System.getProperty("java.io.tmpdir")));
if (!uploadDir.exists()) {
uploadDir.mkdir();
try {
System.out.println(uploadPath);
List<FileItem>formItems = upload.parseRequest((HttpServletRequest)request);
for (FileItem item : formItems) {
if (!item.isFormField()) {
// C:\tomcat\apache-tomcat-7.0.40\webapps\data\
item.write(storeFile);
System.out.println("SUCCESSFULLY UPLOADED");
Fileupload.java:
package HdfsFileOperation;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
Path workingDir=hdfs.getWorkingDirectory();
newFolderPath=Path.mergePaths(workingDir, newFolderPath);
if(hdfs.exists(newFolderPath))
hdfs.delete(newFolderPath, true);
System.out.println("Folder Created.");
Path hdfsFilePath = new Path(newFolderPath + "/dataFile1.txt");
hdfs.copyFromLocalFile(localFilePath, hdfsFilePath);
localFilePath = new Path("c://hdfsdata/datafile1.txt");
hdfs.copyToLocalFile(hdfsFilePath, localFilePath);
hdfs.createNewFile(newFilePath);
for (int i = 1; i <= 5; i++) {
    sb.append("Data");
    sb.append(i);
    sb.append("\n");
}
FSDataOutputStreamfsOutStream = hdfs.create(newFilePath);
fsOutStream.write(byt);
fsOutStream.close();
55
//Reading data From HDFS File
newInputStreamReader(hdfs.open(newFilePath)));
System.out.println(str);
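The listing above is excerpted from the project sources. Readers without a running cluster can rehearse the same write-then-read round trip against the local filesystem with plain java.nio; this is only an analogy for illustration (the HDFS FileSystem API calls shown above are what the project actually uses):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LocalRoundTrip {
    // Write the same "Data1".."Data5" payload the HDFS listing builds,
    // then read it straight back -- local stand-ins for hdfs.create/hdfs.open.
    static String roundTrip() throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 5; i++) {
            sb.append("Data").append(i).append("\n");
        }
        Path file = Files.createTempFile("dataFile", ".txt");
        Files.write(file, sb.toString().getBytes());        // analogous to hdfs.create + write
        String readBack = new String(Files.readAllBytes(file)); // analogous to hdfs.open + read
        Files.deleteIfExists(file);
        return readBack;
    }

    public static void main(String[] args) throws IOException {
        System.out.print(roundTrip());
    }
}
```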
Main.java:
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
/*
* This program processes Apache HTTP Server log files using MapReduce
*/
ClassNotFoundException, InterruptedException {
if (args.length < 1) {
    System.exit(1);
}
conf.set(
    "logEntryRegEx",
conf.set("fieldsToCount", "1569");
countJob.setJarByClass(Main.class);
countJob.setMapOutputKeyClass(Text.class);
countJob.setMapOutputValueClass(IntWritable.class);
countJob.setOutputKeyClass(Text.class);
countJob.setOutputValueClass(IntWritable.class);
countJob.setMapperClass(CountMapper.class);
countJob.setReducerClass(CountReducer.class);
countJob.setInputFormatClass(TextInputFormat.class);
countJob.setOutputFormatClass(TextOutputFormat.class);
// this runs a reduce pass on the map outputs before they are sent to the
// reducer
countJob.setCombinerClass(CountReducer.class);
    + File.separator + "counts");
if (!fileSystem.exists(inputFile)) {
    + inputFile.getParent());
    return;
}
if (fileSystem.exists(countOutput)) {
    fileSystem.delete(countOutput, true);
    System.out.println("Deleted existing output file before continuing.");
}
fileSystem.close();
FileInputFormat.addInputPath(countJob, inputFile);
FileOutputFormat.setOutputPath(countJob, countOutput);
countJob.waitForCompletion(true);
Mapper.java:
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/**
 * @param key
 * @param value
 * @param context
 */
@Override
/*
 * For each entry in the log file, generate a k/v pair for every field
 * HTTP response, User Agent etc. This mapper is very generic and the
 */
if (logEntryMatcher.find()) {
    if (!index.equals("")) {
        + logEntryMatcher.group(Integer.parseInt(index)));
        context.write(k, one);
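The report does not reproduce the full logEntryRegEx value, but the mapper's field extraction can be tried outside Hadoop. Below is a small stand-alone sketch assuming the standard Apache Common Log Format; the regex and the extract helper are illustrative assumptions, not the project's actual pattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFieldDemo {
    // Assumed Common Log Format pattern: client IP, identity, user,
    // timestamp, request line, HTTP status, response size.
    static final Pattern LOG_ENTRY = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    // Pull out the two fields a hit-count mapper most often emits as keys.
    static String extract(String line) {
        Matcher m = LOG_ENTRY.matcher(line);
        if (!m.find()) {
            return "no match";
        }
        return "ip=" + m.group(1) + " status=" + m.group(6);
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326"));
    }
}
```

Each matched group corresponds to one field index the driver can list in fieldsToCount.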
Reducer.java:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/* @see org.apache.hadoop.mapreduce.Reducer#reduce(KEYIN,
* java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer.Context)
*/
@Override
int sum = 0;
sum += value.get();
total.set(sum);
context.write(key, total);
7. TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover
every conceivable fault or weakness in a work product. It provides a way to check the
functionality of components, sub-assemblies, assemblies and/or a finished product. It is the
process of exercising software with the intent of ensuring that the software system meets its
requirements and user expectations and does not fail in an unacceptable manner. There are
various types of tests; each test type addresses a specific testing requirement.
Valid Input : identified classes of valid input must be accepted.
Invalid Input : identified classes of invalid input must be rejected.
Functions : identified functions must be exercised.
Output : identified classes of application outputs must be exercised.
Systems/Procedures : interfacing systems or procedures must be invoked.
Unit Testing:
Unit testing is usually conducted as part of a combined code and unit test phase of the
software lifecycle, although it is not uncommon for coding and unit testing to be conducted as
two distinct phases.
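For the code in this project, a unit test targets one piece of logic in isolation. For example, the summing step of CountReducer can be checked without a Hadoop cluster by reproducing it over plain ints; this is a simplified stand-in for the IntWritable values, shown only as an illustration:

```java
public class CountReducerLogicTest {
    // Mirrors the reducer's loop: sum the per-key counts emitted by the mappers.
    static int sumCounts(int[] values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three mappers each saw the same key twice -> the total must be 6.
        if (sumCounts(new int[] {2, 2, 2}) != 6) {
            throw new AssertionError("reducer sum logic failed");
        }
        System.out.println("reducer sum logic OK");
    }
}
```

Because the combiner reuses CountReducer, the same check also covers the pre-reduce pass the driver configures with setCombinerClass.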
Integration Testing:
Software integration testing is the incremental integration testing of two or more
integrated software components on a single platform to produce failures caused by interface
defects. The task of the integration test is to check that components or software applications
(e.g. components in a software system or, one step up, software applications at the company
level) interact without error.
Test Results:
All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing:
User Acceptance Testing is a critical phase of any project and requires significant
participation by the end user. It also ensures that the system meets the functional
requirements.
Test Results: All the test cases mentioned above passed successfully. No defects
encountered.
8. SCREENSHOTS
9. CONCLUSION
Log analysis helps to improve business strategies as well as to generate statistical
reports. A Hadoop MapReduce based log file analysis tool provides graphical reports
showing hits for web pages, users' page view activity, which parts of the website users are
interested in, traffic attacks etc. From these reports business communities can evaluate which
parts of the website need to be improved, who the potential customers are, and from
which IP, area or region the website is getting maximum hits, which will help in
designing future business and marketing plans. The Hadoop MapReduce framework provides
parallel distributed computing and, by replicating data, reliable storage for large volumes
of log files. Firstly, data is stored block-wise on several nodes in a cluster so that
access time is reduced, which saves much of the processing time and enhances
performance. Here Hadoop's characteristic of moving computation to the data, rather than
moving data to the computation, helps to improve response time. Secondly, MapReduce
works well distributed over large datasets, giving more efficient results. Web Server Log
Processing has a bright, vibrant scope in the field of information technology. IT organizations
analyze server logs to answer questions about security and compliance.
The proposed system will focus on a network security use case. Specifically, we will look
at how Apache Hadoop can help the administrator of a large enterprise network diagnose and
respond to a distributed denial-of-service attack.
10. BIBLIOGRAPHY
http://tipsonubuntu.com/2016/07/31/install-oracle-java-8-9-ubuntu-16-04-linux-mint-18/
http://www.tecadmin.net/setup-hadoop-2-4-single-node-cluster-on-linux/#
http://www.wikihow.com/Set-Up-Your-Java_Home-Path-in-Ubuntu
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.tutorialspoint.com/apache_pig/apache_pig_installation.htm
https://pig.apache.org/docs/r0.7.0/setup.html
http://stackoverflow.com/questions/15426142/log-files-in-hbase
https://community.hortonworks.com/content/supportkb/49162/where-can-i-find-region-server-log.html
http://data-flair.training/blogs/install-run-apache-pig-ubuntu-quickstart-guide/
http://blogs.perficient.com/delivery/blog/2015/09/09/some-ways-load-data-from-hdfs-to-hbase/
http://www.trytechstuff.com/how-to-install-pig-on-ubuntulinux/
https://www.youtube.com/results?search_query=how+to+load+unstructured+data+into+hadoop
https://sreejithrpillai.wordpress.com/2015/01/08/bulkloading-data-into-hbase-table-using-mapreduce/
http://www.cloudera.com/documentation/cdh/5-0-x/CDH5-Installation-Guide/cdh5ig_pig_install.html
http://www.tecadmin.net/steps-to-install-tomcat-server-on-centos-rhel/
http://hadooptutorial.info/pig-installation-on-ubuntu/