A PROJECT REPORT
Submitted by
P.PRIYANKA (111713205082)
IN
INFORMATION TECHNOLOGY
RMK ENGINEERING COLLEGE, CHENNAI
APRIL 2017
BONAFIDE CERTIFICATE
SIGNATURE                                                        SIGNATURE

Project Title : USER RECOMMENDER SYSTEM BASED UPON KNOWLEDGE,
AVAILABILITY AND REPUTATION FROM INTERACTIONS IN FORUMS

Semester : 08

Students :
SANJUNA DEVI C
PRIYANKA P
SANJIVA
SINDHUJA R

Project Guide :
MS. PRATHUSHA LAXMI, M.E., (Ph.D.)

HEAD OF THE DEPARTMENT
Dept. of Information Technology
The report of the project work submitted by the above students in partial
fulfillment for the award of the Bachelor of Technology Degree in INFORMATION
TECHNOLOGY of Anna University was evaluated and confirmed to be the report
of the work done by the above students.
Submitted for the project viva voce held on ____________.
ACKNOWLEDGEMENT
At the outset, we would like to express our gratitude to our beloved and respected
Chairman, Thiru. R. S. Munirathnam, for his support and blessings to accomplish the
project.
We would like to express our thanks to our Vice Chairman Thiru. R.M.Kishore
for his encouragement and guidance.
We thank our Principal, Dr. K.A. Mohamed Junaid, for creating the wonderful
environment for us and enabling us to complete the project.
We wish to express our sincere thanks and gratitude to Dr. K. VIJAYA, M.E.,
Ph.D., Head, Department of Information Technology, and our Project Guide
Ms. PRATHUSHA LAXMI, M.E., (Ph.D.), who have been a constant source of inspiration to
us and have extended their fullest co-operation and guidance, without which this
project would not have been a success.
We express our sincere gratitude and thanks to our beloved Project Coordinator
Mr. K. CHIDAMBARA THANU, M.E., (Ph.D.), for having extended his fullest
co-operation and guidance, without which this project would not have been a success.
Our thanks to all the faculty and non-teaching staff members of our department for
their constant support in completing this project.
ABSTRACT
TABLE OF CONTENTS
ABSTRACT v
LIST OF FIGURES ix
LIST OF ABBREVIATIONS x
1 INTRODUCTION 1
1.1 GENERAL
2.1 INTRODUCTION
2.3.1 HDFS
2.3.2 YARN
2.5 HDFS BLOCKS
3.2 HIVE
3.2.3 WORKING OF HIVE
3.3.2 TERMINOLOGIES
3.5 PYTHON
3.6 MYSQL
3.7 UBUNTU
4 AUTOMATED DATA VALIDATION FRAMEWORK
4.1 OBJECTIVE
4.3 LIMITATIONS
4.5 ADVANTAGES
5 SYSTEM DESIGN 26
SETUP ON CENTOS
7 PROCESSING STEPS 36
8 CONCLUSION 38
9 APPENDIX 1: SCREENSHOTS 39
10 REFERENCES 62
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 GENERAL:
1.1.1. Importance of Big Data:
In today's world, with the rapidly increasing need for the storage of
large amounts of data, a technology is needed to support it; one such technology is
"Big Data". It plays a vital role in the storage of data sets that are so large or
complex that traditional data processing application software is inadequate to deal
with them. Challenges include capture, storage, analysis, data curation,
search, sharing, transfer, visualization, querying, updating and information
privacy. The term "big data" often refers simply to the use of predictive
analytics, user behavior analytics, or certain other advanced data analytics
methods that extract value from data, and seldom to a particular size of data set.
"There is little doubt that the quantities of data now available are indeed large, but
that's not the most relevant characteristic of this new data ecosystem." Analysis of
data sets can find new correlations to "spot business trends, prevent diseases,
combat crime and so on." Scientists, business executives, practitioners of
medicine, advertising and governments alike regularly meet difficulties with large
data sets in areas including Internet search, finance, urban informatics,
and business informatics. Scientists encounter limitations in e-Science work,
including meteorology, genomics, connectomics, complex physics simulations,
biology and environmental research.
BIG DATA MARKET FORECAST, 2011-2017 (in $US Billions)
[Chart: the market grows from $7.3 billion in 2011 through $11.8 (2012), $18.6 (2013), $28.3 (2014), $38.4 (2015) and $45.3 (2016) to $50.1 billion in 2017.]
1.2. CHARACTERISTICS:
Volume
The term Volume refers to the amount of data generated. While handling
medical claims, a large amount of data is generated per day, containing details
about the individuals and their policies, which vary on a daily basis. Each
day many entries of data are registered, and the volume of data doubles and
triples every day. The volume of data may run to petabytes and exabytes, which are
too complex for traditional tools.
Variety
The term Variety refers to the various sources of data. Medical claims
consist of data coming from different sources, which may be structured, semi-structured
or unstructured, i.e. entries made in spreadsheets, claim holders' images,
etc. Basically, it deals with all types of data such as text, documents and images.
Velocity
The term Velocity refers to the speed at which data is generated and how fast it can
be processed. In medical claims, there is a new entry every second, and large
volumes of data are injected into the databases to process them.
Variability
The term Variability refers to data whose meaning is constantly changing.
Medical claims data can vary since there are continuous updates performed by the
individuals, such as closing a claim policy or adding something new to an existing
policy. So the data varies with time, and it should remain meaningful to analyze
the data generated.
Veracity
The quality of captured data can vary greatly, affecting accurate analysis.
For effective analysis of medical claims, all the data entered in the database should
be accurate enough to produce reliable results for future reference. For
example, the concerned company may want to analyze, for a particular period of
time, the amount claimed by the claim holders.
CHAPTER 2
HADOOP DISTRIBUTED FILE SYSTEM
2.1. Introduction
Solution:
A pair of NameNodes can be used in the Active-Standby mode with shared edits to handle
the NameNode failure.
Solution:
HDFS
YARN
Map Reduce
2.3.1. HDFS:
HDFS stands for Hadoop Distributed File System. It is also known as HDFS
V2 as it is part of Hadoop 2.x with some enhanced features. It is used as a
Distributed Storage System in Hadoop Architecture.
2.3.2. YARN:
YARN (Yet Another Resource Negotiator) separates MapReduce's resource management
and scheduling capabilities from the data processing component, enabling Hadoop to
support more varied processing approaches and a broader array of applications. YARN
combines a central resource manager that reconciles the way applications use Hadoop
system resources with node manager agents that monitor the processing operations of
individual cluster nodes.
HDFS has been designed keeping in view the following features. Very large
files: files that are megabytes, gigabytes, terabytes or petabytes in size. Streaming
data access: HDFS is built around the idea that data is written once but read many
times; a dataset is copied from the source and then analysis is done on that dataset
over time. Commodity hardware: Hadoop does not require expensive, highly
reliable hardware, as it is designed to run on clusters of commodity hardware.
A hard disk has concentric circles which form tracks. One file can contain
many blocks. These blocks in a local file system are typically 512 bytes and are not
necessarily contiguous. For HDFS, since it is designed for large files, the block size is
128 MB by default. Moreover, HDFS lays out the underlying local file system blocks
contiguously to minimize head seek time.
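As a rough illustration (assuming a working Hadoop installation with the hdfs command on the PATH; the file paths and the 256 MB value are examples only), the configured block size can be inspected and overridden per file as follows:

# Print the cluster-wide default block size (in bytes)
hdfs getconf -confKey dfs.blocksize
# Copy a local file into HDFS with a larger block size for this file only
hdfs dfs -D dfs.blocksize=268435456 -put /home/hadoop/claims.csv /data/claims.csv
# Verify how the file was split into blocks
hdfs fsck /data/claims.csv -files -blocks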
2. Edits: When any write operation takes place in HDFS, the directory structure
gets modified. These modifications are stored in memory as well as in edits
files (edits files are stored on the hard disk). If the existing fsimage file gets merged with
the edits, we get an updated fsimage file. This process is called checkpointing and
is carried out by the Secondary Namenode. It takes the fsimage and edits files from the
Namenode and returns an updated fsimage file after merging.
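Normally the Secondary Namenode performs this merge automatically. As a small, hedged sketch (assuming administrator access to the cluster), a checkpoint can also be forced by hand:

# Put the Namenode into safe mode so the namespace is read-only
hdfs dfsadmin -safemode enter
# Save a fresh fsimage (merging the current edits into it)
hdfs dfsadmin -saveNamespace
# Leave safe mode so clients can write again
hdfs dfsadmin -safemode leave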
The datanode stores the actual data blocks of a file in HDFS on its own local
disk. It sends signals to the Namenode periodically (called a Heartbeat) to verify that it
is active, and sends block reports to the Namenode on cluster startup as well as
periodically, at every 10th Heartbeat. The DataNodes are the workhorses of the
system. They perform all the block operations, including periodic checksum
verification. They receive instructions from the Namenode about where to put the blocks
and how to put them.
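As a quick sketch of how this looks from the command line (assuming the HDFS daemons are running), the Namenode's view of its DataNodes, their capacity and their last heartbeat can be printed with:

# Summarise the cluster: live/dead DataNodes, capacity, last heartbeat and block counts
hdfs dfsadmin -report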
2.7. Goal of HDFS:
Fault detection and recovery:
Huge datasets:
Hardware at data:
A requested task can be done efficiently, when the computation takes place
near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.
CHAPTER 3
DESCRIPTION OF THE TOOLS USED:
3.2. Hive
Initially Hive was developed by Facebook; later the Apache Software
Foundation took it up and developed it further as an open source project under the name
Apache Hive. It is used by different companies. For example, Amazon uses it in
Amazon Elastic MapReduce.
Hive is not
A relational database
A design for On Line Transaction Processing (OLTP)
A language for real-time queries and row-level updates
3.2.1. FEATURES OF HIVE:
The conjunction part of the HiveQL process engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates results the
same as MapReduce results; it uses the flavor of MapReduce. The Hadoop Distributed
File System or HBase are the data storage techniques used to store the data in the file
system.
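As a small, hedged sketch of how data ends up in Hive-managed storage (the table and column names below are hypothetical, not the project's actual schema):

# Create a Hive table backed by files in HDFS and load a delimited file into it
hive -e "CREATE TABLE IF NOT EXISTS claims_stage (claim_id STRING, claim_date STRING, amount DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;"
hive -e "LOAD DATA LOCAL INPATH '/home/hadoop/claims.csv' INTO TABLE claims_stage;"
hive -e "SELECT COUNT(*) FROM claims_stage;"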
Fig 3.2.3 Working of Hive
The JobTracker is a single point of failure for the Hadoop MapReduce
service, which means that if the JobTracker goes down, all running jobs are halted.
o Map stage : The map or mapper's job is to process the input data.
Generally the input data is in the form of a file or directory and is
stored in the Hadoop file system (HDFS). The input file is passed to
the mapper function line by line. The mapper processes the data and
creates several small chunks of data.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
Most of the computing takes place on nodes with data on local disks, which
reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data
to form an appropriate result and sends it back to the Hadoop server (a small
shell sketch of such a job is given after Fig 3.3.1).
Fig 3.3.1 Map-Reduce Algorithm
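A minimal, hedged illustration of submitting such a job from the shell, using Hadoop Streaming with trivial mapper and reducer commands (the streaming jar path and the HDFS input/output paths are assumptions that vary between installations):

# Count input lines with a streaming job: cat as the mapper, wc -l as the reducer
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /data/claims \
    -output /data/claims_linecount \
    -mapper /bin/cat \
    -reducer "/usr/bin/wc -l"
# Inspect the result (assuming a single reduce task)
hdfs dfs -cat /data/claims_linecount/part-00000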
Overall, Mapper implementations are passed the JobConf for the job via
the JobConfigurable.configure(JobConf) method and override it to initialize
themselves. The framework then calls map(WritableComparable, Writable,
OutputCollector, Reporter) for each key/value pair in the InputSplit for that task.
Applications can then override the Closeable.close() method to perform any
required cleanup.
Output pairs do not need to be of the same types as input pairs. A given
input pair may map to zero or many output pairs. Output pairs are collected with
calls to OutputCollector.collect(WritableComparable, Writable).
All intermediate values associated with a given output key are subsequently
grouped by the framework, and passed to the Reducer(s) to determine the final
output. Users can control the grouping by specifying
a Comparator via JobConf.setOutputKeyComparatorClass(Class).
The Mapper outputs are sorted and then partitioned per Reducer. The total
number of partitions is the same as the number of reduce tasks for the job. Users
can control which keys (and hence records) go to which Reducer by implementing
a custom Partitioner.
Users can optionally specify a combiner to perform local aggregation of the
intermediate outputs, which helps to cut down the amount of data transferred from
the Mapper to the Reducer.
The intermediate, sorted outputs are always stored in a simple (key-len, key,
value-len, value) format. Applications can control if, and how, the intermediate
outputs are to be compressed and the Compression Codec to be used via
the JobConf.
With 0.95, all of the reduces can launch immediately and start transferring
map outputs as the maps finish. With 1.75, the faster nodes will finish their first
round of reduces and launch a second wave of reduces, doing a much better job of
load balancing.
The scaling factors above are slightly less than whole numbers to reserve a few
reduce slots in the framework for speculative tasks and failed tasks.
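These factors come from the usual guideline that the number of reduces is roughly 0.95 or 1.75 multiplied by (number of nodes * maximum reduce slots per node). A hedged shell sketch of applying that guideline when submitting a job (the node and slot counts, jar name and driver class are placeholders, and the -D option is honored only when the driver uses ToolRunner):

# Example: 4 worker nodes, 2 reduce slots each, conservative 0.95 factor
NODES=4
SLOTS_PER_NODE=2
REDUCES=$(awk "BEGIN { printf \"%d\", 0.95 * $NODES * $SLOTS_PER_NODE }")
echo "Using $REDUCES reduce tasks"
# Pass the computed value to the job through the generic -D option
hadoop jar my-job.jar MyDriver -D mapreduce.job.reduces=$REDUCES /data/in /data/out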
Reducer NONE
In this case the outputs of the map tasks go directly to the FileSystem, into
the output path set by setOutputPath(Path). The framework does not sort the map
outputs before writing them out to the FileSystem.
Partitioner
Reporter
OutputCollector
3.3.2. TERMINOLOGIES:
PayLoad - Applications implement the Map and the Reduce functions, and
form the core of the job.
MasterNode - Node where JobTracker runs and which accepts job requests
from clients.
JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
Job - An execution of a Mapper and Reducer across a dataset.
HBase is a data model that is similar to Google's Bigtable, designed to provide
quick random access to huge amounts of structured data. It leverages the fault
tolerance provided by the Hadoop File System (HDFS).
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the
Hadoop File System and provides read and write access.
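A brief, hedged illustration of that read/write path using the HBase shell (the table, row and column names are made up for the example):

# Create a table with one column family, insert a cell and read it back
echo "create 'claims', 'cf'
put 'claims', 'row1', 'cf:amount', '2500'
get 'claims', 'row1'
scan 'claims'" | hbase shell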
Fig 3.4 Working of HBASE
Features of HBase
3.5. Python
Python features a dynamic type system and automatic memory management and
supports multiple programming paradigms, including object-oriented, imperative,
functional programming, and procedural styles. It has a large and comprehensive
standard library.
Python interpreters are available for many operating systems, allowing Python
code to run on a wide variety of systems. CPython, the reference implementation
of Python, is open source software and has a community-based development
model, as do nearly all of its variant implementations. CPython is managed by the
non-profit Python Software Foundation.
3.6. MySQL
environments. MySQL is a fast, easy-to-use RDBMS used by many small
and big businesses.
3.7. UBUNTU
Ubuntu operates under the GNU General Public License (GPL) and all of
the application software installed by default is free software. In addition, Ubuntu
installs some hardware drivers that are available only in binary format, but such
packages are clearly marked in the restricted component.
User programs run with low privileges and cannot corrupt the operating system or other
users' files. For increased security, the sudo tool is used to assign temporary
privileges for performing administrative tasks, which allows the root account to
remain locked and helps prevent inexperienced users from inadvertently making
catastrophic system changes or opening security holes. PolicyKit is also being
widely implemented into the desktop to further harden the system. Most network
ports are closed by default to prevent hacking. A built-in firewall allows end-users
who install network servers to control access, and a GUI (Gufw, for Uncomplicated
Firewall) is available to configure it. Ubuntu compiles its packages
using GCC features such as PIE and buffer overflow protection to harden its
software. These extra features greatly increase security at a performance expense
of 1% in 32-bit and 0.01% in 64-bit. The home and private directories can be
encrypted.
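A short, hedged example of the points above, using standard Ubuntu commands purely as an illustration:

# Run an administrative command with temporary privileges instead of logging in as root
sudo apt-get update
# Enable the built-in firewall and open only the port a network server needs (e.g. SSH)
sudo ufw enable
sudo ufw allow 22/tcp
sudo ufw status verbose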
CHAPTER 4
AUTOMATED DATA VALIDATION
FRAMEWORK
4.1. OBJECTIVE
This framework is mainly designed to automate the validation of data in the field of medical
claims, based on the constraints given by the respective organization, on a time basis.
Earlier, all the medical claims were processed manually. All the data were
recorded by the organization's agents and maintained in separate records for each
category. Later, people started moving these manual records into computer storage
using various spreadsheets and databases like MySQL to store the data.
After the adoption of computer software, all the data were moved to and maintained in
spreadsheets. As time passed, the amount of data generated and entered
grew so large that it could no longer be stored and processed in the existing
software.
Since only a certain amount of data can be stored in a spreadsheet, it was very
difficult to manage and process the data. So handling very large amounts of
data, i.e. petabytes and exabytes of data per day, is a difficult task.
In this proposed system, medical claims data are collected and loaded into the
database. The data are then cleaned and transformed to make them suitable for performing
validation. Once the data are loaded into the database, the system automatically validates them
based upon the given constraints, to keep the data consistent and ready for effective
analysis. After the data are validated, the invalid records are moved to and stored in the
BAD_FILES directory of HDFS. The administrator gets notified about the data
refresh through an email notification. This automation makes it easier for clients to perform
validation of large amounts of medical claims data, keeps the data effective for analysis, and also
saves a lot of time.
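A hedged sketch of the last two steps (the directory name BAD_FILES comes from the design above; the file names, recipient address and use of the mailx utility are assumptions for illustration only):

# Keep rejected records in a dedicated HDFS directory for later inspection
hdfs dfs -mkdir -p /BAD_FILES
hdfs dfs -put /home/hadoop/invalid_claims.csv /BAD_FILES/
# Notify the administrator that the data refresh and validation run has finished
echo "Data refresh completed; invalid records written to /BAD_FILES" \
  | mailx -s "Medical claims validation run" admin@example.com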
4.5. ADVANTAGES
Time efficient.
It is an automated process and hence does not require much manpower.
Once the validation process is completed and the lookup refresh is successful, an
email notification is sent to pre-defined recipients.
CHAPTER 5
SYSTEM DESIGN
Use case diagrams give a graphic overview of the actors involved in a system,
the different functions needed by those actors, and how these different functions
interact.
Fig 5.1 Usecase diagram
Sequence diagrams in UML show how objects interact with each other and the order
in which those interactions occur. It's important to note that they show the interactions for a
particular scenario. The processes are represented vertically and interactions are shown
as arrows.
5.3. ACTIVITY DIAGRAM
Fig 5.3 Activity diagram
5.5. COMPONENT DIAGRAM
Fig 5.5 Component diagram
CHAPTER 6
INITIAL SETUP PROCESS STEPS
On clicking Next, a prompt appears to set up the RAM size for the VM. Increase
the RAM up to 2048 MB if the system has 8 GB RAM, and up to 1 GB
if the system has 4 GB RAM.
Before powering on the VM, click on the Settings option and then increase the
RAM size.
Click on Next to get the option of selecting the hard disk; choose the
third option, i.e. using the existing virtual hard drive file.
Afterwards, click on the folder icon to browse to the location where the unzipped file of
CentOS is present.
Select the imported VM and click on the Start button to start the VM, and type the
username and password.
Open the terminal and log in as the root user to have administrator permissions,
and type the password.
Add more users in CentOS by using the command adduser followed by the
username.
Then, set the password of the added user by using the command passwd
followed by the password.
Disable the firewall in CentOS using the required command.
Add the user into the sudoers file to give administrator rights to the created
user. Type the required command to add the created user into the sudoers file.
Add the user by scrolling the cursor down to the appropriate position.
To type any command in the above file, enter insert mode by pressing I on the
keyboard, add the user in the sudoers file, press the Esc button to
come out of insert mode, and then type :wq to save and exit. (A consolidated
command sketch of the user, environment and SSH setup steps is given at the end of
this chapter.)
Reboot the machine and then log in as the created user. Select the option shown
with the red colored arrow symbol; on clicking the above option, the download will
start and get saved in the Downloads folder.
Move the above file into the /home directory using the mv command, and then
switch the directory to /home by typing the command cd.
Untar the JDK and extract the Java files by using the tar command.
Enter the command ls to see the extracted JDK in the same folder /home.
Download the Hadoop file and follow the same steps to untar it and move it to
/home.
Update the .bashrc file with the required environment variables, including the
Hadoop path.
Type the command source .bashrc to make the environment variables take effect.
Create two directories to store the NameNode metadata and the DataNode blocks.
Note: Change the permissions of the directories.
Change the directory to the location where Hadoop is installed.
Open hadoop-env.sh and add the Java home (path) and Hadoop home (path) in it.
Open core-site.xml and add the required properties in between the configuration tags
of core-site.xml.
Open hdfs-site.xml and add the required lines in between the configuration tags.
Open yarn-site.xml and add the required lines in between the configuration
tags.
Copy the mapred-site.xml template into mapred-site.xml and then add the
required properties.
Log in as the root user and then install the openssh server in CentOS.
Generate an ssh key for the hadoop user.
Copy the public key from the .ssh directory into the authorized_keys file.
Change the directory to .ssh and then type the required command to copy the key
into the authorized_keys file.
Type the command ls to check whether the authorized_keys file has been created
or not.
To ensure that the keys have been copied, type the cat command.
Change the permission of the .ssh directory and restart the ssh service.
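The following is a minimal, hedged command sketch of the steps above on CentOS. The version numbers, paths and property values are examples only and must be adapted to the actual installation:

# --- User creation (run as root) ---
adduser hadoop
passwd hadoop
usermod -aG wheel hadoop          # or add the user manually via visudo

# --- Environment variables appended to the hadoop user's ~/.bashrc ---
export JAVA_HOME=/home/jdk1.8.0_121              # example JDK path
export HADOOP_HOME=/home/hadoop-2.7.3            # example Hadoop path
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source ~/.bashrc                                  # make the variables take effect

# --- Example property placed between the configuration tags of core-site.xml ---
#   <property>
#     <name>fs.defaultFS</name>
#     <value>hdfs://localhost:9000</value>
#   </property>

# --- Passwordless SSH for the hadoop user ---
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
service sshd restart              # or: systemctl restart sshd on newer CentOS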
CHAPTER 7
PROCESSING STEPS
1. Data Load
2. Data Validation
3. Data Analysis
DATA LOAD
The first step before the data load is to start the Hadoop daemons.
Once the daemons are started, the data need to be loaded into the
Hive and HBase tables to make them available for the validation
parameters.
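A hedged sketch of this step, assuming a standard Hadoop/Hive setup (the table name below is hypothetical; the project's own HQL scripts are listed in the appendix):

# Start the HDFS and YARN daemons and confirm they are running
start-dfs.sh
start-yarn.sh
jps
# Load a file from HDFS into a Hive staging table
hive -e "LOAD DATA INPATH '/data/claims_raw.csv' INTO TABLE claims_stage;"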
DATA VALIDATION
After the data are loaded, the data are refreshed, and then during
the data validation phase the data are checked based on the
criteria given.
For this, the validity of the data will be checked based on
criteria such as:
o Date must be of the format YYYY-MM-DD
o Id must start with alphabets (all upper case) and end with a digit
The valid data will be stored into the HDFS directory and
the invalid data are stored in the bad_files directory (a rough sketch of
such a check is given below).
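A rough, hedged sketch of such checks on a comma-separated file using GNU awk (the field positions, file names and exact patterns are assumptions for illustration):

# Split a claims file into valid and invalid records:
#   field 1 must start with upper-case letters and end with a digit
#   field 2 must be a date in YYYY-MM-DD format
awk -F',' '$1 ~ /^[A-Z]/ && $1 ~ /[0-9]$/ && $2 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/' claims.csv > valid_claims.csv
awk -F',' '!($1 ~ /^[A-Z]/ && $1 ~ /[0-9]$/ && $2 ~ /^[0-9]{4}-[0-9]{2}-[0-9]{2}$/)' claims.csv > invalid_claims.csv
# Valid records go to the main HDFS directory, invalid ones to the bad files directory
hdfs dfs -put valid_claims.csv /data/claims/
hdfs dfs -put invalid_claims.csv /BAD_FILES/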
DATA ANALYSIS
The data are stored in the Hive and lookup tables; they are
validated based on the given conditions; the data are then
analyzed, the use cases are executed, and finally the output of
the use cases is displayed.
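As one hedged example of such a use case (the table, columns and date range are hypothetical; the project's actual queries live in analysis.hql):

# Total amount claimed per claim holder for a given period
hive -e "SELECT claim_holder, SUM(claim_amount)
         FROM claims
         WHERE claim_date BETWEEN '2016-01-01' AND '2016-12-31'
         GROUP BY claim_holder;"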
CHAPTER 8
CONCLUSION
framework based on the latest trend analysis. When developed as a complete project,
this framework will be able to ease the data validation of large data sets and reduce the time taken
for the transfer between the Hive tables and HDFS, with the growing need to store
CHAPTER 9
SCREENSHOTS
HADOOP INSTALLATION
HDFS OVERVIEW
Stopping all the existing daemons
Starting all the new daemons
Creating the hive tables
Loading the data and displaying the data in the tables
Lookup tables are getting created and loaded with data
Data refresh is being done
The config file and the validation file are run
The use cases are put in the analysis file and it is run
SAMPLE CODE:
Validation_project_master.sh
# Restart the Hadoop daemons so the run starts from a clean state
sh stop-daemons.sh
sh start-daemons.sh
# Confirm that the daemons are up
jps

# Create the Hive staging tables and load the raw claims data into them
hive -f createhivetables.hql
hive -f loadstagetables.hql

echo "LET US NOW CREATE THE LOOK UP TABLES IN HIVE AND HBASE AT THE SAME TIME..."
hive -f lookupload.hql
# Refresh the lookup data before validation
hive -f data_refresh.hql

# Validation scripts (currently commented out)
#python config.py
#python data_validation.py

echo "LET US RUN THE USE CASES.."
# Run the analysis use cases and then handle the invalid records
hive -f analysis.hql
sh invalid_file.sh