
A
PROJECT REPORT
ON

“AIRLINE ANALYSIS OF DATA USING HIVE”

Submitted in partial fulfillment


for the award of the degree of
BACHELOR OF TECHNOLOGY
(Rajasthan Technical University, Kota)
IN
INFORMATION TECHNOLOGY

SESSION (2018-2019)

Submitted To:
Ma’am Payal Awal
Assistant Professor

Submitted By:
Saumya Gupta (14EEMIT030)
Shalli Gadia (14EEMIT031)
VII Sem IT

DEPARTMENT OF INFORMATION TECHNOLOGY


GOVERNMENT WOMEN ENGINEERING COLLEGE
NASIRABAD ROAD, MAKHUPURA, AJMER-305002
March, 2018

CANDIDATE DECLARATION

We hereby declare that the work which is being presented in the seminar report entitled “AIRLINE ANALYSIS OF
DATA USING HIVE”, in partial fulfilment for the award of the degree of Bachelor of Technology in Information
Technology and submitted to the Department of Computer Engineering, Government Women’s Engineering College,
Ajmer, is an authentic record of our work carried out during the session 2018-19.

Saumya Gupta (14EEMIT030)


Shalli Gadia (14EEMIT031)
Government Women’s Engineering College, AJMER

CERTIFICATE

This is to certify that the Seminar Report entitled “AIRLINE ANALYSIS OF DATA USING HIVE”, being submitted
by Saumya Gupta & Shalli Gadia in partial fulfillment of the requirement of the degree of Bachelor of Technology in
Information Technology at Rajasthan Technical University, Kota, is work carried out by them under my guidance for the
academic session 2018-2019.

This work has been found satisfactory by me and is approved for submission.

Supervisor Name

Ma’am Payal Awal


(Assistant Professor)

Place: Ajmer

Date: 22-10-2018

ACKNOWLEDGEMENT

I would like to express my appreciation and deep gratitude to my supervisor, Ma’am Payal Awal, for her continuous
guidance and encouragement throughout the development of this seminar. I am grateful to the faculty members of the
Information Technology department for their support.

I wish to give sincere thanks to Dr. Ranjan Maheshwari, Principal, Govt. Women Engineering College, Ajmer, for
providing a suitable environment to carry out my seminar report work.

I thank Mukesh Khandelwal, Head of the Department of Information Technology, Govt. Women Engineering College,
Ajmer, for providing me all the required facilities to carry out my seminar report.

Yours Sincerely,

Saumya Gupta(15EEMIT030)
Shalli Gadia(15EEMIT031)
VII Sem IT

CHAPTER NO. TITLE NAME PAGE NO

CANDIDATE DECLARATION

CERTIFICATE

ACKNOWLEDGEMENT

ABSTRACT 1

1. INTRODUCTION

1.1 Big Data 2

1.2 Big Data History 2

1.3 Major problems in Big Data 3

2. HADOOP

2.1 What is Hadoop? 4

2.1.1 Components of Hadoop 4

2.2 Apache Hadoop ecosystem 5

2.2.1 Architecture Comparison Hadoop 1 vs Hadoop 2 6

2.3 Hadoop design principle 6

3. Hadoop distributed file system

3.1 HDFS architecture 7

3.2 Assumption and Goals 7

4. Installation process

4.1 Install virtual box 9

4.2 Install cloudera 10



5. Commands

5.1 Hadoop hdfs command 12

6. Sqoop

6.1 Description 15

6.2 Sqoop-Import command 15

6.3 Sqoop-Export command 16

7. HIVE

7.1 What is Hive? 17

7.2 Facts of Hive 17

7.3 Hive architecture 17

7.4 How to process data with hive 18

7.5 Getting started with hive 18

7.5.1 Hive table 18

7.6 Performing Queries 19

8. Analysis of airport data

8.1 Airport dataset 20

8.2 Airline dataset 21

8.3 Route dataset 22

8.4 Analysis of a few aspects 22

8.5 Methodology 23

8.5.1 Commands for analysis of airport data 23



CONCLUSION 26

REFERENCES 27

ABSTRACT

In the contemporary world, data analysis is a challenge across varied disciplines, even though each discipline has its
own specialization. Effective data analytics helps in analyzing the data of any business system, but it is big data that
supports and accelerates the process of data analysis, paving the way for the success of any business intelligence
system. As an industry expands, its data also expands, and it becomes increasingly difficult to handle the huge amount
of data that gets generated, whatever the business may be, across fields ranging from social media to finance, flight
data, the environment and health. Big Data can be used to assess risk in the insurance industry and to track reactions to
products in real time. Big Data is also used to monitor things as diverse as wave movements, flight data, traffic data,
financial transactions, health and crime. The challenge of Big Data is how to use it to create something that is of value
to the user.

How can it be gathered, stored, processed and analyzed to turn raw data into information that supports decision making?
In this report, Big Data is presented in the form of a case study on airline data. The proposed method considers the
following scenario: an airport has a huge amount of data related to the number of flights, the date and time of arrival
and departure, flight routes, the number of airports operating in each country and the list of active airlines in each
country. The problem faced until now is that only a limited amount of this data could be analyzed from conventional
databases. The intention of the proposed model is to provide a platform for new analytics on the airline data, based on
the queries described in the following chapters.

CHAPTER 1
INTRODUCTION
BIG DATA
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a
business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the
data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.

When the amount of data is beyond the processing capability of a device, that data is called big data for that
particular device.

Example: 1 GB of data is big data for a non-smart phone, but it is not big data for a PC.

Big Data History

While the term “big data” is relatively new, the act of gathering and storing large amounts of information for eventual
analysis is ages old. The concept gained momentum in the early 2000s when industry analyst Doug Laney articulated
the now-mainstream definition of big data as the three Vs:

Volume. Organizations collect data from a variety of sources, including business transactions, social media and
information from sensor or machine-to-machine data. In the past, storing it would’ve been a problem – but new
technologies (such as Hadoop) have eased the burden.

Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors
and smart metering are driving the need to deal with torrents of data in near-real time.

Variety. Data comes in all types of formats – from structured, numeric data in traditional databases to unstructured
text documents, email, video, audio, stock ticker data and financial transactions.

At SAS, we consider two additional dimensions when it comes to big data:

Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent with
periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be
challenging to manage. Even more so with unstructured data.

Complexity. Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform
data across systems. However, it’s necessary to connect and correlate relationships, hierarchies and multiple data
linkages or your data can quickly spiral out of control.

Major problems faced in big data:


- Storage of the massive data
- Processing of the data

CHAPTER 2
WHAT IS HADOOP?
Hadoop is an open source distributed processing framework that manages data processing and storage for big data
applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are
primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine
learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more
flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.

Components of Hadoop
Hadoop Distributed File System (HDFS) : HDFS is the distributed storage system that is designed to provide high-
performance access to data across multiple nodes in a cluster. HDFS is capable of storing huge amounts of data that is
100+ terabytes in size and streaming it at high bandwidth to big data analytics applications.
MapReduce: MapReduce is a programming model that enables distributed processing of large data sets on compute
clusters of commodity hardware. Hadoop MapReduce first performs the map phase, which splits a large input into
pieces and processes them to produce an intermediate set of data.
The reduce phase then takes the output of the map phase and assembles the results into a
consumable solution. Hadoop can run MapReduce programs written in many languages, such as Java, Ruby, Python and
C++. Owing to the parallel nature of MapReduce programs, Hadoop easily facilitates large-scale data analysis using
multiple machines in the cluster.
YARN: Yet Another Resource Negotiator (YARN) is a large-scale, distributed operating system for big data
applications and is considered the next generation of Hadoop’s compute platform. It brings to the table a
clustering platform that helps manage resources and schedule tasks. YARN was designed to support both global and
application-specific resource management components. Through dynamic allocation of cluster resources, YARN
improves utilization over the more static MapReduce model used in early versions of Hadoop.

Apache Hadoop Frameworks

1. Hive
Hive is an open-source data warehousing framework that structures and queries data using a SQL-like language called
HiveQL, so that developers can analyze structured data in a distributed system without hand-writing complex
MapReduce applications. If a piece of logic cannot be expressed using Hive, traditional map/reduce programmers can
plug in their custom mappers and reducers. Hive is not a full relational database system, but it can accelerate
queries using its indexing feature.
2. HBase
HBase is an open-source, distributed, versioned, non-relational database that provides random, real-time
read/write access to your big data. HBase is a NoSQL database for Hadoop. It is a great framework for businesses that
have to deal with multi-structured or sparse data. HBase makes it possible to push past a limitation of core Hadoop,
which processes data in batch and does not allow modification in place. With HBase, you can modify data in real time
without leaving the HDFS environment.
HBase is a perfect fit for the type of data that fall into a big table. HBase first performs the task of storing and
searching billions of rows and millions of columns. It then shares the table across multiple nodes, paving the way for
MapReduce jobs to run locally.
3. Pig
Pig is an open-source technology that enables cost-effective storage and processing of large data sets, without
requiring any specific formats. Pig is a high-level platform and uses Pig Latin language for expressing data analysis
programs. Pig also features a compiler that creates sequences of MapReduce programs.
The framework processes very large data sets across hundreds to thousands of computing nodes, which makes it
amenable to substantial parallelization. In simple words, we can consider Pig as a high-level mechanism that is
suitable for executing MapReduce jobs on Hadoop clusters using parallel programming.

Apache Hadoop Ecosystem

The Hadoop platform consists of two key services: a reliable, distributed file system called Hadoop Distributed File
System (HDFS) and the high-performance parallel data processing engine called Hadoop MapReduce, described
in MapReduce below. Hadoop was created by Doug Cutting and named after his son’s toy elephant. Vendors that
provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM, and Amazon.
The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in
parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-
tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read
operations against massive datasets that can scale from tens of terabytes to petabytes in size.
The popularity of Hadoop has grown in the last few years, because it meets the needs of many organizations for
flexible data analysis capabilities with an unmatched price-performance curve. The flexible data analysis features
apply to data in a variety of formats, from unstructured data, such as raw text, to semi-structured data, such as logs, to
structured data with a fixed schema.
Hadoop has been particularly useful in environments where massive server farms are used to collect data from a
variety of sources. Hadoop is able to process parallel queries as big, background batch jobs on the same server farm.
This saves the user from having to acquire additional hardware for a traditional database system to process the data
(assume such a system can scale to the required size). Hadoop also reduces the effort and time required to load data
into another system; you can process it directly within Hadoop. This overhead becomes impractical in very large data
sets.
Many of the ideas behind the open source Hadoop project originated from the Internet search community, most
notably Google and Yahoo!. Search engines employ massive farms of inexpensive servers that crawl the Internet,
retrieving Web pages into local clusters where they are analyzed with massive, parallel queries to build search indices
and other useful data structures.
The Hadoop ecosystem includes other tools to address particular needs. Hive is a SQL dialect and Pig is a dataflow
language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for
user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift and Protobuf are
platform-portable data serialization and description formats.

HADOOP DESIGN PRINCIPLES


1. Facilitate the storage and processing of large and rapidly growing data sets
> structured and unstructured data
> simple programming models
2. Scale out rather than scale up
3. Bring code to data rather than data to code
4. High scalability and availability
5. Use commodity hardware

CHAPTER 3
HADOOP DISTRIBUTED FILE SYSTEM – HDFS

HDFS Architecture

Assumptions and Goals

Hardware Failure – In HDFS, hardware failure is very common. An HDFS instance may consist of hundreds or
thousands of servers, each holding part of the data, so in such a large network there is a real probability that some
component will fail. Thus, error and fault detection with automatic recovery is one of the main goals of HDFS.

Streaming Data Access – Streaming data means shifting/transferring data at a constant (high) rate in order to carry out
various functions. With data streaming, HDFS can provide services such as High Definition TV delivery or constant
backup to a storage medium. Data is therefore read continuously at a constant rate rather than in individual
blocks/packets.

Latency – Latency is defined as the time delay caused by the various operations during a process. In Hadoop, initial time
is spent in activities such as resource distribution, job submission and split creation. Thus, latency in Hadoop is quite
high.

Large Data Sets – Applications that run on Hadoop work with considerable data sets; memory requirements can
vary from gigabytes to terabytes.

Moving Computation vs Moving Data – In HDFS, computation is moved to the data. In Hadoop, taking the computation
towards the data is more efficient, and HDFS provides an interface that moves the application to the data where it is located.

Name Node and Data Node


The Hadoop Distributed File System follows a master-slave architecture. A Hadoop cluster consists of a
single Name Node, which acts as the master server; it is used for managing the file system namespace and it provides
regulated access to files by clients.

Difference between Name Node and Data Node



The Name Node is used for executing file system namespace operations like closing and renaming files and directories,
whereas Data Nodes are responsible for reading and writing data. The Name Node is responsible for the mapping of blocks to Data
Nodes, while Data Nodes handle block creation, replication and deletion. In HDFS, a file is divided into one or more blocks.

Hard Link
A hard link is a file that links a name with a file in the distributed file system. There can be multiple hard links for the same
file; creating multiple names for the same file produces an aliasing effect, for example if the contents of file 1 are altered,
the change will be visible when the same file is opened under another name.

Soft Link, Symbolic Link


In HDFS, a symbolic link holds a reference to another file or directory in the form of a relative path to the target. If the link
is deleted, the target is not affected. If the target is moved or removed, the link keeps pointing to the old target, and the
reference to the non-existing target is broken.

Replication Factor
The replication factor is defined as the number of copies that should be maintained for a particular file. The replication
factor is stored in the Name Node, which maintains the file system namespace.

Data Replication
Data replication is a main feature of HDFS. Data replication makes HDFS a very reliable system that can store large
files. Files are broken into blocks, which are then stored; all blocks of a file have the same size except the last block. To
provide reliability, blocks are replicated. In HDFS, the block size and replication factor are specified during creation, but
they are not fixed and can be changed later. The Name Node receives a block report and a heartbeat from each Data Node
at periodic intervals, thus ensuring that the Data Nodes are working properly. A block report contains the list of all blocks
on a Data Node. Files can be written only once, and the Name Node makes the decisions for replication of blocks.
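
As a sketch, the replication factor of an existing file can be changed from the HDFS shell; the path below is only illustrative:

hdfs dfs -setrep -w 3 /user/cloudera/sample.txt

Here -w makes the command wait until the new replication factor has taken effect on all blocks of the file.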

Replica Placement
Optimized replica placement is what distinguishes the Hadoop Distributed File System from most other distributed file systems.

Secondary Name Node


The Secondary Name Node is used for connecting with the Name Node and builds snapshots of the primary Name Node’s directory information.

CHAPTER 4
INSTALLATION PROCESS

Install VirtualBox 5.0.4

Steps to install VirtualBox on a Windows host



Install Cloudera as a virtual appliance

Steps to deploy Cloudera CDH 5.4.x with VirtualBox



CHAPTER 5
COMMANDS

Hadoop HDFS Commands


Hadoop HDFS version Command Usage

version

Hadoop HDFS version Command Example

hdfs version

Hadoop HDFS version Command Description


This Hadoop command prints the Hadoop version

mkdir

Hadoop HDFS mkdir Command Usage

mkdir <path>

Hadoop HDFS mkdir Command Example

hdfs dfs -mkdir /user/dataflair/dir1

Hadoop HDFS mkdir Command Description

This HDFS command takes a path URI as an argument and creates a directory. With the -p option it also creates any
parent directories in the path that are missing (similar to mkdir -p in Linux).

ls

Hadoop HDFS ls Command Usage

ls <path>

Hadoop HDFS ls Command Example

hdfs dfs -ls /user/dataflair/dir1

Hadoop HDFS ls Command Description


This Hadoop HDFS ls command displays a list of the contents of a directory specified by path provided by the user,
showing the names, permissions, owner, size and modification date for each entry.

Hadoop HDFS ls Command Example

hdfs dfs -ls -R

Hadoop HDFS ls Description


This Hadoop fs command behaves like -ls, but recursively displays entries in all subdirectories of a path.

put

Hadoop HDFS put Command Usage



put <localSrc> <dest>

Hadoop HDFS put Command Example

hdfs dfs -put /home/dataflair/Desktop/sample /user/dataflair/dir1

Hadoop HDFS put Command Description


This hadoop basic command copies the file or directory from the local file system to the destination within the DFS.


copyFromLocal

Hadoop HDFS copyFromLocal Command Usage

copyFromLocal <localSrc> <dest>

Hadoop HDFS copyFromLocal Command Example

hdfs dfs -copyFromLocal /home/dataflair/Desktop/sample /user/dataflair/dir1

Hadoop HDFS copyFromLocal Command Description

This command works like put; the only difference is that the source is restricted to a local file reference.

get

Hadoop HDFS get Command Usage

get [-crc] <src> <localDest>

Hadoop HDFS get Command Example

hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop

Hadoop HDFS get Command Description


This HDFS fs command copies the file or directory in HDFS identified by the source to the local file system path
identified by local destination.

Hadoop HDFS getmerge Command Example

hdfs dfs -getmerge /user/dataflair/dir2/sample /home/dataflair/Desktop

Hadoop HDFS getmerge Command Description


This HDFS basic command retrieves all files that match the source path entered by the user in HDFS and merges them
into one single file in the local file system identified by the local destination.

Hadoop HDFS getfacl Command Example

hadoop fs -getfacl /user/dataflair/dir1/sample

hadoop fs -getfacl -R /user/dataflair/dir1

Hadoop HDFS getfacl Command Description


This Apache Hadoop command shows the Access Control Lists (ACLs) of files and directories. If a directory contains
a default ACL, then getfacl also displays the default ACL.
Options :
-R: It displays a list of all the ACLs of all files and directories recursively.
path: File or directory to list.

Hadoop HDFS getfattr Command Example

hadoop fs -getfattr -d /user/dataflair/dir1/sample

Hadoop HDFS getfattr Command Description


This command displays the extended attribute names and values, if any, for a file or directory; the -d option dumps all
extended attributes associated with the path.

copyToLocal

Hadoop HDFS copyToLocal Command Usage

copyToLocal <src> <localDest>

Hadoop HDFS copyToLocal Command Example

hdfs dfs -copyToLocal /user/dataflair/dir1/sample /home/dataflair/Desktop

Hadoop HDFS copyToLocal Description


This command works like get; the only difference is that the destination is restricted to a local file reference.

cat

Hadoop HDFS cat Command Usage

cat <file-name>

Hadoop HDFS cat Command Example

hdfs dfs -cat /user/dataflair/dir1/sample

Hadoop HDFS cat Command Description


This Hadoop fs shell command displays the contents of the filename on console or stdout.

mv

Hadoop HDFS mv Command Usage

mv <src> <dest>

Hadoop HDFS mv Command Example

hadoop fs -mv /user/dataflair/dir1/purchases.txt /user/dataflair/dir2

Hadoop HDFS mv Command Description


This basic HDFS command moves the file or directory indicated by the source to destination, within HDFS.

cp

Hadoop HDFS cp Command Usage

cp <src> <dest>

Hadoop HDFS cp Command Example

hadoop fs -cp /user/dataflair/dir2/purchases.txt /user/dataflair/dir1

Hadoop HDFS cp Command Description


This Hadoop File system shell command copies the file or directory identified by the source to destination, within
HDFS.

CHAPTER 6
SQOOP
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

Description
Sqoop supports incremental loads of a single table or a free-form SQL query, as well as saved jobs which can be run
multiple times to import updates made to a database since the last import. Imports can also be used to populate tables
in Hive or HBase. Exports can be used to put data from Hadoop into a relational database. Sqoop gets its name from
SQL + Hadoop. Sqoop became a top-level Apache project in March 2012.

Sqoop – IMPORT Command

sqoop import --connect jdbc:mysql://localhost/employees --username edureka --table employees



Sqoop – Export
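
A minimal sketch of a corresponding export command, mirroring the import example above; it assumes the MySQL table employees already exists and that /user/edureka/employees is the HDFS directory holding the data to be exported (both names are illustrative):

sqoop export --connect jdbc:mysql://localhost/employees --username edureka --table employees --export-dir /user/edureka/employees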

CHAPTER 7
HIVE

What is Hive?

• Apache Hive is a data warehouse software project (initially developed by Facebook) built on top of Apache Hadoop
for providing data summarization, query and analysis.
• Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with
Hadoop. It provides a SQL interface to query data stored in the Hadoop distributed file system (HDFS) or in Amazon S3,
which on AWS EMR is accessed through an HDFS-like abstraction layer called EMRFS (Elastic MapReduce File System).

Apache Hive: Fast Facts

Hive Architecture

How to process data with Apache Hive?


• After the Hive compiler turns a query into a plan of stages, the execution engine submits these stages to the
appropriate components. Within each task, the deserializer associated with the table or with the intermediate output is
used to read rows from HDFS files, and the rows are passed through the associated operator tree. Once the output is
generated, it is written to a temporary HDFS file through the serializer. The temporary file then feeds the subsequent
map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table’s location.
• For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch
call from the Driver.

Getting Started With Hive

Hive Table

A Hive Table:
- Data: a file or group of files in HDFS
- Schema: metadata stored in a relational database (the metastore)

Schema and Data are separate


- A schema can be defined for existing data
- Data can be added or removed independently
- Hive can be pointed at existing data

You have to define a schema if you have existing data in HDFS that you want to use in Hive.

Defining a Table
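
A minimal HiveQL sketch of what a table definition looks like; the flights table and its columns here are made up purely for illustration:

CREATE TABLE IF NOT EXISTS flights (
  flight_id INT,
  carrier STRING,
  origin STRING,
  destination STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;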

Managing Table
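
Typical statements for managing tables, again using the illustrative flights table (old_flights in the DROP statement is just a placeholder name):

SHOW TABLES;
DESCRIBE flights;
ALTER TABLE flights ADD COLUMNS (dep_delay INT);
DROP TABLE IF EXISTS old_flights;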

Loading Data
• Use LOAD DATA to import data into a Hive table (see the example below)

• The data is not modified by Hive – it is loaded as is


• Use the keyword OVERWRITE to replace the data already in the table
• The schema is checked only when the data is queried
• If a row does not match the schema, it will be read as null
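
For example, a local file could be loaded into the illustrative flights table defined earlier; the file path here is made up:

LOAD DATA LOCAL INPATH '/home/cloudera/flights.csv' OVERWRITE INTO TABLE flights;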

Performing Queries (HiveQL)


• SELECT

• Supports the following:


• WHERE clause
• UNION ALL and DISTINCT
• GROUP BY and HAVING
• LIMIT clause
• Can use a REGEX column specification (see the example below)
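
A small illustrative query combining several of these clauses, again over the hypothetical flights table:

SELECT carrier, COUNT(*) AS num_flights
FROM flights
WHERE origin = 'DEL'
GROUP BY carrier
HAVING COUNT(*) > 10
LIMIT 5;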

Subqueries
• Hive supports subqueries only in the FROM clause (see the sketch below)

• The columns in the subquery SELECT list are available in the outer query
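
An illustrative FROM-clause subquery over the same hypothetical flights table:

SELECT t.carrier, t.num_flights
FROM (SELECT carrier, COUNT(*) AS num_flights FROM flights GROUP BY carrier) t
WHERE t.num_flights > 100;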

CHAPTER 8
ANALYSIS OF AIRPORT DATA
The proposed method considers the following scenario: an airport has a huge amount of data related to the number of
flights, the date and time of arrival and departure, flight routes, the number of airports operating in each country and
the list of active airlines in each country. The problem faced until now is that only a limited amount of this data could
be analyzed from conventional databases. The intention of the proposed model is to provide a platform for new
analytics on the airline data, based on the queries listed later in this chapter.

Airport Data Set

The attributes and their descriptions are as follows:
Airport ID: Unique OpenFlights identifier for this airport.
Name: Name of airport. May or may not contain the City name.
City: Main city served by airport. May be spelled differently from Name.
Country: Country or territory where airport is located.
IATA/FAA: 3-letter FAA code, for airports located in Country "United States of America".
ICAO: 4-letter ICAO code.
Latitude: Decimal degrees, usually to six significant digits. Negative is South, positive is North.
Longitude: Decimal degrees, usually to six significant digits. Negative is West, positive is East.
Altitude: In feet.
Timezone: Hours offset from UTC. Fractional hours are expressed as decimals, e.g. India is 5.5.
DST: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
Tz database time zone: Timezone in "tz" (Olson) format, e.g. "America/Los_Angeles".

Airline Data Set

The attributes and their descriptions are as follows:
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline. For example, All Nippon Airways is commonly known as "ANA".
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: Country or territory where airline is incorporated.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago, but have not had their IATA code reassigned (e.g. Ansett/AN), will incorrectly show as "Y".

Route Data Set


The attributes and their descriptions are as follows:
Airline: 2-letter (IATA) or 3-letter (ICAO) code of the airline.
Airline ID: Unique OpenFlights identifier for the airline.
Source airport: 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
Source airport ID: Unique OpenFlights identifier for the source airport.
Destination airport: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
Destination airport ID: Unique OpenFlights identifier for the destination airport.
Codeshare: "Y" if this flight is a codeshare (that is, not operated by Airline, but by another carrier), empty otherwise.
Stops: Number of stops on this flight ("0" for direct).
Equipment: 3-letter codes for the plane type(s) generally used on this flight, separated by spaces.

This report proposes a method to analyze a few aspects of the airline data, such as:

a) the list of airports operating in India,


b) the list of airlines having zero stops,
c) the list of airlines operating with a code share,
d) the country with the highest number of airports,
e) the list of active airlines in the United States.

METHODOLOGY
Methodology used is as follows:

1. Create tables with the required attributes


2. Extract the semi-structured data into the tables using the LOAD DATA command
3. Analyze the data for the following queries:
a) list of airports operating in India
b) list of airlines having zero stops
c) list of airlines operating with a code share
d) which country has the highest number of airports
e) list of active airlines in the United States

Commands for analysis airport data

Fig.: Create tables and load the data sets into HDFS

create table airports (id int, name string, city string, country string, faa string, icao string, latitude string,
longitude string, timezone string, DST string, tz string) row format delimited fields terminated by ',';

create table final_airlines (airline string, name string, alias string, IATA string, ICAO string, callsign
string, country string, active string) row format delimited fields terminated by ',';

create table routes (airline string, id string, source string, source_id string, destination string,
destination_id string, code_share string, stops string, equipments string) row format delimited fields
terminated by ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/airlines/Final_airlines.enc' INTO TABLE final_airlines;

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/airlines/airports_mod.enc' INTO TABLE airports;

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/airlines/routes.enc' INTO TABLE routes;

Fig.: Count of active and inactive airlines

create table active_airlines as select active, count(*) as status from final_airlines group by active;

select * from active_airlines;
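
For query (c), the list of airlines operating with a code share, a minimal HiveQL sketch using the routes table defined above; it assumes code-share flights are flagged "Y" in the code_share column, as described in the route data set:

create table codeshare_routes as select * from routes where code_share LIKE '%Y%';

select * from codeshare_routes;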



Fig.: List of airlines having zero stops

create table zero_stops as select * from routes where stops LIKE '%0%';

select * from zero_stops;

Fig.: List of airports operating in India



create table india_airports as select * from airports where country LIKE '%India%';

select * from india_airports;

Fig.: List of active airlines in the United States
create table active_airlines_US as select * from final_airlines where (country LIKE '%United States%') and (active LIKE '%Y%');
select * from active_airlines_US;
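
For query (d), reading it as the country with the largest number of airports, a minimal HiveQL sketch over the airports table defined above:

create table airports_per_country as select country, count(*) as total from airports group by country;

select * from airports_per_country order by total desc limit 1;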

CONCLUSION
This project addresses the related work on distributed databases found in the literature, the challenges
ahead with big data, and a case study on airline data analysis using Hive. The authors attempted a detailed
analysis of the airline data sets, such as listing the airports operating in India, the airlines having zero stops,
the airlines operating with a code share, the country with the highest number of airports and the active
airlines in the United States. The focus here is on processing big data sets using the Hive component of the
Hadoop ecosystem in a distributed environment. This work will benefit developers and business analysts
in accessing and processing their user queries.

REFERENCES
[1] Challenges and Opportunities with Big Data, http://cra.org/ccc/wpcontent/uploads/sites/2/2015/05/bigdatawhitepaper.pdf
[2] Oracle: Big Data for Enterprise, June 201, http://www.oracle.com/us/products/database/big-data-for-enterprise-519135.pdf
[3] Data set taken from Edureka, http://www.edureka.co/my-course/big-data-and-hadoop
[4] HiveQL Language Manual
[5] Apache Tez
