
Final Year Industrial Training - 7CSTR Report

Data Analysis of Goibibo and Make My Trip Dataset

Submitted in
partial fulfillment of the degree of Bachelor of Technology
Rajasthan Technical University

By

Bhavika Sabnani
PCE16CE027

Department of Computer Engineering


Poornima College of Engineering, Jaipur
(Academic Year 2019-20)

CERTIFICATE BY COMPANY/ INDUSTRY / INSTITUTE

[B. Tech VII Semester] Department of Computer Engineering i



CANDIDATE'S DECLARATION

I, BHAVIKA SABNANI, hereby declare that I have undertaken 60 days of industrial training at NetParam Technologies Pvt. Ltd. (an IT Division of C-DAC ATC NETCOM) from 04 May, 2019 to 04 July, 2019 in partial fulfillment of the requirements for the award of the degree of B.Tech (Computer Engineering) at POORNIMA COLLEGE OF ENGINEERING, JAIPUR.

The work presented in this training report, submitted to the Department of Computer Engineering at POORNIMA COLLEGE OF ENGINEERING, JAIPUR, is an authentic record of my training work.

It has not been submitted anywhere else for the award of any degree, diploma or fellowship of any University or Institution.

Signature of the Student

The Number of Weeks / Days industrial training Viva–Voce Examination of________Student


Name__________ has been held on ____________ and accepted.

Signature of Internal Examiner Signature of External Examiner


ACKNOWLEDGEMENT

A project of such vast coverage cannot be realized without help from numerous sources and people in the organization. I am thankful to Mr. Shashikant Singhi, Chairman, PGC and Prof. (Dr.) Mahesh M. Bundele, Director, PCE for providing me a platform to carry out this training successfully.

I am also very grateful to Dr. Surendra Yadav (HOD,CE) for his kind support.

I would like to take this opportunity to express my gratitude to Mrs. Shalini Puri and Dr. Sunil Pathak (Practical Training Seminar - 7CSTR), who helped me complete my Final Year Practical Training successfully. They guided and motivated me, and were a source of inspiration in carrying out the work needed to complete the training.

I am also grateful to Mr. Sharthak Achargee for his guidance and support.

I am also privileged to have had the support of Mr. Manish Kumar Sharma, who provided valuable facilities without which this work could not have been completed.

I would also like to express my heartfelt appreciation to all of my friends, whose direct or indirect suggestions helped me develop this project, and to all team members for their valuable suggestions.

Lastly, thanks to all faculty members of Computer Engineering department for their moral
support and guidance.


ABSTRACT
The data generated by different sources is increasing day by day, which has given rise to the term “Big Data”. This data must be managed and analyzed properly so that useful insights can be drawn from it, and many organizations that work with Big Data benefit from such analysis. Hadoop is an open-source framework that provides different tools to store and analyze Big Data. In this work, the hotel booking datasets of two popular websites, Goibibo and Make My Trip, are analyzed using Hadoop.


TABLE OF CONTENTS
CHAPTER 1 : INTRODUCTION
1.1 What is Data?
1.2 What is Big Data?
1.3 Database and Database Management System
1.3.1 Types of Databases
1.4 Hadoop : A Solution of Big Data

CHAPTER 2 : TRAINING WORK UNDERTAKEN
2.1 What is Big Data?
2.1.1 Big Data Characteristics
2.2 What is Hadoop?
2.2.1 History of Hadoop
2.2.1.1 Difference between Hadoop1 and Hadoop2
2.2.1.2 Difference between Hadoop2 and Hadoop3
2.2.2 Features of Hadoop
2.3 Hadoop Ecosystem
2.3.1 Hadoop Distributed File System (HDFS)
2.3.1.1 HDFS Nodes
2.3.1.2 HDFS Architecture
2.3.2 MapReduce
2.3.2.1 Working of MapReduce
2.3.3 Yet Another Resource Negotiator (YARN)
2.3.3.1 YARN Architecture
2.4 Hive
2.4.1 Hive Architecture
2.4.2 Partitioning in Hive
2.4.3 Bucketing in Hive
2.5 Apache Pig
2.5.1 Pig Latin
2.5.2 Basic Types of Data Models in Pig
2.5.3 Pig Architecture
2.6 Power BI
2.6.1 Power BI Products
2.6.2 DAX (Data Analysis Expressions) in Power BI
2.6.3 Tableau vs Power BI

CHAPTER 3 : RESULTS AND DISCUSSIONS
3.1 Introduction
3.2 Agenda Behind The Project
3.3 Dataset Description
3.4 Tables in Excel
3.5 Flow Chart of Project
3.6 Queries Applied
3.6.1 For Goibibo Dataset
3.6.1.1 Queries done using Hive
3.6.1.2 Queries done using Pig
3.6.2 For Make My Trip Dataset
3.6.2.1 Queries done using Hive
3.6.2.2 Queries done using Pig

CHAPTER 4 : FUTURE SCOPE AND CONCLUSION


LIST OF FIGURES

Figure Number    Caption

Figure 2.1    Big Data Characteristics
Figure 2.2    Hadoop V.1.x Components
Figure 2.3    Hadoop V.2.x Components
Figure 2.4    Hadoop Ecosystem
Figure 2.5    Architecture of HDFS
Figure 2.6    Working of MapReduce
Figure 2.7    Architecture of YARN
Figure 2.8    Apache Pig Architecture
Figure 3.1    Dataset of Goibibo in Excel
Figure 3.2    Dataset of Make My Trip in Excel
Figure 3.3    Flow Chart of Project
Figure 3.4    Count of hotels of different ratings in different states
Figure 3.5    Count of hotels of different star ratings of 10 cities
Figure 3.6    Count of different Property Types (Goibibo.com)
Figure 3.7    Graph showing count of different Property Types
Figure 3.8    Average of Food and Drinks rating of hotels of 5 states
Figure 3.9    Average of Food and Drinks rating for hotels of different ratings
Figure 3.10   Count of property types for Andhra Pradesh
Figure 3.11   Count of different property types (Makemytrip.com)


LIST OF TABLES

Table No.    Title of Table

Table 2.1    Difference between Hadoop1 and Hadoop2
Table 2.2    Difference between Hadoop2 and Hadoop3
Table 2.3    Difference between Tableau and Power BI
Table 3.1    Description of editgoibibo.csv
Table 3.2    Description of editmake.csv


LIST OF ABBREVIATIONS

Abbreviation    Full Form

DBMS Database Management System

RDBMS Relational Database Management System

SQL Structured Query Language

HDFS Hadoop Distributed File System

YARN Yet Another Resource Negotiator

HQL Hive Query Language

BI Business Intelligence

MR MapReduce


CHAPTER 1
INTRODUCTION
1.1 What is Data?

Data is raw facts or figures related to an object or event. All the facts (meaningful or meaningless) collected about an object are termed data. The processed form of data is known as information. Data can be categorized into three types –

i. Structured

ii. Unstructured

iii. Semi-structured

1.2 What is Big data?

The term “Big Data” refers to a huge amount of data that may be structured, semi-structured or unstructured. Enterprises are generating very large volumes of data at high speed. The reasons behind this volume are rapid innovation, emerging companies like Google, Yahoo, Amazon and eBay, and the rapid growth of the internet population.

This data includes videos, images, sensor data, weblogs etc. generated from different sources like mobile phones, military surveillance, scientific research, e-commerce websites and so on. This has given rise to a new type of data called Big Data, which is unstructured, sometimes semi-structured, and unpredictable in nature. Social media websites generate data in real time, and its volume is increasing exponentially day by day. As the data grows constantly, it is a tedious task to manage and process such a huge volume. Organizations store a large amount of data and often fail to create useful insights from it, because analyzing a large data set is not easy.

1.3 Database and Database Management System

A database can be defined as an organized collection of data. A database stores data in a systematic way and supports access to and manipulation of this data.

Manipulation of data in a database is done with the help of a software program known as a Database Management System (DBMS). A DBMS is a collection of programs designed for the manipulation and management of data stored in a database. This manipulation includes inserting new data and updating or deleting existing data. Some examples of Database Management Systems are MySQL, Oracle, PostgreSQL, MariaDB and IBM DB2.

1.3.1 Types of Databases –

1. Hierarchical Database

2. Network Database

3. Relational Database

4. Object-Oriented Database

5. Graph Database

1.4 Hadoop : A Solution of Big data

An RDBMS cannot solve the big data problem because it is designed to store structured data, while the data being generated has no fixed format: it can be structured, unstructured or semi-structured. In addition, an RDBMS processes very large data sets slowly. Due to these limitations, Hadoop came into existence as a solution to the big data problem.

Hadoop is an open-source framework written in Java and developed by the Apache Software Foundation. It was created by Doug Cutting. It is based on distributed computing and stores large data sets across a cluster of commodity hardware in a distributed manner.

CHAPTER 2

TRAINING WORK UNDERTAKEN

2.1 What is Big Data?

Big Data is a term used for a huge amount of data. This data comprises both structured and unstructured data and is growing at a very fast rate day by day. Organizations face challenges in managing and analyzing this data to produce meaningful results, because traditional database systems are unable to manage such large data sets.

2.1.1 Big Data Characteristics (5 V’s of Big Data)

Characteristics of Big Data can be defined by these 5 V’s of Big Data –

• Volume – Volume is the amount of data generated.

• Velocity – Velocity describes the speed at which the data is generated.

• Variety – Variety defines the type of data, as data can be categorized as structured, semi-structured or unstructured.

• Veracity – Veracity is the quality or trustworthiness of the data. It shows the accuracy of the data.

• Value – Value means how much meaningful or useful data we can extract.


Fig 2.1 Big Data Characteristics

2.2 What is Hadoop?

Hadoop is a solution to Big Data problems. It is an open-source framework written in Java and developed by the Apache Software Foundation. Hadoop was created by Doug Cutting. It is based on distributed computing and stores large data sets across a cluster of commodity hardware in a distributed manner.

2.2.1 History of Hadoop

• Hadoop V.1.x Components

Hadoop 1 has only two components – HDFS and MapReduce. HDFS is used for storage. Hadoop 1 uses MapReduce for both data processing and resource management.


Fig 2.2 Hadoop V.1.x Components

• Hadoop V.2.x Components

In Hadoop 2, HDFS is still used for storage, but MapReduce is responsible only for data processing. A new component, YARN, was introduced for resource management.

Fig 2.3 Hadoop V.2.x Components

2.2.1.1 Difference between Hadoop 1 and Hadoop 2

Table 2.1 Difference between Hadoop1 and Hadoop2

S. No | Hadoop1 | Hadoop2
1 | Hadoop1 supports the MapReduce (MR) processing model only. | Hadoop2 can work with MR as well as other distributed computing models like Spark, Hama, Giraph & HBase coprocessors.
2 | MR does both processing and cluster-resource management. | YARN is used for cluster resource management and different processing models are used for processing.
3 | It supports only 4000 nodes per cluster. | Scalable up to 10000 nodes per cluster.
4 | Hadoop1 has a Single Point of Failure, because it has a single NameNode. | Supports a standby NameNode in the case of NameNode failure.
5 | Does not support Microsoft Windows. | Hadoop2 supports Microsoft Windows.

2.2.1.2 Difference between Hadoop2 and Hadoop3

Table 2.2 Difference between Hadoop2 and Hadoop3

S. No | Hadoop2 | Hadoop3
1 | Java version 6 was the minimum requirement. | Java version 8 is the minimum requirement.
2 | HDFS supports replication for fault tolerance. | HDFS also supports erasure coding.
3 | YARN timeline service was introduced. | YARN timeline service v.2 (improved scalability and reliability).
4 | A standby NameNode was introduced. | Supports more than 2 NameNodes.

2.2.2 Features of Hadoop

• Open Source

• Reliability

• Fault Tolerance

• High Availability

• Scalability

• Economic

• Distributed Processing

2.3 Hadoop Ecosystem

Three main parts of Hadoop Ecosystem are –

• Hadoop Distributed File System (HDFS)

• MapReduce

• YARN(Yet Another Resource Negotiator)

Along with these three, some other components of the Hadoop ecosystem are –

• Hive

• Pig

• Sqoop

• HBase

• Flume

Fig 2.4 Hadoop Ecosystem

2.3.1 Hadoop Distributed File System (HDFS)

HDFS is the most important component of the Hadoop ecosystem and the primary storage system of Hadoop. It is a Java-based distributed file system that provides scalable, fault-tolerant, reliable and cost-efficient data storage for Big Data, and it runs on commodity hardware.

2.3.1.1 HDFS Nodes

HDFS consists of two types of nodes –

• NameNode - It is also known as the Master node. The NameNode does not store the actual file data. It is responsible for storing the metadata, i.e. the number of blocks created, the location of each block, details of the DataNodes on which the data is stored, and other information about files.

• DataNode - It is also known as the Slave node. A DataNode stores the actual file data in the form of blocks and performs data read and write operations. DataNodes also perform block creation, deletion and replication tasks on the instruction of the NameNode.

2.3.1.2 HDFS Architecture

Fig 2.5 Architecture of HDFS

2.3.2 MapReduce

Hadoop MapReduce is the processing component of Hadoop ecosystem, which provides parallel

processing of data. All the data which is stored in the Hadoop Distributed File system is processed

by MapReduce programs. Due to this parallel processing of data large datasets can be processed at

a faster rate.

2.3.2.1 Working of MapReduce

MapReduce component of Hadoop works in two phases, which are -

• Map phase

• Reduce phase

Each phase takes input in the form of key-value pairs and also produces output in the form of key-value pairs. The processing is accomplished by two functions – the map function and the reduce function. The map function takes a set of data as input and produces a set of key-value pairs, called the intermediate output. This intermediate output is supplied as input to the reduce function, which generates the final result.
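
The two phases can be imitated in a few lines of plain Python. The sketch below is not Hadoop code: the function names `map_fn`, `shuffle` and `reduce_fn` and the sample records are assumptions made for illustration. It counts hotels per state, with the grouping step standing in for the shuffle that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(record):
    # record is a (hotel_name, state) pair; emit one (state, 1) pair per hotel.
    _hotel, state = record
    yield state, 1


def shuffle(pairs):
    # Group the intermediate values by key before they reach the reducer.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    # Combine all values of one key into the final result for that key.
    return key, sum(values)


# Made-up sample rows, not taken from the actual dataset.
records = [("Hotel A", "Goa"), ("Hotel B", "Kerala"), ("Hotel C", "Goa")]

intermediate = [pair for record in records for pair in map_fn(record)]
result = dict(reduce_fn(key, values) for key, values in shuffle(intermediate).items())
```

Here `intermediate` corresponds to the intermediate output of the map function, and `result` to the final output of the reduce function.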


Fig 2.6 Working of MapReduce

2.3.3 Yet Another Resource Negotiator (YARN)

Apache YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. YARN was introduced in Hadoop 2.x. It allows different data processing engines, such as graph processing, interactive processing, stream processing and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN also performs job scheduling.

2.3.3.1 YARN Architecture

YARN consists of three components –

• Resource Manager

• Node Manager

• Application Master

Fig 2.7 Architecture of YARN

2.4 Apache Hive

Apache Hive is an open-source data warehouse built on top of Hadoop. It is used for querying and analyzing large amounts of data (structured or semi-structured) stored in HDFS. Writing MapReduce jobs by hand is complex, but with Hive data can be analyzed using SQL-like queries. Hive uses a SQL-like language known as Hive Query Language (HQL). The queries written by the programmer are automatically translated into MapReduce jobs.

2.4.1 Hive Architecture

Apache Hive consists of the following components -

• Metastore – It stores metadata for each of the tables, such as their schema and location.

• Driver – It acts like a controller which receives the HiveQL statements. The driver starts

the execution of the statement by creating sessions. It monitors the life cycle and progress

of the execution. Driver stores the necessary metadata generated during the execution of a

HiveQL statement.

• Compiler - It compiles the HiveQL query and converts it into an execution plan. The compiler first converts the query to an Abstract Syntax Tree (AST), checks it for compatibility and compile-time errors, and then converts the AST to a Directed Acyclic Graph (DAG).

• Optimizer – It performs various transformations on the execution plan to provide

optimized DAG.

• Executor – Once compilation and optimization complete, the executor executes the tasks.

Executor takes care of pipelining the tasks.

• CLI, UI, and Thrift Server – CLI (command-line interface) provides a user interface for

an external user to interact with Hive. Thrift server in Hive allows external clients to

interact with Hive over a network, similar to the JDBC or ODBC protocols.

2.4.2 Hive Partitioning

Partitioning is a way of grouping the related parts of a table based on the values of some particular

columns like city, date etc. Using partitioning a large table is divided into small partitions such that

querying on this data becomes easy.

Hive uses two types of partitioning –

• Static Partitioning

• Dynamic Partitioning
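
The idea behind partitioning can be sketched in Python. This is only an illustration and not Hive itself: the function name `partition_rows` and the sample rows are assumptions. The keys imitate the `column=value` sub-directories that Hive creates for a partitioned table, and because the partition is derived from each row's own value, the sketch is closest to dynamic partitioning.

```python
from collections import defaultdict


def partition_rows(rows, column):
    # One "column=value" group per distinct value, like Hive's partition folders.
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return dict(partitions)


# Made-up sample rows, not taken from the actual dataset.
rows = [
    {"name": "Hotel A", "state": "Goa"},
    {"name": "Hotel B", "state": "Kerala"},
    {"name": "Hotel C", "state": "Goa"},
]

partitions = partition_rows(rows, "state")

# A query restricted to one state only needs to read that partition.
goa_hotels = [row["name"] for row in partitions["state=Goa"]]
```

The benefit is visible in the last line: rows of other states are never scanned when the query filters on the partition column.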

2.4.3 Bucketing in Hive

Bucketing in Hive segregates the data of a Hive table into multiple files, which makes querying more efficient.

• The data present in a partition can be divided further into buckets.

• The division is based on the hash of the columns selected for bucketing in the table.

• A hashing algorithm is applied to each record at the back end to decide which bucket the record is placed into.

• In Hive, bucketing has to be enabled by using set hive.enforce.bucketing=true;
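
The hash-based assignment of records to buckets can be sketched as follows. This is a simplified Python illustration, not Hive's implementation: the names `bucket_for` and `bucketize`, the bucket count of 3 and the use of the integer id itself as the hash value are assumptions.

```python
NUM_BUCKETS = 3   # assumed; in Hive this is given by CLUSTERED BY ... INTO n BUCKETS


def bucket_for(hotel_id, num_buckets=NUM_BUCKETS):
    # Stand-in hash: the integer id itself, reduced modulo the bucket count.
    return hotel_id % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Place the id of every row into the bucket chosen by the hash.
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row["id"], num_buckets)].append(row["id"])
    return buckets


rows = [{"id": i} for i in range(1, 7)]   # made-up hotel ids 1..6
buckets = bucketize(rows)
```

Because the same id always maps to the same bucket, a lookup or join on the bucketing column only has to read one bucket file instead of the whole table.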

2.5 Apache Pig

Apache Pig is an abstraction over MapReduce. It is used to analyze large sets of data and to represent them as data flows. Pig is generally used with Hadoop, and the data manipulation operations in Hadoop can be performed using Pig.

2.5.1 Pig Latin

Pig offers a high-level language for writing data analysis programs, called Pig Latin. Pig Latin is a procedural programming language and fits very naturally in the pipeline paradigm. Using Pig, complex queries involving many joins and filters can be performed easily.

2.5.2 Basic Types of Data Models in Pig:

Pig has 4 basic types of data models. They are as follows:

• Atom – A simple atomic data value. It is stored as a string but can be used as either a string or a number.

• Tuple – An ordered set of fields.

• Bag – A collection of tuples.

• Map – A set of key-value pairs.
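
For readers more familiar with Python, the four data models can be compared with rough Python analogues. This is not Pig Latin syntax, and the sample values are made up; note also that a Pig bag is unordered, whereas the Python list used here keeps its order.

```python
atom = "Goa"                                  # Atom: a single value
hotel = ("Hotel A", "Goa", 5)                 # Tuple: an ordered set of fields
bag = [hotel, ("Hotel B", "Kerala", 3)]       # Bag: a collection of tuples
ratings = {"food": 4.5, "clean": 4.0}         # Map: key-value pairs

# Fields of a tuple are accessed by position, map values by key.
five_star = [name for name, _state, stars in bag if stars == 5]
food_rating = ratings["food"]
```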

2.5.3 Apache Pig Architecture

Architecture of Apache Pig is given by the below diagram -

Fig 2.8 Apache Pig Architecture

2.6 Power BI

Power BI is a Data Visualization and Business Intelligence (BI) tool provided by Microsoft that allows the analysis and visualization of data through different visuals. Datasets from different data sources can be imported into Power BI and converted into interactive dashboards and BI reports.

2.6.1 Power BI Products

Power BI provides a set of products, each of which has its own features. The different Power BI products are –

• Power BI Desktop - The Windows-desktop-based application for PCs and desktops,

primarily for designing and publishing reports to the Service.

• Power BI Service - The SaaS (software as a service) based online service (formerly known

as Power BI for Office 365, now referred to as PowerBI.com or simply Power BI).

• Power BI Mobile Apps - The Power BI Mobile apps for Android and iOS devices, as well

as for Windows phones and tablets.

• Power BI Report Server - An on-premises Power BI reporting solution for companies that won't or can't store data in the cloud-based Power BI Service.

2.6.2 DAX (Data Analysis Expression) in Power BI

DAX (Data Analysis Expressions) is a formula expression language that can be used in different BI and visualization tools. DAX is also known as a function language, where the full code is kept inside a function.

DAX functions can be grouped into the following categories –

• Aggregation functions

• Counting functions

• Logical functions

• Information functions

• Text functions

• Date functions

2.6.3 Tableau vs Power BI

Table 2.3 Difference between Tableau and Power BI

Feature | Tableau | Power BI
Data Visualization | Tableau provides strong data visualization and is one of the main data visualization tools in the market. | Power BI provides a strong backend data manipulation feature with access to simple visualizations.
Size of Dataset | Tableau can connect to much larger datasets as compared to Power BI. | Power BI has a limit of 1 GB of data in the free version.
Costing | Tableau is more expensive as compared to Power BI. | Power BI also comes with a free version.
Implementation | Tableau provides different implementation types as per organizational needs. Deployment at an enterprise level follows a step-wise process which lasts for weeks. | Power BI uses cloud storage and includes a simple implementation process.

CHAPTER 3

RESULTS AND DISCUSSIONS


3.1 Introduction

This section provides the detailed description of the project “Data Analysis of Goibibo and Make

My Trip” which includes the detail of the dataset, queries applied using Pig and Hive for the

analysis of data and then visualization of the result of those queries using Power BI.

3.2 Agenda behind the Project

As the number of hotels is increasing, it has become necessary to analyze the data of these hotels. The main aim of this project is to produce useful insights from the available data by analyzing it with different tools, and to generate a better visualization of that data that can help all sorts of users.

Here, we have analyzed the hotel booking datasets of two popular websites, Goibibo and MakeMyTrip.

3.3 Dataset Description

We collected the data from http://www.kaggle.com, a popular website that provides datasets to students as well as professionals working in the field of Data Science.

The following data files were collected:

• editgoibibo.csv

This file contains details of hotels listed in dataset of goibibo.

Table 3.1 Description of editgoibibo.csv

Variable | Data Type | Description
Hotel_id | Int | Unique id of the hotel.
Property name | String | Name of the hotel.
Property Type | String | Category of the property.
State | String | City and State in which the property is situated.
Guest recommendation | Int | Number of guest recommendations.
Star rating | Int (0-5) | Star rating of the hotel.
Latitude and Longitude | Float | Used to locate the property on a map.
Other ratings | Float (0-5) | Used to differentiate between hotels.

• editmake.csv

This file contains the details of hotels listed in make my trip dataset.

Table 3.2 Description of editmake.csv

Variable | Data Type | Description
Hotel_id | Int | Unique id of the hotel.
Property name | String | Name of the hotel.
Property Type | String | Category of the property.
City | String | City in which the property is situated.
Hotel star rating | Int (0-5) | Hotels can be 5-star hotels or 1-star hotels.
Latitude and Longitude | Float | Used to locate the property on a map.

3.4 Tables in Excel
• Goibibo dataset

Fig 3.1 Dataset of Goibibo in Excel

• Make My Trip Dataset

Fig 3.2 Dataset of Make My Trip in Excel

3.5 Flow Chart of Project

1. Download the data from kaggle.com

2. Send the data to the Hadoop cluster using WinSCP

3. Use HiveQL to create the database tables

4. Analyse the data in the form of query results

5. Export the results from Hive to MS Excel

6. Use Power BI to show the results

Fig 3.3 Flow Chart of Project

3.6 Queries Applied

3.6.1 For Goibibo dataset

3.6.1.1 Queries done using Hive

• To show the number of hotels for each star rating for 10 states.

➢ create table countstarrating(state string, c_hotel int, h_star int);

➢ insert into table countstarrating select state, count(id), h_star from goibibo group by state, h_star;

➢ select * from countstarrating;

Result of the above query is –

Fig 3.4 Count of hotels of different ratings in different states

Visualization of the above result in Power BI is shown in the figure below –

Fig 3.5 Count of hotels of different star ratings of 10 cities

• To find out the count of each property type.

➢ create table c_ptype (p_type string, c_ptype int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

➢ insert into table c_ptype select p_type, count(*) from goibibo group by p_type;

➢ select * from c_ptype;

Result of the above query is -


Fig 3.6 Count of different Property Types

Below figure shows the visualization of this result in Power BI –

Fig 3.7 Graph showing count of different Property Types

• Which state has maximum number of hotels.

➢ CREATE TABLE go_statemaxh(state string, countH int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

➢ insert into table go_statemaxh select state, count(*) as c from goibibo group by state order by c desc limit 1;

➢ select * from go_statemaxh;

• All average ratings according to state

➢ create table allavg_state(state string, site float, service float, amenity float, food float, location float, clean float) row format delimited fields terminated by '\t';

➢ insert overwrite table allavg_state select state, round(avg(site),2), round(avg(service),2), round(avg(amenity),2), round(avg(food),2), round(avg(location),2), round(avg(clean),2) from goibibo1 group by state;

Fig 3.8 Average of Food and Drinks ratings of hotels of 5 states
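
What the allavg_state query computes for a single rating column can be mirrored in plain Python. The sketch below is not part of the project code: the helper name `average_by` and the sample rows are assumptions. It groups rows by state and rounds the average food rating to two decimals, as `round(avg(food),2)` with `group by state` does in the query.

```python
from collections import defaultdict


def average_by(rows, key, column):
    # key value -> [running sum, row count]
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        totals[row[key]][0] += row[column]
        totals[row[key]][1] += 1
    return {k: round(total / count, 2) for k, (total, count) in totals.items()}


# Made-up sample rows, not taken from the actual dataset.
rows = [
    {"state": "Goa", "food": 4.0},
    {"state": "Goa", "food": 4.5},
    {"state": "Kerala", "food": 3.5},
]

avg_food = average_by(rows, "state", "food")
```

The same helper could be called once per rating column (site, service, and so on) to reproduce the full row that the Hive query writes for each state.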

• All average ratings according to h_star

➢ create table allavg_hstar(h_star int, site float, service float, amenity float, food float, location float, clean float) row format delimited fields terminated by '\t';

➢ insert overwrite table allavg_hstar select h_star, round(avg(site),2), round(avg(service),2), round(avg(amenity),2), round(avg(food),2), round(avg(location),2), round(avg(clean),2) from goibibo1 group by h_star;

Fig 3.9 Average of Food and Drinks rating for hotels of different ratings

• Count of p_type for each state.

➢ create table count__by_ptypeState (state string , p_type string, count int);

➢ insert overwrite table count__by_ptypeState select state, p_type, count(p_name) from goibibo1 group by state, p_type;

➢ select * from count__by_ptypeState;


Fig 3.10 Count of property types for Andhra Pradesh

3.6.1.2 Queries done using Pig

• Count the number of hotels in each state that have a guest recommendation of 100 and a rating of 5 in every category, and find the state with the most such hotels.

➢ goibibo = LOAD '/editgo_tab.txt' USING PigStorage('\t') as (id:int, p_name:chararray, p_type:chararray, city:chararray, state:chararray, guest:int, h_star:int, latitude:float, longitude:float, review:int, c_rev:int, site:float, service:float, amenities:float, food:float, location:float, clean:float);

➢ f = filter goibibo by guest==100 and site==5 and service==5 and amenities==5 and food==5 and location==5 and clean==5;

➢ gp = group f by state;

➢ c = foreach gp generate group , COUNT(f.p_name) as count ;

➢ gp_c = group c all;

➢ max = foreach gp_c {
    m = order c by count desc;
    l = limit m 1;
    generate FLATTEN(l);
};

➢ store max into '/max_hotels_with_5_100';
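
The Pig script above works in four steps: filter, group by state, count, and then order and limit to keep the top state. The following Python sketch mirrors those steps on made-up records. The helper names `make_hotel`, `is_perfect` and `top_state`, as well as the sample data, are hypothetical.

```python
from collections import Counter

RATING_FIELDS = ("site", "service", "amenities", "food", "location", "clean")


def make_hotel(state, guest, rating):
    """Build a hypothetical record where every rating field has the same value."""
    hotel = {"state": state, "guest": guest}
    hotel.update({field: rating for field in RATING_FIELDS})
    return hotel


# Hypothetical records; not real data.
hotels = [
    make_hotel("Goa", 100, 5),
    make_hotel("Goa", 100, 5),
    make_hotel("Goa", 90, 5),      # fails the guest == 100 condition
    make_hotel("Kerala", 100, 5),
    make_hotel("Kerala", 100, 4),  # fails the all-ratings-5 condition
]


def is_perfect(hotel):
    """Same condition as the Pig filter: guest == 100 and every rating == 5."""
    return hotel["guest"] == 100 and all(hotel[f] == 5 for f in RATING_FIELDS)


def top_state(hotels):
    """Filter, count per state, and keep the state with the highest count."""
    counts = Counter(h["state"] for h in hotels if is_perfect(h))
    return counts.most_common(1)[0] if counts else None


print(top_state(hotels))  # prints ('Goa', 2)
```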

• Which state has the maximum number of 5-star hotels.

➢ filter_5 = filter goibibo by h_star ==5 ;

➢ gp = group filter_5 by state;

➢ count= foreach gp generate group as state, COUNT(filter_5.p_name) as count;

➢ gp_all = group count all;

➢ max_state = foreach gp_all {
    m = order count by count desc;
    l = limit m 1;
    generate FLATTEN(l);
};

➢ store max_state into '/State_max_5_hotel';

3.6.2 For Make My Trip dataset

3.6.2.1 Queries done using Hive

• Which city has the maximum number of hotels.

➢ CREATE TABLE mmt_citymaxh(city string, countH int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

➢ insert into table mmt_citymaxh select city, count(*) as c from mmt group by city order by c desc limit 1;

➢ select * from mmt_citymaxh;

• To find out the count of each property type (p_type).

➢ CREATE TABLE mmt11(p_type string, count int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

➢ insert into table mmt11 select p_type,count(*) from mmt group by p_type;

➢ select * from mmt11;

Fig 3.11 Count of different property types

3.6.2.2 Queries done using Pig

• Count of hotels for each star rating (and property type) in 10 cities.

➢ make = load '/editmake.csv' using PigStorage('\t') as (id:int, p_name:chararray, p_type:chararray, city:chararray, h_star:int, latitude:double, longitude:double, mmt_review_score:float, mmt_tripadvisor_count:int, room_types:chararray, site_review_rating:float);

➢ gp = group make by (city,p_type,h_star);

➢ count_by_ptype_rating = foreach gp generate FLATTEN(group) as (city,p_type,h_star), COUNT(make.p_name) as count;

➢ store count_by_ptype_rating into '/count_by_ptype_rating';
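
In the script above, grouping by three fields makes the group key a tuple, and FLATTEN(group) unpacks that tuple back into separate city, p_type and h_star columns next to the count. A rough Python analogue is sketched below. The sample tuples and the function name `count_by_city_ptype_star` are hypothetical.

```python
from collections import Counter

# Hypothetical (city, p_type, h_star) values, one per hotel; not real data.
hotels = [
    ("Jaipur", "Hotel", 3),
    ("Jaipur", "Hotel", 3),
    ("Jaipur", "Resort", 5),
    ("Goa", "Hotel", 3),
]


def count_by_city_ptype_star(hotels):
    """Count hotels per (city, p_type, h_star) and return flat 4-column rows."""
    counts = Counter(hotels)  # the key is the whole tuple, like Pig's `group`
    # Unpack each key into separate columns, as FLATTEN(group) does.
    return sorted(
        (city, p_type, h_star, n)
        for (city, p_type, h_star), n in counts.items()
    )


for row in count_by_city_ptype_star(hotels):
    print(row)
```

Without the flattening step, each output row would hold the key as one nested tuple followed by the count.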

• Which property type has the maximum number of 5-star hotels.

➢ filter_5 = filter make by h_star ==5 ;

➢ gp = group filter_5 by p_type;

➢ count_hotels = foreach gp generate group,COUNT(filter_5.p_name);

➢ store count_hotels into '/count_hotels';

• Which city has the maximum number of 5-star hotels.

➢ filter_5 = filter make by h_star ==5 ;

➢ gp = group filter_5 by city;

➢ count= foreach gp generate group as city, COUNT(filter_5.p_name) as count;

➢ gp_all = group count all;

➢ max_city = foreach gp_all {
    m = order count by count desc;
    l = limit m 1;
    generate FLATTEN(l);
};

CHAPTER 4

FUTURE SCOPE AND CONCLUSIONS


This analysis is mainly useful for two types of users. The first are regular customers who use online booking websites such as Goibibo and Make My Trip to book hotels. Using this analysis, they can compare hotels by property type as well as by ratings. The second are business users: hotel owners can compare their hotels with other hotels in their area and identify what improvements they need to make to provide better customer satisfaction and earn higher ratings.

In addition to these datasets (Goibibo and Make My Trip), data from other online booking websites can also be used. Analyzing that additional data could produce more accurate results. Besides this, a comparison can also be made between these websites.

We can also apply machine learning (ML) and artificial intelligence (AI) algorithms in our project to improve its efficiency and to produce more useful results.

Thus, big data analysis is not only important but also a necessity. In fact, many organizations that have implemented Big Data are realizing a significant competitive advantage over organizations with no Big Data efforts. The analysis of hotel data is important because it gives an easy way to compare hotels in a particular area, and the visualization of the analyzed data presents the desired information to users in an easier form so that they can benefit from it.


