
G.L. Bajaj Institute of Technology & Management, Greater Noida

ANNUAL HEALTH SURVEY

A Training Report Submitted to the
Department of Information Technology
In Partial Fulfilment of the Requirements
For the Award of the Degree of
BACHELOR OF TECHNOLOGY

By:
AAKASH JUNEJA (1319213001)

Under the Guidance of
Mr. Akhilesh Singh

ACKNOWLEDGEMENT
I would like to express my deepest appreciation to all those who made it possible for me to complete this report. Special gratitude goes to our final-year project manager, Mr. Hemant Rai, whose stimulating suggestions and encouragement helped me coordinate my project, especially in writing this report. Furthermore, I would like to acknowledge the crucial role of the staff of the Department of Information Technology, who gave permission to use all the required equipment and materials needed to complete the task. Special thanks go to my project guide, Mr. Akhilesh Singh, who invested his full effort in guiding me towards the goal. I also appreciate the guidance given by the other supervisors and the panel members, especially during our project presentation; their comments and advice improved our presentation skills.

Aakash Juneja

Abstract
This project is based on the Annual Health Survey (AHS) Combined Household Information dataset, which contains data from all three rounds of the AHS. The survey is conducted in the eight EAG states (Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand and Uttar Pradesh) and Assam. Despite being restricted to nine states, the AHS is the largest demographic survey in the world, covering two and a half times the population of the Sample Registration System. The project analyses this data and generates reports from it.

Table of Contents
1.0 Introduction
2.0 System Requirements
    2.1 Use of Hadoop
    2.2 Use of Pig
    2.3 Use of Hive
    2.4 Use of R
    2.5 Use of RStudio
3.0 Procedure and Result
4.0 References
5.0 Bibliography

1.0 Introduction
The AHS (Combined Household Information) dataset contains data from all three rounds of the survey: Baseline, First Updating Round and Second Updating Round. The survey is conducted in the eight EAG states (Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand and Uttar Pradesh) and Assam. During the Baseline Survey in 2010-11, a total of 20.1 million people and 4.14 million households were covered; during the First Updating Round in 2011-12, 20.61 million people and 4.28 million households; and during the Second Updating Round (the third and final round) in 2012-13, 20.94 million people and 4.32 million households. Despite being restricted to nine states, the AHS is the largest demographic survey in the world and covers two and a half times the population of the Sample Registration System.

The data includes indicators such as: whether a usual resident; date of birth; age; religion; social group; marital status; date of first marriage; whether attending school; highest educational qualification attained; occupation/activity status during the last 365 days; whether having any form of disability; type of treatment for injury; type of illness; source of treatment; symptoms pertaining to illness persisting for more than one month; whether medical care was sought; diagnoses and their sources; whether getting regular treatment; whether the person chews tobacco, smokes or consumes alcohol; status and type of structure of the house; ownership status of the house; source of drinking water; whether the household treats the water in any way to make it safer to drink; toilet facility; household electricity; main source of lighting; main source of fuel used for cooking; number of dwelling rooms; availability of a kitchen; possession of radio/transistor, television, computer/laptop, telephone/mobile phone, washing machine, refrigerator, sewing machine, bicycle, motor scooter/moped, car/jeep/van, tractor, water pump, tube well and cart; land possessed; residential status; coverage by any health scheme or health insurance; and status of the household.

Analysis and visualization require extraction, cleaning and mining of the data, which is finally presented as reports using a reporting tool such as R or Tableau.

To contribute towards Digital India, this project analyses the AHS data from three different aspects:

1. Analysis of symptoms of illness by age group.
2. Analysis of household possessions, urban vs. rural.
3. Analysis of symptoms of illness by source of drinking water.

2.0 System Requirements

Minimum system requirements:

- A machine with at least a 3rd-generation Core i3 CPU and 4 GB RAM
- Linux/Ubuntu OS
- JDK 1.8.91
- Hadoop 2.7.1: for data storage and processing
- Hive 1.2.1 / Pig 0.15: to perform data cleaning, extraction, transformation and analysis
- R / RStudio: to perform the final analysis and reporting
- The AHS dataset

2.1 Use of Hadoop


Apache Hadoop is an open-source software framework for distributed storage and distributed
processing of very large data sets on computer clusters built from commodity hardware. All
the modules in Hadoop are designed with a fundamental assumption that hardware failures
are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File
System (HDFS), and a processing part called MapReduce. Hadoop splits files into large
blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers
packaged code for nodes to process in parallel based on the data that needs to be processed.
This approach takes advantage of data locality (nodes manipulate the data they have access to), allowing the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
The base Apache Hadoop framework is composed of the following modules:

- Hadoop Common: contains libraries and utilities needed by other Hadoop modules;
- Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
- Hadoop YARN: a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and
- Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.

The term Hadoop has come to refer not just to the base modules above, but also to the
ecosystem, or collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix,
Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache
Oozie, Apache Storm.
Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on
their MapReduce and Google File System.
The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command line utilities written as shell scripts. Though MapReduce Java
code is common, any programming language can be used with "Hadoop Streaming" to
implement the "map" and "reduce" parts of the user's program. Other projects in the Hadoop
ecosystem expose richer user interfaces.
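Since any language can be used via Hadoop Streaming, the map and reduce steps of the classic word-count job can be sketched in Python. This is only an illustrative sketch: locally, `sorted()` stands in for Hadoop's shuffle/sort phase, and in a real job the two functions would live in separate scripts passed to the streaming jar.

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reducer(pairs):
    # Reduce step: pairs arrive sorted by key (Hadoop's shuffle/sort
    # guarantees this), so equal words are adjacent; sum each group.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

# Locally we imitate the shuffle/sort phase with sorted(); in a real job
# these would be two scripts wired up via the -mapper and -reducer
# options of hadoop-streaming.jar.
counts = dict(reducer(sorted(mapper(["hadoop streaming demo", "hadoop demo"]))))
```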
2.2 Use of Pig
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce,
Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java
MapReduce idiom into a notation which makes MapReduce programming high level, similar
to that of SQL for RDBMSs. Pig Latin can be extended using User Defined Functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
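The LOAD / FOREACH ... GENERATE pattern used in the procedure sections of this report amounts to reading delimited records and projecting a subset of columns. As a rough illustration (not Pig's actual implementation; the helper name and column indices are made up), the same projection can be sketched in plain Python:

```python
import csv
import io

def foreach_generate(lines, columns):
    # Rough equivalent of: y = FOREACH x GENERATE $2, $0;
    # i.e. keep only the given 0-based column positions, in order.
    reader = csv.reader(lines)
    for row in reader:
        yield [row[i] for i in columns]

sample = io.StringIO("a,b,c,d\n1,2,3,4\n")
# Each output row keeps only columns $2 and $0, in that order.
rows = list(foreach_generate(sample, columns=[2, 0]))
```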
2.3 Use of Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. Initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file
systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL with
schema on read and transparently converts queries to MapReduce, Apache Tez and Spark
jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides
indexes, including bitmap indexes. Other features of Hive include:

- Indexing to provide acceleration; index types include compaction and bitmap index as of 0.10, with more index types planned.
- Different storage types such as plain text, RCFile, HBase, ORC, and others.
- Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
- Operating on compressed data stored in the Hadoop ecosystem using algorithms including DEFLATE, BWT, Snappy, etc.
- Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
- SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.
Four file formats are supported in Hive: TEXTFILE, SEQUENCEFILE, ORC and RCFILE. Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting at 0.13. Additional Hive plugins support querying of the Bitcoin blockchain.
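Because HiveQL is SQL-like, its grouping behaviour can be tried out locally with SQLite as a stand-in before running a full Hadoop job. The table and columns below are hypothetical toy data, not the AHS schema; the point is only that a HiveQL GROUP BY aggregates like standard SQL, just compiled to MapReduce/Tez/Spark jobs:

```python
import sqlite3

# A toy table standing in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE households (locality TEXT, has_tv TEXT)")
conn.executemany(
    "INSERT INTO households VALUES (?, ?)",
    [("Rural", "Yes"), ("Rural", "No"), ("Urban", "Yes"), ("Urban", "Yes")],
)

# Equivalent in spirit to a HiveQL query such as:
#   SELECT locality, has_tv, COUNT(*) FROM households GROUP BY locality, has_tv;
rows = conn.execute(
    "SELECT locality, has_tv, COUNT(*) FROM households "
    "GROUP BY locality, has_tv ORDER BY locality, has_tv"
).fetchall()
```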
2.4 Use of R
R is a language and environment for statistical computing and graphics. It is a GNU project
which is similar to the S language and environment which was developed at Bell Laboratories
(formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be
considered as a different implementation of S. There are some important differences, but
much code written for S runs unaltered under R.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity.
One of R's strengths is the ease with which well-designed, publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
R is available as free software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.

The R environment
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:

- an effective data handling and storage facility,
- a suite of operators for calculations on arrays, in particular matrices,
- a large, coherent, integrated collection of intermediate tools for data analysis,
- graphical facilities for data analysis and display, either on-screen or in hardcopy, and
- a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.

The term "environment" is intended to characterize it as a fully planned and coherent system,
rather than an incremental accretion of very specific and inflexible tools, as is frequently the
case with other data analysis software.
R, like S, is designed around a true computer language, and it allows users to add additional
functionality by defining new functions. Much of the system is itself written in the R dialect
of S, which makes it easy for users to follow the algorithmic choices made. For
computationally-intensive tasks, C, C++ and Fortran code can be linked and called at run
time. Advanced users can write C code to manipulate R objects directly.
Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution, and many more are available through the CRAN family of Internet sites, covering a very wide range of modern statistics.
R has its own LaTeX-like documentation format, which is used to supply comprehensive
documentation, both on-line in a number of formats and in hardcopy.
2.5 Use of RStudio

RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics. RStudio was founded by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is the Chief Scientist at RStudio.
RStudio is available in two editions: RStudio Desktop, where the program is run locally as a
regular desktop application; and RStudio Server, which allows accessing RStudio using a
web browser while it is running on a remote Linux server. Prepackaged distributions of
RStudio Desktop are available for Windows, OS X, and Linux.
RStudio is available in open source and commercial editions and runs on the desktop
(Windows, OS X, and Linux) or in a browser connected to RStudio Server or RStudio Server
Pro (Debian, Ubuntu, Red Hat Linux, CentOS).
RStudio is written in the C++ programming language and uses the Qt framework for its
graphical user interface.
Work on RStudio started in around December 2010, and the first public beta version (v0.92) was officially announced in February 2011.

3.0 Procedure and Result


A. Steps to perform the first analysis, i.e. symptoms of illness by age group:
1. Load the data into a Pig variable by running the command:
x = load 'hdfs://localhost:9000/aakash/ahs/firozabad/ahscombuttar_pradeshfirozabad.csv' using PigStorage(',');
2. Now extract the required columns from the data:
y = foreach x GENERATE $7,$5,$58,$60,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72;
3. After extracting the required columns into another variable, store it into a new file:
store y into 'hdfs://localhost:9000/aakash/ahs/firozabad';
4. Now start Hive. Command: $ hive
5. Create a database for this purpose (if you want). Command: hive> create database ahs;
6. Create a table into which to load the extracted file for analysis. Command:
hive> create table final_dataset(agegroup string, common_dis_count int, total_population int, total_ill int, common_disease string)
row format delimited fields terminated by '\t' stored as textfile;
7. Load the data into that table.
8. Now execute a query to perform the required analysis and store the result into a table in Hive. Command:
hive> insert into table final_dataset select M.agegroup, M.count, T.totalsum, P.totalill,
Q.commondisease from temp M LEFT OUTER JOIN (select agegroup, sum(count)
totalsum from final group by agegroup) T on (M.agegroup=T.agegroup) LEFT
OUTER JOIN (select agegroup, sum(count) totalill from final where symptoms!='No
Symptoms of chronic diseases' and symptoms!='Asymptomatic' group by agegroup) P
on (M.agegroup = P.agegroup) LEFT OUTER JOIN
(select R.agegroup,S.symptoms commondisease from temp R LEFT OUTER JOIN
final S on R.count = S.count) Q on (M.agegroup = Q.agegroup);

Final Data Set of Analysis 1
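The aggregation that the Hive query performs can be sketched in plain Python: for each age group, count the total population, count the people with a real symptom, and pick the most common symptom. The record layout below is a toy assumption, not the actual AHS schema:

```python
from collections import Counter, defaultdict

# Sentinel values that do not count as illness, matching the query's filters.
NOT_ILL = {"No Symptoms of chronic diseases", "Asymptomatic"}

def summarise(records):
    # records: (agegroup, symptom) pairs, one per person.
    total = Counter()
    ill = Counter()
    symptoms = defaultdict(Counter)
    for agegroup, symptom in records:
        total[agegroup] += 1
        if symptom not in NOT_ILL:
            ill[agegroup] += 1
            symptoms[agegroup][symptom] += 1
    # For each age group: (total population, total ill, most common symptom).
    return {
        g: (total[g], ill[g],
            symptoms[g].most_common(1)[0][0] if symptoms[g] else None)
        for g in total
    }
```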

9. Copy the analysed Hive output file to the local file system.

R commands:
10. Now load the file into RStudio:
analysis1 <- read.csv("/home/aakash/Desktop/one.csv", header = T)
11. Represent this in the form of a bar chart.
Converting the data frame into a numeric matrix:
temp <- analysis1
analysis1 <- analysis1[,-5]
analysis1 <- analysis1[,-1]
m <- as.matrix(t(analysis1))
Setting the scientific notation option:
getOption("scipen")
opt <- options("scipen" = 20)
Plotting the bar plot:
barplot(m, names.arg = temp$agegroup, beside = T, col = c('red','blue','green'))
barplot(m, names.arg = temp$agegroup, beside = T, col = c('red','blue','green'),
        legend("topright", c("Common Disease Count","Total Population","Total ill"),
               cex = 0.75, fill = c('red','blue','green')))

AGE GROUP    COMMON SYMPTOMS
0-10         ENT problems/diseases
11-20        ENT problems/diseases
21-30        ENT problems/diseases
31-40        ENT problems/diseases
41-50        ENT problems/diseases
51-60        ENT problems/diseases
61-70        Diseases of musculo-skeletal system
71-80        Diseases of musculo-skeletal system
81-90        Diseases of musculo-skeletal system
91-100       Diseases of musculo-skeletal system

B. Steps to perform the second analysis, i.e. household possessions, urban vs. rural:
1. Load the data into a Pig variable by running the command:
x = load 'hdfs://localhost:9000/aakash/ahs/firozabad/ahscombuttar_pradeshfirozabad.csv' using PigStorage(',');
2. Now extract the required columns from the data:
y = foreach x GENERATE $7,$5,$58,$60,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72;
3. After extracting the required columns into another variable, store it into a new file:
store y into 'hdfs://localhost:9000/aakash/ahs/firozabad';
4. Now start Hive. Command: $ hive
5. Create a database for this purpose (if you want).
6. Create a table into which to load the extracted file for analysis:
hive> create table grouped_data(locality string, have_or_not string, count int, type string)
row format delimited fields terminated by '\t' stored as textfile;
7. Load the data into that table.
8. Now execute a query to perform the required analysis and store the result into a file in Hive:
hive> select * from
(SELECT rural, household_have_electricity HH, COUNT(psu_id), 'ELECTRICITY' Flag FROM project.allcities group by rural,household_have_electricity
UNION ALL
SELECT rural,is_radio HH,COUNT(psu_id),'RADIO' Flag FROM project.allcities
group by rural,is_radio
UNION ALL
SELECT rural,is_television HH,COUNT(psu_id),'TELEVISION' Flag FROM
project.allcities group by rural,is_television
UNION ALL
SELECT rural,
CASE WHEN is_computer = 'With Internet connection' THEN 'Yes'
WHEN is_computer = 'Without Internet connection' THEN 'Yes'
ELSE is_computer END AS HH,COUNT(psu_id),'COMPUTER' Flag FROM
project.allcities group by rural,is_computer
UNION ALL
SELECT rural,
CASE WHEN is_telephone = 'Both' THEN 'Yes' WHEN is_telephone = 'Mobile
Phone only' THEN 'Yes' WHEN is_telephone = 'Telephone only' THEN 'Yes'
ELSE is_telephone END AS HH,COUNT(psu_id),'TELEPHONE' Flag FROM
project.allcities group by rural,is_telephone
UNION ALL
SELECT rural,is_washing_machine HH,COUNT(psu_id),'WASHING_M' Flag
FROM project.allcities group by rural,is_washing_machine
UNION ALL
SELECT rural,is_refrigerator HH,COUNT(psu_id),'REFRIGERATOR' Flag FROM
project.allcities group by rural,is_refrigerator
UNION ALL
SELECT rural,is_sewing_machine HH,COUNT(psu_id),'SEWING_M' Flag FROM
project.allcities group by rural,is_sewing_machine
UNION ALL
SELECT rural,is_bicycle HH,COUNT(psu_id),'BICYCLE' Flag FROM
project.allcities group by rural,is_bicycle
UNION ALL
SELECT rural,is_scooter HH,COUNT(psu_id),'SCOOTER' Flag FROM
project.allcities group by rural,is_scooter
UNION ALL
SELECT rural,is_car HH,COUNT(psu_id),'CAR' Flag FROM project.allcities group
by rural,is_car
UNION ALL
SELECT rural,is_tractor HH,COUNT(psu_id),'TRACTOR' Flag FROM
project.allcities group by rural,is_tractor)tmp;
9. Creating table final and final2
hive>create table final(type string,rural int);
hive>create table final2(type string,rural int);
10. Inserting Data into both the tables
hive>insert into table final

select type, sum(count) from


(select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'TELEPHONE'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'RADIO'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'WASHING_M'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'COMPUTER'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'ELECTRICITY'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'REFRIGERATOR'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'TELEVISION'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'SEWING_M'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'SCOOTER'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'BICYCLE'
UNION ALL
select type, count from grouped_data where locality = 'Rural' and have_or_not = 'Yes'
and type = 'TRACTOR')tmp group by type;
hive>insert into table final2
select type, sum(count) from
(select type, count from grouped_data where locality = 'Urban' and have_or_not =
'Yes' and type = 'TELEPHONE'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not =
'Yes' and type = 'RADIO'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'WASHING_M'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'COMPUTER'
UNION ALL

select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'ELECTRICITY'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not =
'Yes' and type = 'REFRIGERATOR'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'TELEVISION'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'SEWING_M'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'SCOOTER'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'BICYCLE'
UNION ALL
select type, count from grouped_data where locality = 'Urban' and have_or_not = 'Yes'
and type = 'TRACTOR')tmp group by type;
11. Calculating total count of Rural and Urban
hive>select count(*), rural from project.allcities where rural='Rural' or rural= 'Urban'
group by rural;

12. Creating the final_dataset table:
hive> create table final_dataset(type string, rural int, urban int, rpercent float, upercent float)
row format delimited fields terminated by ',' stored as textfile;
hive> insert into table final_dataset
select A.type, A.rural, B.urban, (A.rural/2552386)*100, (B.urban/450326)*100 from final A
LEFT OUTER JOIN
final2 B
on (A.type = B.type);

                Urban       Rural
TV              100.00%     50.00%
CAR             70.00%      10.00%
BIKE            40.00%      90.00%
Electricity     100.00%     100.00%
Cooking Fuel    100.00%     90.00%

Final Result of Analysis 2
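The rpercent/upercent figures come from dividing each asset's 'Yes' count by the total household count for that locality. A minimal sketch of that computation in plain Python (the records below are toy values, not the actual Firozabad counts):

```python
from collections import Counter

def possession_percentages(records):
    # records: (locality, asset_type, has_it) triples, one per household
    # and asset, where has_it is 'Yes' or 'No'.
    have = Counter()    # 'Yes' answers per (locality, asset)
    totals = Counter()  # all answers per (locality, asset)
    for locality, asset, has_it in records:
        totals[(locality, asset)] += 1
        have[(locality, asset)] += has_it == "Yes"
    # Percentage of households possessing each asset, per locality.
    return {key: round(100.0 * have[key] / totals[key], 2) for key in totals}
```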

13. Copy the analysed Hive output file to the local file system.

R commands:
14. Now load the file into RStudio:
analysis3 <- read.csv("/home/aakash/Desktop/Final_Dataset.csv", header = T)
temp <- analysis3
15. Represent this in the form of a bar chart.
Removing the non-numeric columns and converting to a matrix:
analysis3 <- analysis3[,-1]
View(analysis3)
analysis3 <- analysis3[,-1]
analysis3 <- analysis3[,-1]
m <- as.matrix(t(analysis3))
Plotting the bar plot:
barplot(m, names.arg = temp$TYPE, beside = T, col = c('red','blue'),
        legend("topright", c("Rural","Urban"), cex = 0.75, fill = c('red','blue')))

C. Steps to perform the third analysis, i.e. symptoms of illness by source of drinking water:
1. Load the data into a Pig variable by running the command:
x = load 'hdfs://localhost:9000/aakash/ahs/firozabad/ahscombuttar_pradeshfirozabad.csv' using PigStorage(',');
2. Now extract the required columns from the data:
y = foreach x GENERATE $7,$5,$58,$60,$62,$63,$64,$65,$66,$67,$68,$69,$70,$71,$72;
3. After extracting the required columns into another variable, store it into a new file:
store y into 'hdfs://localhost:9000/aakash/ahs/firozabad';
4. Now start Hive. Command: $ hive
5. Create a database for this purpose (if you want).
6. Create a table into which to load the extracted file for analysis:
hive> create table a3 (psu_id string, symptoms_pertaining_illness string, drinking_water_source string)
row format delimited fields terminated by ',' stored as textfile;
7. Load the data into that table.

8. Now execute queries to perform the required analysis and store the results into Hive tables.
Inserting Required Data into a3
hive>insert into table a3 select psu_id , symptoms_pertaining_illness,
drinking_water_source from default.allcities;
Creating table a3_grouped
hive>create table a3_grouped (total int,symptoms_pertaining_illness
string,drinking_water_source string);
Inserting Data into table a3_grouped
hive>insert into table a3_grouped select count(*), symptoms_pertaining_illness ,
drinking_water_source from a3 where drinking_water_source!='NA' and
drinking_water_source!='drinking_water_source' group by
drinking_water_source, symptoms_pertaining_illness;
Calculating the symptom with the maximum count for each source of water:
hive>select max(total), drinking_water_source from a3_grouped where
symptoms_pertaining_illness!='Asymptomatic' and symptoms_pertaining_illness!
='No Symptoms of chronic diseases' and
symptoms_pertaining_illness!='NA' group by drinking_water_source;
hive>create table maxcount(source string, maxx int);
hive>insert into table maxcount select drinking_water_source,max(total) from
a3_grouped where symptoms_pertaining_illness!='Asymptomatic' and
symptoms_pertaining_illness!='No Symptoms of chronic diseases' and
symptoms_pertaining_illness!='NA' group by drinking_water_source;
Calculating the total count of people who use each water source:
hive>select sum(total), drinking_water_source from a3_grouped group by
drinking_water_source;
Finding Symptoms with Maximum diseases
hive>select maxcount.source,A.symptoms_pertaining_illness
from maxcount
LEFT OUTER JOIN a3_grouped A on (maxcount.maxx = A.total and
maxcount.source = A.drinking_water_source);
hive> create table temp(source string, symptoms string);

hive> insert into table temp select maxcount.source, A.symptoms_pertaining_illness
from maxcount
LEFT OUTER JOIN
a3_grouped A on (maxcount.maxx = A.total and maxcount.source = A.drinking_water_source);
Creating table Final_dataset
hive>create table final_dataset(source string,symptoms string,max int,total
int,percent float)
row format delimited fields terminated by ',' stored as textfile;
hive>insert into table final_dataset select M.source,A.symptoms, M.maxx, T.total,
(M.maxx/T.total)*100 from maxcount M
LEFT OUTER JOIN
(select sum(total) total, drinking_water_source from a3_grouped group by
drinking_water_source) T on (M.source = T.drinking_water_source)
LEFT OUTER JOIN
temp A on (M.source = A.source);

Final Result of Analysis 3
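The chain of Hive tables above boils down to: for each drinking-water source, find the most frequent illness symptom (ignoring the 'Asymptomatic', 'No Symptoms of chronic diseases' and 'NA' values) and its share of all people using that source. A plain-Python sketch of that logic, using toy records rather than the AHS schema:

```python
from collections import Counter, defaultdict

# Values the Hive queries filter out before taking the maximum.
EXCLUDED = {"Asymptomatic", "No Symptoms of chronic diseases", "NA"}

def worst_symptom_by_source(records):
    # records: (drinking_water_source, symptom) pairs, one per person.
    per_source = defaultdict(Counter)
    totals = Counter()
    for source, symptom in records:
        totals[source] += 1
        if symptom not in EXCLUDED:
            per_source[source][symptom] += 1
    # For each source: (most common symptom, its count, percent of all users).
    result = {}
    for source, counts in per_source.items():
        symptom, maxx = counts.most_common(1)[0]
        result[source] = (symptom, maxx, round(100.0 * maxx / totals[source], 2))
    return result
```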

9. Copy the analysed Hive output file to the local file system.

R commands:
10. Now load the file into RStudio:
analysis2 <- read.csv("/home/aakash/Desktop/final1.csv")
11. Represent this in the form of a bar chart.
Changing into a matrix:
temp <- analysis2
analysis2 <- analysis2[,-1]
m <- as.matrix(t(analysis2))
Plotting:
barplot(m, names.arg = temp$Water.Source, col = c("blue"))

4.0 References
https://en.wikipedia.org/
http://www.apache.org/

5.0 Bibliography
http://www.data.gov.in/
