
Apache Hive

The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop™, it provides:

  Tools to enable easy data extract/transform/load (ETL)
  A mechanism to impose structure on a variety of data formats
  Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
  Query execution via MapReduce

Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, this language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions (UDFs), aggregations (UDAFs), and table functions (UDTFs).

Hive does not require that data be read or written in a "Hive format"; there is no such thing. Hive works equally well on Thrift, control-delimited, or your own specialized data formats. Please see File Format and SerDe in the Developer Guide for details.

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with the MapReduce framework and UDF/UDAF/UDTF), fault tolerance, and loose coupling with its input formats.

General Information about Hive


  Getting Started
  Presentations and Papers about Hive
  A List of Sites and Applications Powered by Hive
  FAQ
  hive-users Mailing List
  Hive IRC Channel: #hive on irc.freenode.net
  About This Wiki

User Documentation
  Hive Tutorial
  HiveQL Language Manual (Queries, DML, DDL, and CLI)
  Hive Operators and Functions
  Hive Web Interface
  Hive Client (JDBC, ODBC, Thrift, etc)
  HiveServer2 Client
  Hive Change Log
  Avro SerDe

Administrator Documentation
  Installing Hive
  Configuring Hive
  Setting Up Metastore
  Setting Up Hive Web Interface
  Setting Up Hive Server (JDBC, ODBC, Thrift, etc.)
  Hive on Amazon Web Services
  Hive on Amazon Elastic MapReduce

Resources for Contributors


  Hive Developer FAQ
  How to Contribute
  Hive Contributors Meetings
  Hive Developer Guide
  Plugin Developer Kit
  Unit Test Parallel Execution
  Hive Performance
  Hive Architecture Overview
  Hive Design Docs
  Roadmap/Call to Add More Features
  Full-Text Search over All Hive Resources
  Becoming a Committer
  How to Commit
  How to Release
  Build Status on Jenkins (Formerly Hudson)
  Project Bylaws

For more information, please see the official Hive website. Apache Hive, Apache Hadoop, Apache HBase, Apache HDFS, Apache, the Apache feather logo, and the Apache Hive project logo are trademarks of The Apache Software Foundation.


Table of Contents

  Installation and Configuration
    Requirements
    Installing Hive from a Stable Release
    Building Hive from Source
    Compile hive on hadoop 23
    Running Hive
    Configuration management overview
    Runtime configuration
    Hive, Map-Reduce and Local-Mode
    Error Logs
  DDL Operations
    Metadata Store
  DML Operations
  SQL Operations
    Example Queries
      SELECTS and FILTERS
      GROUP BY
      JOIN
      MULTITABLE INSERT
      STREAMING
  Simple Example Use Cases
    MovieLens User Ratings
    Apache Weblog Data

DISCLAIMER: Hive has so far been tested only on Unix (Linux) and Mac systems using Java 1.6, although it may well work on other similar platforms. It does not work on Cygwin. Most of our testing has been on Hadoop 0.20, so we advise running it against this version even though it may compile and work against other versions.

Installation and Configuration


Requirements
  Java 1.6
  Hadoop 0.20.x

Installing Hive from a Stable Release


Start by downloading the most recent stable release of Hive from one of the Apache download mirrors (see Hive Releases).

Next you need to unpack the tarball. This will result in the creation of a subdirectory named hive-x.y.z:

$ tar -xzvf hive-x.y.z.tar.gz

Set the environment variable HIVE_HOME to point to the installation directory:

$ cd hive-x.y.z
$ export HIVE_HOME=$(pwd)

Finally, add $HIVE_HOME/bin to your PATH:

$ export PATH=$HIVE_HOME/bin:$PATH

Building Hive from Source


The Hive SVN repository is located here: http://svn.apache.org/repos/asf/hive/trunk

$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package
$ cd build/dist
$ ls
README.txt
bin/ (all the shell scripts)
lib/ (required jar files)
conf/ (configuration files)
examples/ (sample input and query files)

In the rest of the page, we use build/dist and <install-dir> interchangeably.

Compile hive on hadoop 23


$ svn co http://svn.apache.org/repos/asf/hive/trunk hive
$ cd hive
$ ant clean package -Dhadoop.version=0.23.3 -Dhadoop-0.23.version=0.23.3 -Dhadoop.mr.rev=23
$ ant clean package -Dhadoop.version=2.0.0-alpha -Dhadoop-0.23.version=2.0.0-alpha -Dhadoop.mr.rev=23

Running Hive
Hive uses Hadoop, which means you must either have hadoop in your path or set HADOOP_HOME:

$ export HADOOP_HOME=<hadoop-install-dir>

In addition, you must create /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and set them chmod g+w in HDFS before a table can be created in Hive.

Commands to perform this setup:

$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

I also find it useful, but not necessary, to set HIVE_HOME:

$ export HIVE_HOME=<hive-install-dir>

To use the hive command line interface (cli) from the shell:

$ $HIVE_HOME/bin/hive

Configuration management overview


Hive's default configuration is stored in <install-dir>/conf/hive-default.xml. Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml.
The location of the Hive configuration directory can be changed by setting the HIVE_CONF_DIR environment variable.
Log4j configuration is stored in <install-dir>/conf/hive-log4j.properties.
Hive configuration is an overlay on top of Hadoop, meaning the Hadoop configuration variables are inherited by default.

Hive configuration can be manipulated by:

  Editing hive-site.xml and defining any desired variables (including Hadoop variables) in it
  From the cli using the set command (see below)
  By invoking hive using the syntax:
    $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
  this sets the variables x1 and x2 to y1 and y2 respectively
  By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above
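The overlay idea described above can be pictured as a series of dictionaries where later layers win. The following is only a rough sketch: the function names are hypothetical, the sample values are made up, and the assumption that -hiveconf values take precedence over hive-site.xml is ours rather than something stated on this page.

```python
import shlex


def parse_hiveconf(args):
    """Collect '-hiveconf key=value' pairs from a command-line style string."""
    tokens = shlex.split(args)
    overrides = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == "-hiveconf" and i + 1 < len(tokens):
            key, _, value = tokens[i + 1].partition("=")
            overrides[key] = value
            i += 2
        else:
            i += 1
    return overrides


def effective_config(hadoop_conf, hive_default, hive_site, cli_overrides):
    """Merge the layers in order; each layer overrides the ones before it (assumed order)."""
    merged = {}
    for layer in (hadoop_conf, hive_default, hive_site, cli_overrides):
        merged.update(layer)
    return merged


# Hypothetical values, for illustration only.
conf = effective_config(
    {"mapred.reduce.tasks": "1"},
    {"hive.exec.scratchdir": "/tmp/hive"},
    {"mapred.reduce.tasks": "4"},
    parse_hiveconf("-hiveconf x1=y1 -hiveconf x2=y2"),
)
```

Here the Hadoop value of mapred.reduce.tasks is inherited unless hive-site.xml redefines it, and x1 and x2 come from the command line, the same string that HIVE_OPTS would carry.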

Runtime configuration
Hive queries are executed as map-reduce jobs and, therefore, the behavior of such queries can be controlled by the Hadoop configuration variables. The cli command 'SET' can be used to set any Hadoop (or Hive) configuration variable. For example:

hive> SET mapred.job.tracker=myhost.mycompany.com:50030;
hive> SET -v;

The latter shows all the current settings. Without the -v option, only the variables that differ from the base Hadoop configuration are displayed.

Hive, Map-Reduce and Local-Mode


Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted to the MapReduce cluster indicated by the variable: mapred.job.tracker

While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers an option to run map-reduce jobs locally on the user's workstation. This can be very useful for running queries over small data sets; in such cases local mode execution is usually significantly faster than submitting jobs to a large cluster. Data is accessed transparently from HDFS. Conversely, local mode only runs with one reducer and can be very slow when processing larger data sets.

Starting with release 0.7, Hive fully supports local mode execution. To enable it, set the following option:

hive> SET mapred.job.tracker=local;

In addition, mapred.local.dir should point to a path that is valid on the local machine (for example /tmp/<username>/mapred/local). Otherwise, the user will get an exception when allocating local disk space.

Starting with release 0.7, Hive can also run map-reduce jobs in local mode automatically. The relevant option is:

hive> SET hive.exec.mode.local.auto=false;

The value shown is the default, so this feature is disabled by default; set the option to true to enable it. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:

  The total input size of the job is lower than hive.exec.mode.local.auto.inputbytes.max (128MB by default)
  The total number of map tasks is less than hive.exec.mode.local.auto.tasks.max (4 by default)
  The total number of reduce tasks required is 1 or 0

So for queries over small data sets, or for queries with multiple map-reduce jobs where the input to subsequent jobs is substantially smaller (because of reduction/filtering in the prior job), jobs may be run locally. Note that there may be differences in the runtime environment of hadoop server nodes and the machine running the hive client (because of different jvm versions or different software libraries). This can cause unexpected behavior/errors while running in local mode. Also note that local mode execution is done in a separate, child jvm (of the hive client). If the user so wishes, the maximum amount of memory for this child jvm can be controlled via the option hive.mapred.local.mem. By default, it's set to zero, in which case Hive lets Hadoop determine the default memory limits of the child jvm.
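The three thresholds for automatic local mode can be written down as a single predicate. This is a minimal sketch of the rule as described above, not Hive's actual code; the function and parameter names are hypothetical, and 128MB is taken to mean 128 * 1024 * 1024 bytes.

```python
MB = 1024 * 1024


def may_run_locally(input_bytes, map_tasks, reduce_tasks,
                    inputbytes_max=128 * MB, tasks_max=4):
    """Return True when a job meets all three thresholds described above."""
    return (
        input_bytes < inputbytes_max      # total input size below the limit
        and map_tasks < tasks_max         # fewer map tasks than the limit
        and reduce_tasks in (0, 1)        # at most one reduce task
    )


small_job = may_run_locally(10 * MB, 2, 1)
large_job = may_run_locally(200 * MB, 2, 1)
```

Even when the predicate holds, the text only says Hive "may" run the job locally, so treat a True result as eligibility rather than a guarantee.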

Error Logs
Hive uses log4j for logging. By default, logs are not emitted to the console by the CLI. The default logging level is WARN and the logs are stored in the file:

/tmp/<user.name>/hive.log

If the user wishes, the logs can be emitted to the console by adding the arguments shown below:

bin/hive -hiveconf hive.root.logger=INFO,console

Alternatively, the user can change only the logging level by using:

bin/hive -hiveconf hive.root.logger=INFO,DRFA

Note that setting hive.root.logger via the 'set' command does not change logging properties, since they are determined at initialization time.

Hive also stores query logs on a per-session basis in /tmp/<user.name>/; this location can be configured in hive-site.xml with the hive.querylog.location property.

Logging during Hive execution on a Hadoop cluster is controlled by the Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.

When using local mode (mapred.job.tracker=local), Hadoop/Hive execution logs are produced on the client machine itself. Starting with release 0.6, Hive uses hive-exec-log4j.properties (falling back to hive-log4j.properties only if it is missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under /tmp/<user.name>. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.

Error logs are very useful for debugging problems. Please send them with any bugs (of which there are many!) to hive-dev@hadoop.apache.org.

DDL Operations
Creating Hive tables and browsing through them:

hive> CREATE TABLE pokes (foo INT, bar STRING);

Creates a table called pokes with two columns, the first being an integer and the other a string.

hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.

By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).

hive> SHOW TABLES;

Lists all the tables.

hive> SHOW TABLES '.*s';

Lists all the tables that end with 's'. The pattern matching follows Java regular expressions. See http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html for documentation.

hive> DESCRIBE invites;

Shows the list of columns.

As for altering tables, table names can be changed and additional columns can be added:

hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;

Dropping tables:

hive> DROP TABLE pokes;

Metadata Store
Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db

Right now, in the default configuration, this metadata can only be seen by one user at a time.

The metastore can be stored in any database that is supported by JPOX. The location and the type of the RDBMS can be controlled by the two variables javax.jdo.option.ConnectionURL and javax.jdo.option.ConnectionDriverName. Refer to the JDO (or JPOX) documentation for more details on supported databases. The database schema is defined in the JDO metadata annotations file package.jdo at src/contrib/hive/metastore/src/model.

In the future, the metastore itself can be a standalone server.

If you want to run the metastore as a network server so it can be accessed from multiple nodes, try HiveDerbyServerMode.

DML Operations
Loading data from flat files into Hive:

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

Loads a file that contains two columns separated by ctrl-a into the pokes table. 'local' signifies that the input file is on the local file system. If 'local' is omitted, Hive looks for the file in HDFS.

The keyword 'overwrite' signifies that existing data in the table is deleted. If the 'overwrite' keyword is omitted, data files are appended to existing data sets.

NOTES:

  NO verification of data against the schema is performed by the load command.
  If the file is in HDFS, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option hive.metastore.warehouse.dir in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.

hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');

The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.

hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

The above command will load data from an HDFS file/directory into the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.

SQL Operations
Example Queries
Some example queries are shown below. They are available in build/dist/examples/queries. More are available in the hive sources at ql/src/test/queries/positive

SELECTS and FILTERS


hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

Selects column 'foo' from all rows of partition ds=2008-08-15 of the invites table. The results are not stored anywhere, but are displayed on the console.

Note that in all the examples that follow, INSERT (into a Hive table, local directory or HDFS directory) is optional.

hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';

Selects all rows from partition ds=2008-08-15 of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory.

NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.

Partitioned tables must always have a partition selected in the WHERE clause of the statement.

hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;

Selects all rows from the pokes table into a local directory.

hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;

The last statement computes the sum of a column. avg, min, and max can also be used. Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

GROUP BY
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;

MULTITABLE INSERT
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
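One way to picture the multi-table insert above is as a single pass over src in which every row is tested against each destination's WHERE clause and projected accordingly. The Python below is only an illustration of that reading, with hypothetical names and made-up rows; it says nothing about how Hive actually plans the job.

```python
def multi_insert(rows, branches):
    """Route each source row to every destination whose predicate it satisfies."""
    outputs = {name: [] for name, _, _ in branches}
    for row in rows:  # the source is read once
        for name, predicate, project in branches:
            if predicate(row):
                outputs[name].append(project(row))
    return outputs


# Made-up sample rows standing in for the src table.
src = [{"key": k, "value": "val_%d" % k} for k in (50, 150, 250, 350)]

# One entry per INSERT clause: destination, WHERE predicate, SELECT projection.
branches = [
    ("dest1", lambda r: r["key"] < 100, lambda r: (r["key"], r["value"])),
    ("dest2", lambda r: 100 <= r["key"] < 200, lambda r: (r["key"], r["value"])),
    ("dest3", lambda r: 200 <= r["key"] < 300, lambda r: (r["key"],)),
    ("dest4", lambda r: r["key"] >= 300, lambda r: (r["value"],)),
]

result = multi_insert(src, branches)
```

Each destination receives only the columns named in its own SELECT clause, which is why dest3 holds keys only and dest4 holds values only.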

STREAMING
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';

This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (please see the Hive Tutorial for examples).

Simple Example Use Cases


MovieLens User Ratings
First, create a table with tab-delimited text file format:

CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Then, download and extract the data files:

wget http://www.grouplens.org/system/files/ml-data.tar+0.gz
tar xvzf ml-data.tar+0.gz

And load it into the table that was just created:

LOAD DATA LOCAL INPATH 'ml-data/u.data' OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:

SELECT COUNT(*) FROM u_data;

Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Now we can do some complex data analysis on the table u_data.

Create weekday_mapper.py:

import sys
import datetime

# Read tab-separated rows from stdin and replace the Unix time with the ISO weekday.
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))

Use the mapper script:

CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).

Apache Weblog Data


The format of the Apache weblog is customizable, but most webmasters use the default. For the default Apache weblog, we can create a table with the following command.

More about RegexSerDe can be found here: http://issues.apache.org/jira/browse/HIVE-662

add jar ../build/contrib/hive_contrib.jar;

CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
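To see how a space-delimited pattern of this kind splits a log line into the nine columns, it can help to try it outside Hive first. The sketch below uses Python's re module instead of Java's regex engine, writes the pattern with [^ ] character classes and a bracketed time field (our reading of the intended expression), assumes the whole line must match, and uses a made-up sample line; parse_line is a hypothetical helper.

```python
import re

# Space-delimited access-log pattern, written as Python raw strings.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)

COLUMNS = ("host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent")


def parse_line(line):
    """Return a column -> value dict, or None when the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Made-up sample line in common log format (no referer or agent fields).
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
row = parse_line(sample)
```

Because the last two groups sit inside an optional non-capturing group, lines without referer and agent still match and those two columns come back empty.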

Hive User FAQ


  Hive User FAQ
    I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
    How to change the warehouse.dir location for older tables?
    When running a JOIN query, I see out-of-memory errors.
    I am using MySQL as metastore and I see errors: "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"
    Does Hive support Unicode?
  HiveQL
    Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?
    What are the maximum allowed lengths for HiveQL identifiers?
  Importing Data into Hive
    How do I import XML data into Hive?
    How do I import CSV data into Hive?
    How do I import JSON data into Hive?
    How do I import Thrift data into Hive?
    How do I import Avro data into Hive?
    How do I import delimited text data into Hive?
    How do I import fixed-width data into Hive?
    How do I import ASCII logfiles (HTTP, etc) into Hive?
  Exporting Data from Hive
  Hive Data Model
    What is the difference between a native table and an external table?
    What are dynamic partitions?
    Can a Hive table contain data in more than one format?
    Is it possible to set the data format on a per-partition basis?
  JDBC Driver
    Does Hive have a JDBC Driver?
  ODBC Driver
    Does Hive have an ODBC driver?

I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
Run the following commands:

cd ~/.ant/cache/hadoop/core/sources
wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz

How to change the warehouse.dir location for older tables?


To change the base location of the Hive tables, edit the hive.metastore.warehouse.dir param. This will not affect the older tables. Metadata needs to be changed in the database (MySQL or Derby). The location of Hive tables is in table SDS and column LOCATION.

When running a JOIN query, I see out-of-memory errors.


This is usually caused by the order of the JOIN tables. Instead of "FROM tableA a JOIN tableB b ON ...", try "FROM tableB b JOIN tableA a ON ...". NOTE that if you are using LEFT OUTER JOIN, you might want to change it to RIGHT OUTER JOIN. This trick usually solves the problem. The rule of thumb is to always put the table that has many rows with the same value in the join key on the rightmost side of the JOIN.

I am using MySQL as metastore and I see errors: "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure"
This is usually caused by the MySQL server closing connections that have been idle for some time. Running the following command on the MySQL server will solve the problem:

set global wait_timeout=120;

When using MySQL as a metastore I see the error "com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes".
This is a known limitation of MySQL 5.0 and UTF8 databases. One option is to use another character set, such as 'latin1', which is known to work.

Does Hive support Unicode?

HiveQL
Are HiveQL identifiers (e.g. table names, column names, etc) case sensitive?
No. Hive is case insensitive. Executing:

SELECT * FROM MyTable WHERE myColumn = 3

is strictly equivalent to:

select * from mytable where mycolumn = 3

What are the maximum allowed lengths for HiveQL identifiers?

Importing Data into Hive

How do I import XML data into Hive?
How do I import CSV data into Hive?
How do I import JSON data into Hive?
How do I import Thrift data into Hive?
How do I import Avro data into Hive?
How do I import delimited text data into Hive?
How do I import fixed-width data into Hive?
How do I import ASCII logfiles (HTTP, etc) into Hive?

Exporting Data from Hive

Hive Data Model


What is the difference between a native table and an external table?
What are dynamic partitions?
Can a Hive table contain data in more than one format?
Is it possible to set the data format on a per-partition basis?

JDBC Driver

Does Hive have a JDBC Driver?


Yes. Look for the hive-jdbc jar. The driver is 'org.apache.hadoop.hive.jdbc.HiveDriver'.

It supports two modes: a local mode and a remote one.

In the remote mode it connects to the Hive server through its Thrift API. The JDBC URL to use should be of the form 'jdbc:hive://hostname:port/databasename'.

In the local mode Hive is embedded. The JDBC URL to use should be 'jdbc:hive://'.

ODBC Driver
Does Hive have an ODBC driver?

Hive Tutorial
  Concepts
    What is Hive
    What Hive is NOT
    Data Units
    Type System
      Primitive Types
      Complex Types
    Built in operators and functions
      Built in operators
      Built in functions
    Language capabilities
  Usage and Examples
    Creating Tables
    Browsing Tables and Partitions
    Loading Data
    Simple Query
    Partition Based Query
    Joins
    Aggregations
    Multi Table/File Inserts
    Dynamic-partition Insert
    Inserting into local files
    Sampling
    Union all
    Array Operations
    Map(Associative Arrays) Operations
    Custom map/reduce scripts
    Co-Groups
    Altering Tables
    Dropping Tables and Partitions

Concepts
What is Hive
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

What Hive is NOT


Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur substantial overheads in job submission and scheduling. As a result, latency for Hive queries is generally very high (minutes) even when the data sets involved are very small (say a few hundred megabytes). Hive therefore cannot be compared with systems such as Oracle, where analyses are conducted on a significantly smaller amount of data but proceed much more iteratively, with response times between iterations of less than a few minutes. Hive aims to provide acceptable (but not optimal) latency for interactive data browsing, queries over small data sets, or test queries.

Hive is not designed for online transaction processing and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of immutable data (like web logs).

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.

Data Units
In order of granularity, Hive data is organized into:

Databases: Namespaces that separate tables and other data units to avoid naming conflicts.

Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could consist of the following columns (schema):

  timestamp - which is of INT type and corresponds to a Unix timestamp of when the page was viewed.
  userid - which is of BIGINT type and identifies the user who viewed the page.
  page_url - which is of STRING type and captures the location of the page.
  referer_url - which is of STRING type and captures the location of the page from which the user arrived at the current page.
  IP - which is of STRING type and captures the IP address from which the page request was made.

Partitions: Each table can have one or more partition keys which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy certain criteria. For example, a table could have a date_partition of type STRING and a country_partition of type STRING. Each unique value of the partition keys defines a partition of the table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience, but it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns; they are not part of the data itself but are derived on load.

Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. Buckets can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
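The bucketing idea, a hash of one column deciding which bucket a row lands in, can be sketched in a few lines. This is an illustration only: zlib.crc32 is a stand-in hash chosen for determinism and is not the hash function Hive uses, and bucket_for and sample_bucket are hypothetical names.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a column value to a bucket index in [0, num_buckets)."""
    digest = zlib.crc32(str(value).encode("utf-8"))  # stand-in hash, not Hive's
    return digest % num_buckets


def sample_bucket(rows, column, bucket, num_buckets):
    """Return only the rows whose bucketing column falls into the given bucket."""
    return [r for r in rows if bucket_for(r[column], num_buckets) == bucket]


# Made-up page_views rows, bucketed by userid into 4 buckets.
page_views = [{"userid": uid, "page_url": "/page/%d" % uid} for uid in range(100)]
one_bucket = sample_bucket(page_views, "userid", 0, 4)
```

Because a given userid always hashes to the same bucket, reading a single bucket yields a consistent sample of users without scanning the other buckets.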

Type System
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:

  Integers
    TINYINT - 1 byte integer
    SMALLINT - 2 byte integer
    INT - 4 byte integer
    BIGINT - 8 byte integer
  Boolean type
    BOOLEAN - TRUE/FALSE
  Floating point numbers
    FLOAT - single precision
    DOUBLE - double precision
  String type
    STRING - sequence of characters in a specified character set

The types are organized in the following hierarchy (where the parent is a supertype of all the children instances):

  Type
    Primitive Type
      Number
        DOUBLE
          BIGINT
            INT
              TINYINT
          FLOAT
            INT
              TINYINT
      STRING
      BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from a child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from these fundamental rules for implicit conversion based on the type system, Hive also allows one special case for conversion:

  <STRING> to <DOUBLE>

Explicit type conversion can be done using the cast operator as shown in the Built in functions section below.
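The implicit conversion rule (a type converts to any of its ancestors, plus the STRING to DOUBLE special case) can be expressed as a small lookup. The sketch below assumes the hierarchy reads as TINYINT under INT, INT under both BIGINT and FLOAT, and BIGINT and FLOAT under DOUBLE; the names PARENTS, ancestors and implicitly_convertible are hypothetical.

```python
# Child -> direct parents, following the hierarchy as read above (an assumption).
PARENTS = {
    "TINYINT": ["INT"],
    "INT": ["BIGINT", "FLOAT"],
    "BIGINT": ["DOUBLE"],
    "FLOAT": ["DOUBLE"],
    "DOUBLE": ["Number"],
    "Number": ["Primitive Type"],
    "STRING": ["Primitive Type"],
    "BOOLEAN": ["Primitive Type"],
    "Primitive Type": ["Type"],
}


def ancestors(type_name):
    """Collect every ancestor of a type by walking the parent links."""
    seen = set()
    stack = list(PARENTS.get(type_name, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def implicitly_convertible(source, target):
    """True if source may be implicitly converted to target under the stated rules."""
    if source == target:
        return True
    if (source, target) == ("STRING", "DOUBLE"):  # the documented special case
        return True
    return target in ancestors(source)


widening_ok = implicitly_convertible("TINYINT", "DOUBLE")
narrowing_ok = implicitly_convertible("DOUBLE", "INT")
```

Anything the function rejects, such as DOUBLE to INT, would need an explicit cast.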

Complex Types
Complex types can be built up from primitive types and other composite types using:

  Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
  Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
  Arrays (indexable lists): the elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:
  gender - which is a STRING.
  active - which is a BOOLEAN.

Built in operators and functions


Built in operators
Relational Operators - The following operators compare the passed operands and generate a TRUE or FALSE value depending on whether the comparison between the operands holds.

Relational Operator    Operand types          Description
A = B                  all primitive types    TRUE if expression A is equivalent to expression B, otherwise FALSE
A != B                 all primitive types    TRUE if expression A is not equivalent to expression B, otherwise FALSE
A < B                  all primitive types    TRUE if expression A is less than expression B, otherwise FALSE
A <= B                 all primitive types    TRUE if expression A is less than or equal to expression B, otherwise FALSE
A > B                  all primitive types    TRUE if expression A is greater than expression B, otherwise FALSE
A >= B                 all primitive types    TRUE if expression A is greater than or equal to expression B, otherwise FALSE
A IS NULL              all types              TRUE if expression A evaluates to NULL, otherwise FALSE
A IS NOT NULL          all types              FALSE if expression A evaluates to NULL, otherwise TRUE
A LIKE B               strings                TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'
A RLIKE B              strings                TRUE if any substring of A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE and so does 'foobar' RLIKE '^f.*r$'
A REGEXP B             strings                Same as RLIKE
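The LIKE rules above (_ for one character, % for any run of characters, \ as the escape) can be sketched by translating the pattern into an ordinary regular expression. This Python sketch is an illustration of the described semantics, not Hive's code; the function names are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern into an equivalent Python regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # A backslash makes the next character literal (\% is a plain %).
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "_":
            parts.append(".")    # exactly one character
        elif ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        else:
            parts.append(re.escape(ch))
        i += 1
    return "".join(parts)


def like(value, pattern):
    """Evaluate `value LIKE pattern`; the whole string has to match."""
    regex = like_to_regex(pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None
```

The key design point is the use of a full match: LIKE compares the entire string against the pattern, which is why 'foobar' LIKE 'foo' is FALSE.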

Arithmetic Operators - The following operators support various common arithmetic operations on the operands. All of them return number types.

Arithmetic Operator    Operand types       Description
A + B                  all number types    Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands, e.g. since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
A - B                  all number types    Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B                  all number types    Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B                  all number types    Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
A % B                  all number types    Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B                  all number types    Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B                  all number types    Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B                  all number types    Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A                     all number types    Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

Logical Operators - The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

Logical Operator    Operand types    Description
A AND B             boolean          TRUE if both A and B are TRUE, otherwise FALSE
A && B              boolean          Same as A AND B
A OR B              boolean          TRUE if either A or B or both are TRUE, otherwise FALSE
A | B               boolean          Same as A OR B
NOT A               boolean          TRUE if A is FALSE, otherwise FALSE
!A                  boolean          Same as NOT A

Operators on Complex Types - The following operators provide mechanisms to access elements in complex types.

Operator    Operand types                          Description
A[n]        A is an Array and n is an int          returns the nth element in the array A. The first element has index 0, e.g. if A is an array comprising ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'
M[key]      M is a Map<K, V> and key has type K    returns the value corresponding to the key in the map, e.g. if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'
S.x         S is a struct                          returns the x field of S, e.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.

Built in functions
The following built in functions are supported in Hive (function list in source code: FunctionRegistry.java):

Return Type        Function Name (Signature)                  Description
BIGINT             round(double a)                            returns the rounded BIGINT value of the double
BIGINT             floor(double a)                            returns the maximum BIGINT value that is equal to or less than the double
BIGINT             ceil(double a)                             returns the minimum BIGINT value that is equal to or greater than the double
double             rand(), rand(int seed)                     returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic.
string             concat(string A, string B,...)             returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string             substr(string A, int start)                returns the substring of A starting from the start position till the end of string A. For example, substr('foobar', 4) results in 'bar'
string             substr(string A, int start, int length)    returns the substring of A starting from the start position with the given length, e.g. substr('foobar', 4, 2) results in 'ba'
string             upper(string A)                            returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR'
string             ucase(string A)                            Same as upper
string             lower(string A)                            returns the string resulting from converting all characters of A to lower case, e.g. lower('fOoBaR') results in 'foobar'
string             lcase(string A)                            Same as lower
string             trim(string A)                             returns the string resulting from trimming spaces from both ends of A, e.g. trim(' foobar ') results in 'foobar'
string             ltrim(string A)                            returns the string resulting from trimming spaces from the beginning (left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '
string             rtrim(string A)                            returns the string resulting from trimming spaces from the end (right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'
string             regexp_replace(string A, string B, string C)    returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'
int                size(Map<K.V>)                             returns the number of elements in the map type
int                size(Array<T>)                             returns the number of elements in the array type
value of <type>    cast(<expr> as <type>)                     converts the result of the expression expr to <type>, e.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed.
string             from_unixtime(int unixtime)                converts the number of seconds from the unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00"
string             to_date(string timestamp)                  returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
int                year(string date)                          returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
int                month(string date)                         returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
int                day(string date)                           returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string             get_json_object(string json_string, string path)    extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It returns null if the input json string is invalid
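The substr examples above imply that the start position is counted from 1 rather than 0. The following Python sketch mirrors the documented examples for substr and regexp_replace; it is a hypothetical stand-in written for illustration (positive start positions only, and Python's re module in place of Java regular expressions), not the Hive functions themselves.

```python
import re


def substr(a, start, length=None):
    """Mimic substr(A, start[, length]) for a 1-based, positive start."""
    begin = start - 1  # convert the 1-based position to a Python index
    if length is None:
        return a[begin:]
    return a[begin:begin + length]


def regexp_replace(a, pattern, replacement):
    """Replace every match of `pattern` in `a` with `replacement`."""
    return re.sub(pattern, replacement, a)
```

With these definitions, substr('foobar', 4) gives 'bar', substr('foobar', 4, 2) gives 'ba', and regexp_replace('foobar', 'oo|ar', '') gives 'fb', matching the examples in the table.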

The following built in aggregate functions are supported in Hive:

Return Type    Aggregation Function Name (Signature)                      Description
BIGINT         count(*), count(expr), count(DISTINCT expr[, expr_.])      count(*) - Returns the total number of retrieved rows, including rows containing NULL values; count(expr) - Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE         sum(col), sum(DISTINCT col)                                returns the sum of the elements in the group or the sum of the distinct values of the column in the group
DOUBLE         avg(col), avg(DISTINCT col)                                returns the average of the elements in the group or the average of the distinct values of the column in the group
DOUBLE         min(col)                                                   returns the minimum value of the column in the group
DOUBLE         max(col)                                                   returns the maximum value of the column in the group
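The three forms of count differ only in how they treat NULL and duplicate values. The Python sketch below models that difference on a small in-memory list of rows, with None standing in for NULL; the function names and SAMPLE_ROWS are hypothetical and exist only for this illustration.

```python
# None plays the role of NULL in this illustration.
SAMPLE_ROWS = [
    {"userid": 1, "gender": "F"},
    {"userid": 2, "gender": None},
    {"userid": 1, "gender": "F"},
    {"userid": 3, "gender": "M"},
]


def count_star(rows):
    """count(*): every retrieved row, NULLs included."""
    return len(rows)


def count_expr(rows, column):
    """count(expr): rows where the expression is non-NULL."""
    return sum(1 for row in rows if row[column] is not None)


def count_distinct(rows, *columns):
    """count(DISTINCT ...): distinct value combinations with no NULL in them."""
    seen = set()
    for row in rows:
        values = tuple(row[column] for column in columns)
        if all(value is not None for value in values):
            seen.add(values)
    return len(seen)
```

On SAMPLE_ROWS, count_star sees four rows, count_expr on gender skips the NULL row, and count_distinct on gender collapses the repeated 'F'.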

Language capabilities
Hive query language provides the basic SQL-like operations. These operations work on tables or partitions. These operations are:
  Ability to filter rows from a table using a where clause.
  Ability to select certain columns from the table using a select clause.
  Ability to do equi-joins between two tables.
  Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
  Ability to store the results of a query into another table.
  Ability to download the contents of a table to a local (e.g., nfs) directory.
  Ability to store the results of a query in a hadoop dfs directory.
  Ability to manage tables and partitions (create, drop and alter).
  Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

Usage and Examples


The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results

Creating Tables
An example statement that would create the page_view table mentioned above would be:

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    STORED AS SEQUENCEFILE;

In this example the columns of the table are specified with the corresponding types. Comments can be attached both at the column level and at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are not actually stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field delimiter and newline as the row delimiter.

The field delimiter can be parametrized if the data is not in the above format, as illustrated in the following example:

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
    STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed since it is determined by Hadoop rather than by Hive.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table, but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
    STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column, in this case userid. The sorting property allows internal operators to take advantage of the known data layout and evaluate queries more efficiently.

A table can also declare columns of complex types, as in the following statement, which adds an array column friends and a map column properties:

    CREATE TABLE page_view(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING, country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
    STORED AS SEQUENCEFILE;

In this example the columns that make up the table row are specified in a similar way to the definition of types. Comments can be attached both at the column level and at the table level. Additionally, the PARTITIONED BY clause defines the partitioning columns, which are different from the data columns and are not actually stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the Hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults. Table names and column names are case insensitive.
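To make the three delimiter levels concrete, the following Python sketch splits one delimited line into the columns of the last page_view definition. It is a simplified model, not Hive's SerDe: it assumes that '1', '2' and '3' in the ROW FORMAT clause denote the control characters with byte values 1, 2 and 3, it treats an empty collection field as an empty list or map, and the names parse_row and COLUMNS are hypothetical.

```python
# Assumption: the ROW FORMAT values '1', '2', '3' are byte values 1, 2, 3.
FIELD_DELIM = "\x01"  # FIELDS TERMINATED BY
ITEM_DELIM = "\x02"   # COLLECTION ITEMS TERMINATED BY
KEY_DELIM = "\x03"    # MAP KEYS TERMINATED BY

COLUMNS = ["viewTime", "userid", "page_url", "referrer_url",
           "friends", "properties", "ip"]


def parse_row(line):
    """Split one newline-terminated line into a dict keyed by column name."""
    fields = line.rstrip("\n").split(FIELD_DELIM)
    row = dict(zip(COLUMNS, fields))
    row["viewTime"] = int(row["viewTime"])
    row["userid"] = int(row["userid"])
    # ARRAY<BIGINT>: items separated by the collection delimiter.
    friends = row["friends"]
    row["friends"] = [int(x) for x in friends.split(ITEM_DELIM)] if friends else []
    # MAP<STRING, STRING>: entries separated by the collection delimiter,
    # key and value separated by the map-key delimiter.
    properties = {}
    if row["properties"]:
        for entry in row["properties"].split(ITEM_DELIM):
            key, value = entry.split(KEY_DELIM, 1)
            properties[key] = value
    row["properties"] = properties
    return row
```

Note that the partition columns dt and country do not appear in the line at all, which reflects the statement above that partitioning columns are not stored with the data.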

Browsing Tables and Partitions


    SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

    SHOW TABLES 'page.*';

To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).

    SHOW PARTITIONS page_view;

To list partitions of a table. If the table is not a partitioned table then an error is thrown.

    DESCRIBE page_view;

To list columns and column types of a table.

    DESCRIBE EXTENDED page_view;

To list columns and all other properties of a table. This prints a lot of information, and not in a pretty format. It is usually used for debugging.

    DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

To list columns and all other properties of a partition. This also prints a lot of information, which is usually used for debugging.

Loading Data
There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert it into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

    CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    ip STRING COMMENT 'IP Address of the User',
                    country STRING COMMENT 'country of origination')
    COMMENT 'This is the staging page view table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/data/staging/page_view';

    hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
    SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
    WHERE pvs.country = 'US';

In the example above nulls are inserted for the array and map types in the destination table, but potentially these can also come from the external table if the proper row formats are specified.

This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local file system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

    LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US')

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory it cannot contain subdirectories. Similarly, the wildcard must match file names only.

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS, the following syntax can be used to load the data into a Hive table:

    LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(dt='2008-06-08', country='US')

It is assumed that the array and map fields in the input .txt files are null fields for these examples.

Simple Query
For all the active users, one can use a query of the following form:

    INSERT OVERWRITE TABLE user_active
    SELECT user.*
    FROM user
    WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query on the Hive CLI:

    SELECT user.*
    FROM user
    WHERE user.active = 1;

Internally, the results are written to a temporary file and then displayed on the Hive client side.

Partition Based Query


Which partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

    INSERT OVERWRITE TABLE xyz_com_page_views
    SELECT page_views.*
    FROM page_views
    WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
          page_views.referrer_url like '%xyz.com';

Note that page_views.date is used here on the assumption that the table was defined with PARTITIONED BY(date DATETIME, country STRING). If your partition column has a different name (the CREATE TABLE examples above call it dt), use that name instead; don't expect .date to do what you think!

Joins
In order to get a demographic breakdown (by gender) of page_view of 2008-03-03, one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:

    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with the LEFT OUTER, RIGHT OUTER or FULL OUTER keywords to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

    INSERT OVERWRITE TABLE pv_users
    SELECT pv.*, u.gender, u.age
    FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
    WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example. Because the right-hand table of a LEFT SEMI JOIN may be referenced only in the ON clause, the date filter is placed there:

    INSERT OVERWRITE TABLE pv_users
    SELECT u.*
    FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id AND pv.date = '2008-03-03');

In order to join more than two tables, the user can use the following syntax:

    INSERT OVERWRITE TABLE pv_friends
    SELECT pv.*, u.gender, u.age, f.friends
    FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
    WHERE pv.date = '2008-03-03';

Note that Hive only supports equi-joins. Also, it is best to put the largest table on the rightmost side of the join to get the best performance.
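The difference between an inner join and a LEFT SEMI JOIN is easiest to see on a tiny data set: the semi join only asks whether a matching key exists, so each left-hand row appears at most once. The Python sketch below models both on in-memory lists; USERS, PAGE_VIEWS and the function names are hypothetical and serve only this illustration.

```python
USERS = [
    {"id": 1, "gender": "F"},
    {"id": 2, "gender": "M"},
    {"id": 3, "gender": "F"},
]
PAGE_VIEWS = [
    {"userid": 1, "page_url": "/a"},
    {"userid": 1, "page_url": "/b"},
    {"userid": 3, "page_url": "/a"},
]


def inner_join(left_rows, right_rows, left_key, right_key):
    """Equi-join: one output row per matching (left, right) pair."""
    return [
        {**left, **right}
        for left in left_rows
        for right in right_rows
        if left[left_key] == right[right_key]
    ]


def left_semi_join(left_rows, right_rows, left_key, right_key):
    """Semi join: keep a left row once if any right row has the same key."""
    right_keys = {row[right_key] for row in right_rows}
    return [row for row in left_rows if row[left_key] in right_keys]
```

Here user 1 has two page views, so the inner join yields that user twice, whereas the semi join returns users 1 and 3 exactly once each and carries no columns from the right-hand table.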

Aggregations
In order to count the number of distinct users by gender one could write the following query:

    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, the following is possible:

    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

However, the following query is not allowed:

    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
    FROM pv_users
    GROUP BY pv_users.gender;

Multi Table/File Inserts


The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). For example, if along with the gender breakdown one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender
    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;

The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to hadoop dfs files.

Dynamic-partition Insert

In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement as illustrated below.

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
           SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
           SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
           SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

In order to load data into all country partitions for a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient since you need a priori knowledge of the list of countries that exist in the input data and must create the partitions beforehand. If the list changes on another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In the dynamic partition insert, the input column values are evaluated to determine which partition each row should be inserted into. If that partition has not been created, it will be created automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared to the multiple insert case.

Below is an example of loading data to all country partitions using one insert statement:

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
           SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:
  country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, dt has a value associated with it, which means it is a static partition column. If a column is a dynamic partition column, its value comes from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause, because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause with (dt, country='US') because that would mean updating every date partition whose country sub-partition is 'US'.
  An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known in the PARTITION clause. Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause.

Semantics of the dynamic partition insert statement:

  When non-empty partitions already exist for the dynamic partition columns (e.g., country='CA' exists under some dt root partition), a partition will be overwritten if the dynamic partition insert sees the same value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.
  Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in a URI (e.g., '%', ':', '/', '#') will be escaped with '%' followed by the two hexadecimal digits of its ASCII value.
  If the input column is of a type other than STRING, its value will first be converted to STRING to be used to construct the HDFS path.
  If the input column value is NULL or an empty string, the row will be put into a special partition, whose name is controlled by the hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose values are not valid partition names. The caveat of this approach is that the bad value will be lost and is replaced by __HIVE_DEFAULT_PARTITION__ if you select the rows in Hive. JIRA HIVE-1309 is a solution to let the user specify a "bad file" to retain the input partition column values as well.
  Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To keep this in check, we define three parameters:
    hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than the threshold, a fatal error will be raised from the mapper/reducer (through a counter) and the whole job will be killed.
    hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic partitions that can be created by one DML. If no mapper/reducer exceeds its limit but the total number of dynamic partitions does, then an exception is raised at the end of the job before the intermediate data are moved to the final destination.
    hive.exec.max.created.files (default value 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by having each mapper/reducer update a Hadoop counter whenever a new file is created. If the total number exceeds hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.
  Another situation we want to protect against is that the user may accidentally specify all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter, hive.exec.dynamic.partition.mode=strict, to prevent the all-dynamic partition case. In the strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partitions at all. The default value is false.
  In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
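The semantics above can be summarized in a small model: the trailing select columns supply the dynamic partition values, NULL or empty values fall into the default partition, special characters are percent-escaped, and a limit caps the number of partitions. The Python sketch below is a simplified illustration and not Hive's implementation. The function names are hypothetical, SPECIAL_CHARS contains only the four characters named in the text (the real set is an assumption left open), and the limit is checked eagerly here rather than at the end of the job.

```python
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"
SPECIAL_CHARS = "%:/#"  # only the examples named in the text


def escape_partition_value(value):
    """Turn a column value into a directory-safe partition value."""
    if value is None or value == "":
        return DEFAULT_PARTITION
    text = str(value)  # non-STRING values are converted to STRING first
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in SPECIAL_CHARS else ch for ch in text
    )


def route_rows(rows, static_spec, dynamic_columns, max_partitions=1000):
    """Group rows by partition path.

    rows: tuples whose last len(dynamic_columns) values are the dynamic
    partition values, selected by position rather than by name.
    static_spec: list of (column, value) pairs for the static partitions.
    """
    split = len(rows[0]) - len(dynamic_columns) if rows else 0
    partitions = {}
    for row in rows:
        data, dynamic_values = row[:split], row[split:]
        parts = ["{}={}".format(name, escape_partition_value(value))
                 for name, value in static_spec]
        parts += ["{}={}".format(name, escape_partition_value(value))
                  for name, value in zip(dynamic_columns, dynamic_values)]
        path = "/".join(parts)
        if path not in partitions and len(partitions) >= max_partitions:
            raise RuntimeError("too many dynamic partitions")
        partitions.setdefault(path, []).append(data)
    return partitions
```

For example, calling route_rows with static_spec [("dt", "2008-06-08")] and dynamic_columns ["country"] groups the input rows under paths such as dt=2008-06-08/country=US, one per distinct country value seen in the data.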

Troubleshooting and best practices:

As stated above, if too many dynamic partitions are created by a particular mapper/reducer, a fatal error can be raised and the job will be killed. The error message looks something like:

    hive> set hive.exec.dynamic.partition.mode=nonstrict;
    hive> FROM page_view_stg pvs
          INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                        from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
    ...
    2010-05-07 11:10:19,816 Stage-1 map = 0%, reduce = 0%
    [Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
    Ended Job = job_201005052204_28178 with errors
    ...

The problem is that one mapper will take a random set of rows, and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions handled by each task will be significantly reduced. The above example query could be rewritten to:

    hive> set hive.exec.dynamic.partition.mode=nonstrict;
    hive> FROM page_view_stg pvs
          INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
                 SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                        from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
                 DISTRIBUTE BY ds, country;

This query will generate a MapReduce job rather than a Map-only job. The SELECT-clause will be converted to a plan for the mappers, and the output will be distributed to the reducers based on the value of the (ds, country) pairs. The INSERT-clause will be converted to the plan in the reducer, which writes to the dynamic partitions.
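The effect of DISTRIBUTE BY on the per-task partition count can be seen in a toy simulation. In the Python sketch below, rows reach map tasks in arrival order (modelled as round-robin, which is an assumption standing in for the random split), while with distribution each (ds, country) key is sent to exactly one reduce task chosen by a hash. The function names are hypothetical, and zlib.crc32 is merely a convenient deterministic hash, not the one Hive uses.

```python
import zlib


def partitions_per_task_without_distribute(keys, num_mappers):
    """Distinct partition keys seen by each mapper in a map-only job."""
    tasks = [set() for _ in range(num_mappers)]
    for index, key in enumerate(keys):
        # Rows land on mappers regardless of their partition key.
        tasks[index % num_mappers].add(key)
    return [len(task) for task in tasks]


def partitions_per_task_with_distribute(keys, num_reducers):
    """Distinct partition keys seen by each reducer after DISTRIBUTE BY."""
    tasks = [set() for _ in range(num_reducers)]
    for key in keys:
        # Every row with the same key goes to the same reducer.
        bucket = zlib.crc32(repr(key).encode("utf-8")) % num_reducers
        tasks[bucket].add(key)
    return [len(task) for task in tasks]
```

Without distribution, every task may end up writing to every partition, so the per-task count approaches the total number of distinct keys. With distribution, each key belongs to one task only, so the per-task counts add up to the total instead of each approaching it.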

Inserting into local files


In certain situations you may want to write the output to a local file, for example so that you can load it into an Excel spreadsheet. This can be accomplished with the following command:

    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
    SELECT pv_gender_sum.*
    FROM pv_gender_sum;

Sampling
The sampling clause allows users to write queries against a sample of the data instead of the whole table. Currently, sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

    INSERT OVERWRITE TABLE pv_gender_sum_sample
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:

    TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or divisor of the number of buckets in the table, as specified at table creation time. A bucket is chosen if bucket_number modulo y is equal to x. So in the above example, the clause

    TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0. On the other hand, the clause

    TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.
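The bucket selection rule can be written out as a small Python function. This is a sketch of the rule exactly as worded above (buckets numbered from 0, a bucket chosen when bucket_number modulo y equals x); the function name and return shape are made up for the illustration, and a given Hive release may number buckets differently.

```python
def sampled_buckets(x, y, num_buckets):
    """Return (bucket_numbers, fraction_of_each_bucket) for BUCKET x OUT OF y.

    Follows the rule as stated in the text: y must be a multiple or a divisor
    of num_buckets, and a bucket is chosen if bucket_number % y == x.
    """
    if y <= num_buckets:
        if num_buckets % y != 0:
            raise ValueError("y must be a divisor of the number of buckets")
        # Whole buckets are selected.
        return [b for b in range(num_buckets) if b % y == x], 1.0
    if y % num_buckets != 0:
        raise ValueError("y must be a multiple of the number of buckets")
    # y is larger than the bucket count: only a fraction of one bucket is read.
    return [x % num_buckets], num_buckets / y


print(sampled_buckets(3, 32, 32))  # one whole bucket
print(sampled_buckets(3, 16, 32))  # two whole buckets
print(sampled_buckets(3, 64, 32))  # half of one bucket
```

For a 32-bucket table the three calls correspond to the three TABLESAMPLE clauses discussed above.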

Union all
The language also supports UNION ALL. For example, suppose there are two different tables that track which user has published a video and which user has published a comment. The following query joins the results of a UNION ALL with the user table to create a single annotated stream for all the video publishing and comment publishing events. Note that the subqueries must select the date column, since the outer query refers to actions.date:

    INSERT OVERWRITE TABLE actions_users
    SELECT u.id, actions.date
    FROM (
        SELECT av.uid AS uid, av.date AS date
        FROM action_video av
        WHERE av.date = '2008-06-03'

        UNION ALL

        SELECT ac.uid AS uid, ac.date AS date
        FROM action_comment ac
        WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON (u.id = actions.uid);

Array Operations
Array columns in tables can currently only be created programmatically. We will be extending this soon to be available as part of the CREATE TABLE statement. For the purpose of the current example, assume that pv.friends is of the type array<INT>, i.e. it is an array of integers. The user can get a specific element in the array by its index, as shown in the following command:

    SELECT pv.friends[2]
    FROM page_views pv;

The select expression gets the third item in the pv.friends array, since indexes start at 0. The user can also get the length of the array using the size function, as shown below:

    SELECT pv.userid, size(pv.friends)
    FROM page_view pv;

Map(Associative Arrays) Operations

Maps provide collections similar to associative arrays. Such structures can currently only be created programmatically. We will be extending this soon. For the purpose of the current example, assume that pv.properties is of the type map<String, String>, i.e. it is an associative array from strings to strings. Accordingly, the following query:

    INSERT OVERWRITE TABLE page_views_map
    SELECT pv.userid, pv.properties['page type']
    FROM page_views pv;

can be used to select the 'page type' property from the page_views table. Similar to arrays, the size function can also be used to get the number of elements in a map, as shown in the following query:

    SELECT size(pv.properties)
    FROM page_view pv;

Custom map/reduce scripts


Users can also plug their own custom mappers and reducers into the data stream by using features natively supported in the Hive language. For example, in order to run a custom mapper script (map_script) and a custom reducer script (reduce_script), the user can issue the following command, which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

Note that columns are converted to strings and delimited by TAB before being fed to the user script, and the standard output of the user script is treated as TAB-separated string columns. User scripts can write debug information to standard error, which is shown on the task detail page in Hadoop.

    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS dt, uid
        CLUSTER BY dt) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.dt, map_output.uid
        USING 'reduce_script'
        AS date, count;

Sample map script (weekday_mapper.py):

    import sys
    import datetime

    # Read TAB-separated (userid, unixtime) rows from standard input
    for line in sys.stdin:
        line = line.strip()
        userid, unixtime = line.split('\t')
        # isoweekday() returns 1 (Monday) through 7 (Sunday)
        weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
        # Emit TAB-separated columns, which is what Hive expects from the script
        print('\t'.join([userid, str(weekday)]))

Of course, both MAP and REDUCE are "syntactic sugar" for the more general SELECT TRANSFORM. The inner query could also have been written as:

    SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;
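The text does not show what reduce_script looks like. As a rough illustration only, a reducer that produces the (date, count) columns used above might count the rows it receives for each dt value. The sketch below is hypothetical: it assumes the input arrives as TAB-separated (dt, uid) lines with all rows for one dt adjacent to each other, which is what CLUSTER BY dt is meant to provide.

```python
import sys
from itertools import groupby


def count_per_key(lines):
    """Yield (dt, row_count) for consecutive runs of lines sharing the same dt.

    Each input line is expected to look like "dt<TAB>uid". Rows for one dt
    must be adjacent, so the input is assumed to be clustered by dt.
    """
    rows = (line.rstrip('\n').split('\t') for line in lines if line.strip())
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield dt, sum(1 for _ in group)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Write TAB-separated output so Hive can split it into the AS columns.
    for dt, count in count_per_key(stdin):
        stdout.write('%s\t%d\n' % (dt, count))
```

To use this as an actual script, call main() at the bottom of the file. Keeping the counting logic in count_per_key makes it easy to check against a handful of sample lines without a Hadoop cluster.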

Schema-less map/reduce: If there is no "AS" clause after "USING map_script", Hive assumes that the output of the script contains two parts: the key, which is everything before the first tab, and the value, which is everything after the first tab. Note that this is different from specifying "AS key, value", because in that case value will only contain the portion between the first tab and the second tab if there are multiple tabs. In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. The user still needs to know the reduce output schema, because that has to match the table we are inserting into.

    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        CLUSTER BY key) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.key, map_output.value
        USING 'reduce_script'
        AS date, count;

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so that the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of the sort columns, but that is not required.

    FROM (
        FROM pv_users
        MAP pv_users.userid, pv_users.date
        USING 'map_script'
        AS c1, c2, c3
        DISTRIBUTE BY c2
        SORT BY c2, c1) map_output
    INSERT OVERWRITE TABLE pv_users_reduced
        REDUCE map_output.c1, map_output.c2, map_output.c3
        USING 'reduce_script'
        AS date, count;
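The difference between the schema-less split and "AS key, value" is easiest to see on a line that contains more than one tab. The following Python sketch mimics the two behaviours described above; the function names are invented for the illustration, and the handling of a line with no tab at all is an assumption, not documented behaviour.

```python
def split_schemaless(line):
    """No AS clause: key is before the first tab, value is everything after it."""
    key, _, value = line.rstrip('\n').partition('\t')
    return key, value


def split_as_key_value(line):
    """AS key, value: value is only the text between the first and second tab."""
    fields = line.rstrip('\n').split('\t')
    key = fields[0]
    value = fields[1] if len(fields) > 1 else None  # assumption for a missing column
    return key, value


line = "user1\t2008-06-03\textra\n"
print(split_schemaless(line))    # value keeps the second tab and what follows
print(split_as_key_value(line))  # value stops at the second tab
```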

Co-Groups
Among the user community using map/reduce, cogroup is a fairly common operation in which the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns of the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we want to cogroup the rows from the action_video and action_comment tables on the uid column and send them to the 'reduce_script' custom reducer. The following syntax can be used:

    FROM (
        FROM (
            FROM action_video av
            SELECT av.uid AS uid, av.id AS id, av.date AS date

            UNION ALL

            FROM action_comment ac
            SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
        ) union_actions
        SELECT union_actions.uid, union_actions.id, union_actions.date
        CLUSTER BY union_actions.uid) map
    INSERT OVERWRITE TABLE actions_reduced
        SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);

Altering Tables
To rename an existing table to a new name (if a table with the new name already exists, an error is returned):

    ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table (be sure to use the same column types, and to include an entry for each preexisting column):

    ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:

    ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Note that a change in the schema (such as adding columns) preserves the schema for the old partitions of the table, in case it is a partitioned table. All queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.

In later versions, we can make this behavior configurable: assuming certain values, as opposed to throwing an error, when a column is not found in a particular partition.

Dropping Tables and Partitions


Dropping tables is fairly trivial. A drop on the table implicitly drops any indexes (this is a future feature) that have been built on the table. The associated command is:

    DROP TABLE pv_users;

To drop a partition, alter the table to drop the partition:

    ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')

Note that any data for this table or partition will be dropped and may not be recoverable.

This is the Hive Language Manual.

- Hive CLI
- Variable Substitution
- Data Types
- Data Definition Statements
- Data Manipulation Statements
- Select
- Group By
- Sort/Distribute/Cluster/Order By
- Transform and Map-Reduce Scripts
- Operators and User-Defined Functions
- XPath-specific Functions
- Joins
- Lateral View
- Union
- Sub Queries
- Sampling
- Explain
- Virtual Columns
- Locks
- Import/Export
- Configuration Properties
- Authorization
- Statistics
- Archiving


Hive Operators and Functions


- Hive Plug-in Interfaces - User-Defined Functions and SerDes
- Reflect UDF
- Guide to Hive Operators and Functions
- Functions for Statistics and Data Mining

Hive Web Interface


What is the Hive Web Interface
The Hive web interface is an alternative to using the Hive command line interface. Using the web interface is a great way to get started with Hive.

Features
Schema Browsing
An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web based schema browser. The Hive Meta Data is presented in a hierarchical manner allowing you to start at the database level and click to get information about tables including the SerDe, column names, and column types.

Detached query execution


A power user issuing multiple hive queries simultaneously would have multiple CLI windows open. The hive web interface manages the session on the web server, not from inside the CLI window. This allows a user to start multiple queries and return to the web interface later to check the status.

No local installation
Any user with a web browser can work with Hive. This has the usual web interface benefits. In particular, a user wishing to interact with Hadoop or Hive directly requires access to many ports, whereas a remote or VPN user only requires access to the Hive web interface, which by default runs on 0.0.0.0 tcp/9999.

Configuration
Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk, you already have it. You should not need to edit the defaults for the Hive web interface. HWI uses:

    <property>
      <name>hive.hwi.listen.host</name>
      <value>0.0.0.0</value>
      <description>This is the host address the Hive Web Interface will listen on</description>
    </property>
    <property>
      <name>hive.hwi.listen.port</name>
      <value>9999</value>
      <description>This is the port the Hive Web Interface will listen on</description>
    </property>
    <property>
      <name>hive.hwi.war.file</name>
      <value>${HIVE_HOME}/lib/hive_hwi.war</value>
      <description>This is the WAR file with the jsp content for Hive Web Interface</description>
    </property>

You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.

Start up
When Hive is started with no arguments, the CLI is invoked. Hive has an extension architecture used to start other Hive daemons. Jetty requires Apache Ant to start HWI. You should define ANT_LIB as an environment variable or add it to the Hive invocation.

    export ANT_LIB=/opt/ant/lib
    bin/hive --service hwi

Java has no direct way of daemonizing. In a production environment you should create a wrapper script.

    nohup bin/hive --service hwi > /dev/null 2> /dev/null &

For help on the service invocation or a list of parameters, add --help:

    bin/hive --service hwi --help

Authentication
Hadoop currently uses environmental properties to determine the user name and group vector. Thus Hive and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first connect to the Hive Web Interface, you are prompted for a user name and groups. This feature was added to support installations using different schedulers. If you want to tighten up security, you will need to patch the source of the Hive Session Manager, or you may be able to tweak the JSP to accomplish this.

Accessing
In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi on your web browser.

Tips and tricks

Result file
The result file is local to the web server. A query that produces massive output should set the result file to /dev/null.

Debug Mode
Debug mode is used when the result file should contain not only the result of the Hive query but also the other messages.

Set Processor
In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by the Set Processor. In the web interface, use the form 'x=5', not 'set x=5'.

Walk through
Authorize
[Screenshots not available: 1_hwi_authorize.png, 2_hwi_authorize.png]

Schema Browser
[Screenshots not available: 3_schema_table.png, 4_schema_browser.png]

Diagnostics
[Screenshot not available: 5_diagnostic.png]

Running a query
[Screenshots not available: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png]

- Command Line
- JDBC
  - JDBC Client Sample Code
  - Running the JDBC Sample Code
  - JDBC Client Setup for a Secure Cluster
- Python
- PHP
- Thrift Java Client
- ODBC
- Thrift C++ Client

This page describes the different clients supported by Hive. The command line client currently only supports an embedded server. The JDBC and thrift-java clients support both embedded and standalone servers. Clients in other languages only support standalone servers. For details about the standalone server see Hive Server.

Command Line
Operates in embedded mode only, i.e., it needs to have access to the hive libraries. For more details see Getting Started.

JDBC
For embedded mode, uri is just "jdbc:hive://". For standalone server, uri is "jdbc:hive://host:port/dbname" where host and port are determined by where the hive server is run. For example, "jdbc:hive://localhost:10000/default". Currently, the only dbname supported is "default".

JDBC Client Sample Code


    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveJdbcClient {
      private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

      /**
       * @param args
       * @throws SQLException
       */
      public static void main(String[] args) throws SQLException {
        try {
          Class.forName(driverName);
        } catch (ClassNotFoundException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
          System.exit(1);
        }
        Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.executeQuery("drop table " + tableName);
        ResultSet res = stmt.executeQuery("create table " + tableName + " (key int, value string)");

        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        if (res.next()) {
          System.out.println(res.getString(1));
        }

        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1));
        }
      }
    }

Running the JDBC Sample Code


    # Then on the command-line
    $ javac HiveJdbcClient.java

    # To run the program in standalone mode, we need the following jars in the classpath
    # from hive/build/dist/lib
    #     hive_exec.jar
    #     hive_jdbc.jar
    #     hive_metastore.jar
    #     hive_service.jar
    #     libfb303.jar
    #     log4j-1.2.15.jar
    #
    # from hadoop/build
    #     hadoop-*-core.jar
    #
    # To run the program in embedded mode, we need the following additional jars in the classpath
    # from hive/build/dist/lib
    #     antlr-runtime-3.0.1.jar
    #     derby.jar
    #     jdo2-api-2.1.jar
    #     jpox-core-1.2.2.jar
    #     jpox-rdbms-1.2.2.jar
    #
    # as well as hive/build/dist/conf

    $ java -cp $CLASSPATH HiveJdbcClient

    # Alternatively, you can run the following bash script, which will seed the data file
    # and build your classpath before invoking the client.

    #!/bin/bash
    HADOOP_HOME=/your/path/to/hadoop
    HIVE_HOME=/your/path/to/hive

    # Seed a ctrl-A separated data file with two rows
    echo -e '1\x01foo' > /tmp/a.txt
    echo -e '2\x01bar' >> /tmp/a.txt

    HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-*-core.jar)
    CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf

    for i in ${HIVE_HOME}/lib/*.jar ; do
        CLASSPATH=$CLASSPATH:$i
    done

    java -cp $CLASSPATH HiveJdbcClient

JDBC Client Setup for a Secure Cluster


To configure Hive on a secure cluster, add the directory containing hive-site.xml to the CLASSPATH of the JDBC client.

Python
Operates only on a standalone server. Set (and export) PYTHONPATH to build/dist/lib/py. The Python modules imported in the code below are generated by building Hive.

Please note that the generated Python module names have changed in Hive trunk.

    #!/usr/bin/env python

    import sys

    from hive import ThriftHive
    from hive.ttypes import HiveServerException
    from thrift import Thrift
    from thrift.transport import TSocket
    from thrift.transport import TTransport
    from thrift.protocol import TBinaryProtocol

    try:
        # Connect to the standalone Hive server over a buffered Thrift socket
        transport = TSocket.TSocket('localhost', 10000)
        transport = TTransport.TBufferedTransport(transport)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)

        client = ThriftHive.Client(protocol)
        transport.open()

        client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
        client.execute("LOAD DATA LOCAL INPATH '/path' INTO TABLE r")

        # Fetch the result one row at a time
        client.execute("SELECT * FROM r")
        while True:
            row = client.fetchOne()
            if row is None:
                break
            print(row)

        # Fetch the whole result at once
        client.execute("SELECT * FROM r")
        print(client.fetchAll())

        transport.close()

    except Thrift.TException as tx:
        print('%s' % (tx.message))

PHP
Operates only on a standalone server.

    <?php
    // set THRIFT_ROOT to php directory of the hive distribution
    $GLOBALS['THRIFT_ROOT'] = '/lib/php/';

    // load the required files for connecting to Hive
    require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
    require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
    require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';

    // Set up the transport/protocol/client
    $transport = new TSocket('localhost', 10000);
    $protocol = new TBinaryProtocol($transport);
    $client = new ThriftHiveClient($protocol);
    $transport->open();

    // run queries, metadata calls etc
    $client->execute('SELECT * from src');
    var_dump($client->fetchAll());
    $transport->close();

Thrift Java Client


Operates both in embedded mode and on standalone server.

ODBC
Operates only on a standalone server. See Hive ODBC.

Thrift C++ Client


Operates only on a standalone server. In the works.

- Beeline - New Command Line shell
- JDBC
  - JDBC Client Sample Code
  - Running the JDBC Sample Code
  - JDBC Client Setup for a Secure Cluster

This page describes the different clients supported by HiveServer2.

Beeline - New Command Line shell


HiveServer2 supports a new command shell, Beeline, that works with HiveServer2. It is a JDBC client based on the SQLLine CLI (http://sqlline.sourceforge.net/). The detailed SQLLine documentation is applicable to Beeline as well. The Beeline shell works in both embedded and remote mode. In embedded mode, it runs an embedded Hive (similar to the Hive CLI), whereas remote mode is for connecting to a separate HiveServer2 process over Thrift.

Example

    % bin/beeline
    Hive version 0.11.0-SNAPSHOT by Apache
    beeline> !connect jdbc:hive2://localhost:10000 scott tiger org.apache.hive.jdbc.HiveDriver
    !connect jdbc:hive2://localhost:10000 scott tiger org.apache.hive.jdbc.HiveDriver
    Connecting to jdbc:hive2://localhost:10000
    Connected to: Hive (version 0.10.0)
    Driver: Hive (version 0.10.0-SNAPSHOT)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://localhost:10000> show tables;
    show tables;
    +-------------------+
    |     tab_name      |
    +-------------------+
    | primitives        |
    | src               |
    | src1              |
    | src_json          |
    | src_sequencefile  |
    | src_thrift        |
    | srcbucket         |
    | srcbucket2        |
    | srcpart           |
    +-------------------+
    9 rows selected (1.079 seconds)

JDBC
HiveServer2 has a new JDBC driver. It supports both embedded and remote access to HiveServer2. The JDBC connection URL has the prefix jdbc:hive2:// and the driver class is org.apache.hive.jdbc.HiveDriver. Note that this is different from the old HiveServer. For a remote server, the URL format is jdbc:hive2://<host>:<port>/<db> (the default port for HiveServer2 is 10000). For an embedded server, the URL format is jdbc:hive2:// (no host or port).

JDBC Client Sample Code


    import java.sql.SQLException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.sql.DriverManager;

    public class HiveJdbcClient {
      private static String driverName = "org.apache.hive.jdbc.HiveDriver";

      /**
       * @param args
       * @throws SQLException
       */
      public static void main(String[] args) throws SQLException {
        try {
          Class.forName(driverName);
        } catch (ClassNotFoundException e) {
          // TODO Auto-generated catch block
          e.printStackTrace();
          System.exit(1);
        }
        // replace "hive" here with the name of the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        String tableName = "testHiveDriverTable";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (key int, value string)");

        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
          System.out.println(res.getString(1));
        }

        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
        String filepath = "/tmp/a.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        stmt.execute(sql);

        // select * query
        sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
          System.out.println(res.getString(1));
        }
      }
    }

Running the JDBC Sample Code


    # Then on the command-line
    $ javac HiveJdbcClient.java

    # To run the program in standalone mode, we need the following jars in the classpath
    # from hive/build/dist/lib
    #     hive-jdbc*.jar
    #     hive-service*.jar
    #     libfb303-0.9.0.jar
    #     libthrift-0.9.0.jar
    #     log4j-1.2.16.jar
    #     slf4j-api-1.6.1.jar
    #     slf4j-log4j12-1.6.1.jar
    #     commons-logging-1.0.4.jar
    #
    # Following additional jars are needed for the kerberos secure mode
    #     hive-exec*.jar
    #     commons-configuration-1.6.jar
    #     and from hadoop - hadoop-*core.jar
    #
    # To run the program in embedded mode, we need the following additional jars in the classpath
    # from hive/build/dist/lib
    #     hive-exec*.jar
    #     hive-metastore*.jar
    #     antlr-runtime-3.0.1.jar
    #     derby.jar
    #     jdo2-api-2.1.jar
    #     jpox-core-1.2.2.jar
    #     jpox-rdbms-1.2.2.jar
    #
    # from hadoop/build
    #     hadoop-*-core.jar
    #
    # as well as hive/build/dist/conf, any HIVE_AUX_JARS_PATH set,
    # and hadoop jars necessary to run MR jobs (eg lzo codec)

    $ java -cp $CLASSPATH HiveJdbcClient

    # Alternatively, you can run the following bash script, which will seed the data file
    # and build your classpath before invoking the client. The script adds all the
    # additional jars needed for using HiveServer2 in embedded mode as well.

    #!/bin/bash
    HADOOP_HOME=/your/path/to/hadoop
    HIVE_HOME=/your/path/to/hive

    # Seed a ctrl-A separated data file with two rows
    echo -e '1\x01foo' > /tmp/a.txt
    echo -e '2\x01bar' >> /tmp/a.txt

    HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-*-core.jar)
    CLASSPATH=.:$HIVE_HOME/conf:`hadoop classpath`

    for i in ${HIVE_HOME}/lib/*.jar ; do
        CLASSPATH=$CLASSPATH:$i
    done

    java -cp $CLASSPATH HiveJdbcClient

JDBC Client Setup for a Secure Cluster


When connecting to HiveServer2 with Kerberos authentication, the URL format is jdbc:hive2://<host>:<port>/<db>;principal=<Server_Principal_of_HiveServer2>. The client needs to have a valid Kerberos ticket in the ticket cache before connecting. In the case of LDAP or custom pass-through authentication, the client needs to pass a valid user name and password to the JDBC connection API.
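The three HiveServer2 URL shapes described on this page (embedded, remote, and remote with a Kerberos principal) can be assembled with a small helper. This is only a string-building sketch in Python; the function name is made up, and the host and principal values in the usage lines are placeholders, not real endpoints.

```python
def hiveserver2_jdbc_url(host=None, port=10000, db="default", principal=None):
    """Build a HiveServer2 JDBC URL from the formats given in the text.

    host=None   -> embedded mode: jdbc:hive2://
    host given  -> remote mode:   jdbc:hive2://<host>:<port>/<db>
    principal   -> appended as ;principal=<Server_Principal_of_HiveServer2>
    """
    if host is None:
        return "jdbc:hive2://"
    url = "jdbc:hive2://%s:%d/%s" % (host, port, db)
    if principal:
        url += ";principal=%s" % principal
    return url


print(hiveserver2_jdbc_url())
print(hiveserver2_jdbc_url("localhost"))
# Placeholder principal, for illustration only
print(hiveserver2_jdbc_url("hs2.example.com", principal="hive/hs2.example.com@EXAMPLE.COM"))
```

The resulting string is what would be passed as the first argument to DriverManager.getConnection in the Java sample above.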

This page documents changes that are visible to users.

Hive Trunk (0.8.0-dev)

Hive 0.7.1

Hive 0.7.0
HIVE-1790: Add support for HAVING clause.

Hive 0.6.0

Hive 0.5.0


Earliest version AvroSerde is available
The AvroSerde is available in Hive 0.9.1 and greater.

Overview - Working with Avro from Hive


The AvroSerde allows users to read or write Avro data as Hive tables. The AvroSerde's bullet points:

- Infers the schema of the Hive table from the Avro schema.
- Reads all Avro files within a table against a specified schema, taking advantage of Avro's backwards compatibility abilities.
- Supports arbitrarily nested schemas.
- Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don't exist in Hive and are automatically converted by the AvroSerde.
- Understands compressed Avro files.
- Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and returns null when appropriate.
- Writes any Hive table to Avro files.
- Has worked reliably against our most convoluted Avro schemas in our ETL process.

Requirements
The AvroSerde has been built and tested against Hive 0.9.1 and Avro 1.5.

Avro to Hive type conversion


While most Avro types convert directly to equivalent Hive types, there are some which do not exist in Hive and are converted to reasonable equivalents. Also, the AvroSerde special-cases unions of null and another type, as described below:

    Avro type   Becomes Hive type   Note
    ---------   -----------------   ----
    null        void
    boolean     boolean
    int         int
    long        bigint
    float       float
    double      double
    bytes       Array[smallint]     Hive converts these to signed bytes.
    string      string
    record      struct
    map         map
    list        array
    union       union               See the note on unions below.
    enum        string              Hive has no concept of enums.
    fixed       Array[smallint]     Hive converts the bytes to signed int.

Note on unions: Unions of [T, null] transparently convert to nullable T; other types translate directly to Hive's unions of those types. However, unions were introduced in Hive 0.7 and cannot currently be used in where/group-by statements. They are essentially look-at-only. Because the AvroSerde transparently converts [T, null] to nullable T, this limitation only applies to unions of multiple types or unions not of a single type and null.
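The conversion table, including the special case for [T, null] unions, can be restated as a lookup in Python. This is a sketch of the table only, not the AvroSerde's code: the function name is invented, nested types are reduced to their top-level Hive name, and the type strings are copied from the table as written.

```python
# Top-level Avro type name -> Hive type name, copied from the table above.
AVRO_TO_HIVE = {
    "null": "void",
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "Array[smallint]",
    "string": "string",
    "record": "struct",
    "map": "map",
    "list": "array",
    "enum": "string",
    "fixed": "Array[smallint]",
}


def hive_type_for(avro_type):
    """Map an Avro type (a name, or a list of names for a union) to a Hive type."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        # A union of exactly one type and null becomes the nullable type itself.
        if len(avro_type) == 2 and len(non_null) == 1:
            return hive_type_for(non_null[0])
        # Any other union maps to a Hive union of the converted member types.
        return "uniontype<%s>" % ",".join(hive_type_for(t) for t in avro_type)
    return AVRO_TO_HIVE[avro_type]


print(hive_type_for("long"))
print(hive_type_for(["int", "null"]))
print(hive_type_for(["float", "boolean", "string"]))
```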

Creating Avro-backed Hive tables


To create an Avro-backed table, specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe, the inputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the outputformat as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location from which the AvroSerde will pull the most current schema for the table. For example:

    CREATE TABLE kst
      PARTITIONED BY (ds string)
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES (
        'avro.schema.url'='http://schema_provider/kst.avsc');

In this example we're pulling the source-of-truth reader schema from a webserver. Other options for providing the schema are described below. Add the Avro files to the database (or create an external table) using standard Hive operations (http://wiki.apache.org/hadoop/Hive/LanguageManual/DML). This table might result in a description as below:

    hive> describe kst;
    OK
    string1         string                                                            from deserializer
    string2         string                                                            from deserializer
    int1            int                                                               from deserializer
    boolean1        boolean                                                           from deserializer
    long1           bigint                                                            from deserializer
    float1          float                                                             from deserializer
    double1         double                                                            from deserializer
    inner_record1   struct<int_in_inner_record1:int,string_in_inner_record1:string>   from deserializer
    enum1           string                                                            from deserializer
    array1          array<string>                                                     from deserializer
    map1            map<string,string>                                                from deserializer
    union1          uniontype<float,boolean,string>                                   from deserializer
    fixed1          array<tinyint>                                                    from deserializer
    null1           void                                                              from deserializer
    unionnullint    int                                                               from deserializer
    bytes1          array<tinyint>                                                    from deserializer

At this point, the Avro-backed table can be worked with in Hive like any other table.

Writing tables to Avro files


The AvroSerde can serialize any Hive table to Avro files. This makes it effectively an any-Hive-type to Avro converter. In order to write a table to an Avro file, you must first create an appropriate Avro schema. Create-as-select type statements are not currently supported. Types translate as detailed in the table above. For types that do not translate directly, there are a few items to keep in mind:

- Types that may be null must be defined as a union of that type and Null within Avro. A null in a field that is not so defined will result in an exception during the save. No changes need be made to the Hive schema to support this, as all fields in Hive can be null.
- The Avro Bytes type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to Bytes during the saving process.
- The Avro Fixed type should be defined in Hive as lists of tiny ints. The AvroSerde will convert these to Fixed during the saving process.
- The Avro Enum type should be defined in Hive as strings, since Hive doesn't have a concept of enums. Ensure that only valid enum values are present in the table: trying to save a non-defined enum will result in an exception.
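As a rough aid when hand-writing such a schema, the nullable-union and enum rules above can be captured in a couple of Python helpers. The helper names are hypothetical and the sketch only builds and checks plain dictionaries; it does not call any Avro library.

```python
import json


def avro_field(name, avro_type, nullable=False):
    """Return one Avro field definition; nullable fields become [type, "null"]."""
    return {"name": name, "type": [avro_type, "null"] if nullable else avro_type}


def invalid_enum_values(values, symbols):
    """Return the values that are not declared enum symbols (these would fail on save)."""
    allowed = set(symbols)
    return [v for v in values if v not in allowed]


fields = [
    avro_field("string1", "string"),
    avro_field("nullableint", "int", nullable=True),
    avro_field("enum1", {"type": "enum", "name": "enum1_values",
                         "symbols": ["BLUE", "RED", "GREEN"]}),
]
print(json.dumps(fields, indent=2))
print(invalid_enum_values(["BLUE", "PURPLE"], ["BLUE", "RED", "GREEN"]))
```

Running a check like invalid_enum_values over the source column before the insert is one way to catch a non-defined enum value before the job fails.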

Example
Consider the following Hive table, which coincidentally covers all types of Hive data types, making it a good example:

    CREATE TABLE test_serializer(
      string1 STRING,
      int1 INT,
      tinyint1 TINYINT,
      smallint1 SMALLINT,
      bigint1 BIGINT,
      boolean1 BOOLEAN,
      float1 FLOAT,
      double1 DOUBLE,
      list1 ARRAY<STRING>,
      map1 MAP<STRING,INT>,
      struct1 STRUCT<sint:INT,sboolean:BOOLEAN,sstring:STRING>,
      union1 uniontype<FLOAT, BOOLEAN, STRING>,
      enum1 STRING,
      nullableint INT,
      bytes1 ARRAY<TINYINT>,
      fixed1 ARRAY<TINYINT>)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY ':'
      MAP KEYS TERMINATED BY '#'
      LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

To save this table as an Avro file, create an equivalent Avro schema (the namespace and actual name of the record are not important):

    {
      "namespace": "com.linkedin.haivvreo",
      "name": "test_serializer",
      "type": "record",
      "fields": [
        { "name":"string1", "type":"string" },
        { "name":"int1", "type":"int" },
        { "name":"tinyint1", "type":"int" },
        { "name":"smallint1", "type":"int" },
        { "name":"bigint1", "type":"long" },
        { "name":"boolean1", "type":"boolean" },
        { "name":"float1", "type":"float" },
        { "name":"double1", "type":"double" },
        { "name":"list1", "type":{"type":"array", "items":"string"} },
        { "name":"map1", "type":{"type":"map", "values":"int"} },
        { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
            { "name":"sInt", "type":"int" },
            { "name":"sBoolean", "type":"boolean" },
            { "name":"sString", "type":"string" } ] } },
        { "name":"union1", "type":["float", "boolean", "string"] },
        { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
        { "name":"nullableint", "type":["int", "null"] },
        { "name":"bytes1", "type":"bytes" },
        { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
      ]
    }

If the table were backed by a csv such as:

| string1 | int1 | tinyint1 | smallint1 | bigint1 | boolean1 | float1 | double1 | list1 | map1 | struct1 | union1 | enum1 | nullableint | bytes1 | fixed1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| why hello there | 42 | 3 | 100 | 1412341 | true | 42.43 | 85.23423424 | alpha:beta:gamma | Earth#42:Control#86:Bob#31 | 17:true:Abe Linkedin | 0:3.141459 | BLUE | 72 | 0:1:2:3:4:5 | 50:51:53 |
| another record | 98 | 4 | 101 | 9999999 | false | 99.89 | 0.00000009 | beta | Earth#101 | 1134:false:wazzup | 1:true | RED | NULL | 6:7:8:9:10 | 54:55:56 |
| third record | 45 | 5 | 102 | 999999999 | true | 89.99 | 0.00000000000009 | alpha:gamma | Earth#237:Bob#723 | 102:false:BNL | 2:Time to go home | GREEN | NULL | 11:12:13 | 57:58:59 |

one can write it out to Avro with:

    CREATE TABLE as_avro
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES (
        'avro.schema.url'='file:///path/to/the/schema/test_serializer.avsc');

    INSERT OVERWRITE TABLE as_avro SELECT * FROM test_serializer;

The files written by the Hive job are valid Avro files; however, MapReduce doesn't add the standard .avro extension. If you copy these files out, you'll likely want to rename them with .avro.

Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. Therefore, it is incumbent on the query writer to make sure the target column types are correct. If they are not, Avro may accept the type or it may throw an exception, depending on the particular combination of types.
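Because values are matched by position rather than by name, it can help to compare the source columns with the target schema before running the insert. The following Python sketch shows one way to do that for primitive types only; the function name `positional_mismatches` is hypothetical, and the check is a rough pre-flight aid rather than a reproduction of what the AvroSerde itself does.

```python
# Avro type expected for each primitive Hive type, per this section's example schema.
HIVE_TO_AVRO = {
    "string": "string", "int": "int", "tinyint": "int", "smallint": "int",
    "bigint": "long", "boolean": "boolean", "float": "float", "double": "double",
}

def positional_mismatches(hive_columns, avro_fields):
    """Compare columns and fields by position only, ignoring names.

    hive_columns: list of (name, hive_type); avro_fields: list of Avro field dicts.
    Returns a list of (position, hive_name, avro_name, problem) tuples.
    """
    problems = []
    if len(hive_columns) != len(avro_fields):
        problems.append((None, None, None, "column count differs"))
    for pos, ((col, hive_type), field) in enumerate(zip(hive_columns, avro_fields)):
        expected = HIVE_TO_AVRO.get(hive_type.lower())
        actual = field["type"]
        # A nullable field is a union such as ["int", "null"]; accept any member.
        accepted = actual if isinstance(actual, list) else [actual]
        # Complex Hive types are not in the map and are skipped by this sketch.
        if expected is not None and expected not in accepted:
            problems.append((pos, col, field["name"],
                             f"expected {expected}, found {actual}"))
    return problems

columns = [("string1", "STRING"), ("nullableint", "INT")]
swapped = [{"name": "nullableint", "type": ["int", "null"]},
           {"name": "string1", "type": "string"}]
for problem in positional_mismatches(columns, swapped):
    print(problem)
```

In the usage above the field names all exist on both sides, yet both positions are reported, which is exactly the situation the paragraph warns about.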

Specifying the Avro schema for a table


There are three ways to provide the reader schema for an Avro table, all of which involve parameters to the serde. As the schema evolves, one can update these values by updating the parameters in the table.

Use avro.schema.url
Specifies a URL to access the schema from. For http schemas, this works for testing and small-scale clusters, but as the schema will be accessed at least once from each task in the job, this can quickly turn the job into a DDoS attack against the URL provider (a web server, for instance). Use caution when using this parameter for anything other than testing.

The URL can also point to a location on HDFS, for instance hdfs://your-nn:9000/path/to/avsc/file. The AvroSerde will then read the file from HDFS, which should provide resiliency against many reads at once. Note that the serde will read this file from every mapper, so it's a good idea to set the replication of the schema file to a high value to provide good locality for the readers. The schema file itself should be relatively small, so this does not add a significant amount of overhead to the process.

Use avro.schema.literal and embed the schema in the create statement

One can embed the schema directly into the create statement. This works if the schema doesn't have any single quotes (or they are appropriately escaped), as Hive uses single quotes to delimit the parameter value. For instance:

    CREATE TABLE embedded
      COMMENT "just drop the schema right into the HQL"
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES (
        'avro.schema.literal'='{
          "namespace": "com.howdy",
          "name": "some_schema",
          "type": "record",
          "fields": [ { "name":"string1","type":"string"}]
        }');

Note that the value is enclosed in single quotes and just pasted into the create statement.

Use avro.schema.literal and pass the schema into the script


Hive can do simple variable substitution, and one can pass the schema embedded in a variable to the script. Note that to do this, the schema must be completely escaped (carriage returns converted to \n, tabs to \t, quotes escaped, etc). An example:

    set hiveconf:schema;
    DROP TABLE example;
    CREATE TABLE example
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES (
        'avro.schema.literal'='${hiveconf:schema}');

To execute this script file, assuming $SCHEMA has been defined to be the escaped schema value:

    hive -hiveconf schema="${SCHEMA}" -f your_script_file.sql

Note that $SCHEMA is interpolated inside the quotes to correctly handle spaces within the schema.
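One way to produce a single-line schema value is to re-serialize the schema with a JSON library, which removes raw line breaks and tabs between tokens. The Python sketch below assumes the schema is valid JSON; the helper names are hypothetical, and it only checks for single quotes rather than attempting to escape them, since the exact escaping Hive expects is not spelled out here.

```python
import json

def schema_for_hiveconf(schema_text):
    """Return the schema as one line with no raw newlines or tabs."""
    # Re-serializing parsed JSON drops formatting whitespace between tokens.
    compact = json.dumps(json.loads(schema_text), separators=(",", ":"))
    if "'" in compact:
        # The script wraps the value in single quotes, so these need escaping first.
        raise ValueError("schema contains a single quote; escape it before use")
    return compact

def hive_command(schema_text, script="your_script_file.sql"):
    """Argument list for the hive invocation; no shell quoting is involved."""
    return ["hive", "-hiveconf", "schema=" + schema_for_hiveconf(schema_text),
            "-f", script]

pretty = """{
\t"namespace": "com.howdy",
\t"name": "some_schema",
\t"type": "record",
\t"fields": [ { "name": "string1", "type": "string" } ]
}"""
print(hive_command(pretty))
```

The resulting list can be handed to `subprocess.run`, which passes each element as a separate argument so that spaces in the schema are not split by a shell.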

Use none to ignore either avro.schema.literal or avro.schema.url


Hive does not provide an easy way to unset or remove a property. If you wish to switch from using avro.schema.url to avro.schema.literal, or vice versa, set the to-be-ignored value to none and the AvroSerde will treat it as if it were not set.

If something goes wrong


Hive tends to swallow exceptions from the AvroSerde that occur before job submission. To force Hive to be more verbose, it can be started with hive -hiveconf hive.root.logger=INFO,console, which will print orders of magnitude more information to the console and will likely include any information the AvroSerde is trying to give you about what went wrong. If the AvroSerde encounters an error during MapReduce, the stack trace will be provided in the failed task log, which can be examined from the JobTracker's web interface. The AvroSerde only emits the AvroSerdeException; look for these, and please include them in any bug reports. The most common is expected to be an exception raised while attempting to serialize a type that is incompatible with what Avro is expecting.

FAQ
Why do I get error-error-error-error-error-error-error and a message to check avro.schema.literal and avro.schema.url when describing a table or running a query against a table?

The AvroSerde returns this message when it has trouble finding or parsing the schema provided by either the avro.schema.literal or avro.schema.url value. It is unable to be more specific because Hive expects all calls to the serde config methods to be successful, meaning we are unable to return an actual exception. By signaling an error via this message, the table is left in a good state and the incorrect value can be corrected with a call to alter table T set TBLPROPERTIES.
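Since the serde cannot say what is wrong with the schema, a quick structural check before setting the table property can save a round trip. The Python sketch below is an assumption-laden pre-flight check, not a full Avro parser: it only verifies that the text is JSON and looks like a record with named, typed fields. The function name `check_record_schema` is hypothetical.

```python
import json

def check_record_schema(schema_text):
    """Return a list of problems found in an Avro record schema given as JSON text."""
    try:
        schema = json.loads(schema_text)
    except ValueError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(schema, dict):
        return ["top level is not a JSON object"]
    problems = []
    if schema.get("type") != "record":
        problems.append('top-level "type" is not "record"')
    if not schema.get("name"):
        problems.append('missing "name"')
    fields = schema.get("fields")
    if not isinstance(fields, list):
        problems.append('"fields" is missing or not a list')
    else:
        for i, field in enumerate(fields):
            if not isinstance(field, dict) or "name" not in field or "type" not in field:
                problems.append(f"field {i} lacks a name or type")
    return problems

good = '{"type": "record", "name": "some_schema", "fields": [{"name": "string1", "type": "string"}]}'
print(check_record_schema(good))
print(check_record_schema('{"type": "record", "name": "x", "fields": [{"name": "a"}]'))
```

An empty list means the schema passed these basic checks; it may still be rejected by Avro for reasons this sketch does not cover.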

Installing Hive
Installing Hive is simple and only requires having Java 1.6 and Ant installed on your machine. Hive is available via SVN at http://svn.apache.org/repos/asf/hive/trunk. You can download it by running the following command:

    $ svn co http://svn.apache.org/repos/asf/hive/trunk hive

To build Hive, execute the following command in the base directory:

    $ ant package

It will create the subdirectory build/dist with the following contents:

- README.txt: readme file
- bin/: directory containing all the shell scripts
- lib/: directory containing all required jar files
- conf/: directory with configuration files
- examples/: directory with sample input and query files

The subdirectory build/dist should contain all the files necessary to run Hive. You can run it from there or copy it to a different location, if you prefer.

In order to run Hive, you must have hadoop in your path or have defined the environment variable HADOOP_HOME with the hadoop installation directory. Moreover, we strongly advise users to create the HDFS directories /tmp and /user/hive/warehouse (aka hive.metastore.warehouse.dir) and make them group-writable (chmod g+w) before tables are created in Hive.

To use the Hive command line interface (CLI), go to the Hive home directory (the one with the contents of build/dist) and execute the following command:

    $ bin/hive

Metadata is stored in an embedded Derby database whose disk storage location is determined by the Hive configuration variable named javax.jdo.option.ConnectionURL. By default (see conf/hive-default.xml), this location is ./metastore_db.

Using Derby in embedded mode allows at most one user at a time. To configure Derby to run in server mode, look at HiveDerbyServerMode.

Configuring Hive

A number of configuration variables in Hive can be used by the administrator to change the behavior for their installations and user sessions. These variables can be configured in any of the following ways, shown in the order of preference:

- Using the set command in the CLI to set session-level values for the configuration variable for all statements subsequent to the set command. For example,

      set hive.exec.scratchdir=/tmp/mydir;

  sets the scratch directory (which is used by Hive to store temporary output and plans) to /tmp/mydir for all subsequent queries.
- Using the -hiveconf option on the CLI for the entire session. For example:

      bin/hive -hiveconf hive.exec.scratchdir=/tmp/mydir

- In hive-site.xml. This is used for setting values for the entire Hive configuration. For example:

      <property>
        <name>hive.exec.scratchdir</name>
        <value>/tmp/mydir</value>
        <description>Scratch space for Hive jobs</description>
      </property>
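The order of preference can be pictured as a stack of lookups in which the first source that defines a variable wins. The Python sketch below models that with `collections.ChainMap`; it is only an illustration of the precedence described above, not how Hive is implemented, and the sample values other than /tmp/mydir are made up for the example.

```python
from collections import ChainMap

def effective_config(session_set, hiveconf_args, hive_site, defaults):
    """Layer the sources so that earlier arguments win over later ones."""
    return ChainMap(session_set, hiveconf_args, hive_site, defaults)

conf = effective_config(
    session_set={},                                         # from "set ...;" in the CLI
    hiveconf_args={"hive.exec.scratchdir": "/tmp/mydir"},   # from -hiveconf
    hive_site={"hive.exec.scratchdir": "/tmp/site",         # from hive-site.xml
               "hive.map.aggr": "false"},
    defaults={"hive.exec.scratchdir": "/tmp/<user.name>/hive",
              "hive.map.aggr": "true"},
)
print(conf["hive.exec.scratchdir"])  # /tmp/mydir
print(conf["hive.map.aggr"])         # false
```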

hive-default.xml.template contains the default values for the various configuration variables that come prepackaged in a Hive distribution. To override any of the values, create hive-site.xml instead and set the value in that file as shown above. Please note that the template file is not used by Hive at all (as of Hive 0.9.0), so it might be out of date or out of sync with the actual values. The canonical list of configuration options is now only managed in the HiveConf java class.

hive-default.xml.template is located in the conf directory in your installation root. hive-site.xml should also be created in the same directory.

Broadly the configuration variables are categorized into:
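For installations that generate their configuration, a hive-site.xml with overrides can be written with a few lines of Python. This is a minimal sketch: the helper name `hive_site_xml` is hypothetical, and wrapping the properties in a `<configuration>` root element follows the usual Hadoop-style configuration file layout, which the fragment shown earlier omits.

```python
import xml.etree.ElementTree as ET

def hive_site_xml(properties):
    """Render a {name: (value, description)} mapping as hive-site.xml text."""
    root = ET.Element("configuration")
    for name, (value, description) in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        if description:
            ET.SubElement(prop, "description").text = description
    return ET.tostring(root, encoding="unicode")

xml_text = hive_site_xml({
    "hive.exec.scratchdir": ("/tmp/mydir", "Scratch space for Hive jobs"),
})
print(xml_text)
```

Using an XML library rather than string concatenation means characters such as & or < in a value are escaped correctly.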

Hive Configuration Variables


| Variable Name | Description | Default Value |
|---|---|---|
| hive.ddl.output.format | The data format to use for DDL output (e.g. DESCRIBE table). One of "text" (for human readable text) or "json" (for a json object). (as of Hive 0.9.0) | text |
| hive.exec.script.wrapper | Wrapper around any invocations to script operator, e.g. if this is set to python, the script passed to the script operator will be invoked as python <script command>. If the value is null or not set, the script is invoked as <script command>. | null |
| hive.exec.plan | | null |
| hive.exec.scratchdir | This directory is used by Hive to store the plans for different map/reduce stages for the query as well as to store the intermediate outputs of these stages. | /tmp/<user.name>/hive |
| hive.exec.submitviachild | Determines whether the map/reduce jobs should be submitted through a separate jvm in the non local mode. | false - By default jobs are submitted through the same jvm as the compiler |
| hive.exec.script.maxerrsize | Maximum number of serialization errors allowed in a user script invoked through TRANSFORM or MAP or REDUCE constructs. | 100000 |
| hive.exec.compress.output | Determines whether the output of the final map/reduce job in a query is compressed or not. | false |
| hive.exec.compress.intermediate | Determines whether the output of the intermediate map/reduce jobs in a query is compressed or not. | false |
| hive.jar.path | The location of hive_cli.jar that is used when submitting jobs in a separate jvm. | |
| hive.aux.jars.path | The location of the plugin jars that contain implementations of user defined functions and serdes. | |
| hive.partition.pruning | A strict value for this variable indicates that an error is thrown by the compiler in case no partition predicate is provided on a partitioned table. This is used to protect against a user inadvertently issuing a query against all the partitions of the table. | nonstrict |
| hive.map.aggr | Determines whether the map side aggregation is on or not. | true |
| hive.join.emit.interval | | 1000 |
| hive.map.aggr.hash.percentmemory | | (float)0.5 |
| hive.default.fileformat | Default file format for CREATE TABLE statement. Options are TextFile, SequenceFile and RCFile. | TextFile |
| hive.merge.mapfiles | Merge small files at the end of a map-only job. | true |
| hive.merge.mapredfiles | Merge small files at the end of a map-reduce job. | false |
| hive.merge.size.per.task | Size of merged files at the end of the job. | 256000000 |
| hive.merge.smallfiles.avgsize | When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for map-reduce jobs if hive.merge.mapredfiles is true. | 16000000 |
| hive.querylog.enable.plan.progress | Whether to log the plan's progress every time a job's progress is checked. These logs are written to the location specified by hive.querylog.location. (as of Hive 0.10) | true |
| hive.querylog.location | Directory where structured Hive query logs are created. One file per session is created in this directory. If this variable is set to an empty string, the structured log will not be created. | /tmp/<user.name> |
| hive.querylog.plan.progress.interval | The interval to wait between logging the plan's progress, in milliseconds. If there is a whole number percentage change in the progress of the mappers or the reducers, the progress is logged regardless of this value. The actual interval will be the ceiling of (this value divided by the value of hive.exec.counters.pull.interval) multiplied by the value of hive.exec.counters.pull.interval, i.e. if it is not evenly divisible by the value of hive.exec.counters.pull.interval it will be logged less frequently than specified. This only has an effect if hive.querylog.enable.plan.progress is set to true. (as of Hive 0.10) | 60000 |
| hive.stats.autogather | A flag to gather statistics automatically during the INSERT OVERWRITE command. (as of Hive 0.7.0) | true |
| hive.stats.dbclass | The default database that stores temporary Hive statistics. Valid values are hbase and jdbc, while jdbc should have a specification of the database to use, separated by a colon (e.g. jdbc:mysql). (as of Hive 0.7.0) | jdbc:derby |
| hive.stats.dbconnectionstring | The default connection string for the database that stores temporary Hive statistics. (as of Hive 0.7.0) | jdbc:derby:;databaseName=TempStatsStore;create=true |
| hive.stats.jdbcdriver | The JDBC driver for the database that stores temporary Hive statistics. (as of Hive 0.7.0) | org.apache.derby.jdbc.EmbeddedDriver |
| hive.stats.reliable | Whether queries will fail because stats cannot be collected completely accurately. If this is set to true, reading/writing from/into a partition may fail because the stats could not be computed accurately. (as of Hive 0.10.0) | false |
| hive.enforce.bucketing | If enabled, enforces inserts into bucketed tables to also be bucketed. | false |
| hive.variable.substitute | Substitutes variables in Hive statements which were previously set using the set command, system variables or environment variables. See HIVE-1096 for details. (as of Hive 0.7.0) | true |
| hive.variable.substitute.depth | The maximum replacements the substitution engine will do. (as of Hive 0.10.0) | 40 |
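The rounding rule described for hive.querylog.plan.progress.interval is easier to see with numbers. The short Python sketch below computes the effective interval as the smallest multiple of the pull interval that is not below the configured value; the function name is hypothetical, and the pull interval of 1000 ms used in the calls is only an illustrative value.

```python
def effective_log_interval(progress_interval_ms, pull_interval_ms):
    """Smallest multiple of pull_interval_ms that is >= progress_interval_ms."""
    # Integer ceiling division: -(-a // b) equals ceil(a / b) for positive b.
    return -(-progress_interval_ms // pull_interval_ms) * pull_interval_ms

print(effective_log_interval(60000, 1000))  # 60000
print(effective_log_interval(2500, 1000))   # 3000
```

In the second call the configured 2500 ms is not a multiple of the pull interval, so progress would be logged every 3000 ms, that is, less frequently than specified.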

Hive Metastore Configuration Variables


Please see the Admin Manual's section on the Metastore for details.

Hive Configuration Variables used to interact with Hadoop


| Variable Name | Description | Default Value |
|---|---|---|
| hadoop.bin.path | The location of the hadoop script which is used to submit jobs to hadoop when submitting through a separate jvm. | $HADOOP_HOME/bin/hadoop |
| hadoop.config.dir | The location of the configuration directory of the hadoop installation. | $HADOOP_HOME/conf |
| fs.default.name | | file:/// |
| map.input.file | | null |
| mapred.job.tracker | The url to the jobtracker. If this is set to local then map/reduce is run in the local mode. | local |
| mapred.reduce.tasks | The number of reducers for each map/reduce stage in the query plan. | |
| mapred.job.name | The name of the map/reduce job. | null |

Hive Variables used to pass run time information


| Variable Name | Description | Default Value |
|---|---|---|
| hive.session.id | The id of the Hive Session. | |
| hive.query.string | The query string passed to the map/reduce job. | |
| hive.query.planid | The id of the plan for the map/reduce stage. | |
| hive.jobname.length | The maximum length of the jobname. | 50 |
| hive.table.name | The name of the Hive table. This is passed to the user scripts through the script operator. | |
| hive.partition.name | The name of the Hive partition. This is passed to the user scripts through the script operator. | |
| hive.alias | The alias being processed. This is also passed to the user scripts through the script operator. | |

Temporary Folders
Hive uses temporary folders both on the machine running the Hive client and on the default HDFS instance. These folders are used to store per-query temporary/intermediate data sets and are normally cleaned up by the Hive client when the query is finished. However, in cases of abnormal Hive client termination, some data may be left behind. The configuration details are as follows:

- On the HDFS cluster this is set to /tmp/hive-<username> by default and is controlled by the configuration variable hive.exec.scratchdir.
- On the client machine, this is hardcoded to /tmp/<username>.

Note that when writing data to a table/partition, Hive will first write to a temporary location on the target table's filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table. This applies in all cases - whether tables are stored in HDFS (normal case) or in file systems like S3 or even NFS.

Log Files
The Hive client produces logs and history files on the client machine. Please see Error Logs for configuration details.

Introduction
All the metadata for Hive tables and partitions is stored in the Hive Metastore. Metadata is persisted using the JPOX ORM solution, so any store that it supports can be used. Most of the commercial relational databases and many open source datastores are supported. Any datastore that has a JDBC driver can probably be used. You can find an E/R diagram for the metastore here.

There are three different ways to set up a metastore server using different Hive configurations. The relevant configuration parameters are:

| Config Param | Description |
|---|---|
| javax.jdo.option.ConnectionURL | JDBC connection string for the data store which contains metadata |
| javax.jdo.option.ConnectionDriverName | JDBC Driver class name for the data store which contains metadata |
| hive.metastore.uris | Hive connects to this URI to make metadata requests for a remote metastore |
| hive.metastore.local | local or remote metastore (Removed as of Hive 0.10: If hive.metastore.uris is empty local mode is assumed, remote otherwise) |
| hive.metastore.warehouse.dir | URI of the default location for native tables |

These variables were carried over from old documentation without a guarantee that they all still exist:

| Variable Name | Description | Default Value |
|---|---|---|
| hive.metastore.metadb.dir | | |
| hive.metastore.usefilestore | | |
| hive.metastore.rawstore.impl | | |
| org.jpox.autoCreateSchema | Creates necessary schema on startup if one doesn't exist (e.g. tables, columns...). Set to false after creating it once. | |
| org.jpox.fixedDatastore | Whether the datastore schema is fixed. | |
| hive.metastore.checkForDefaultDb | | |
| hive.metastore.ds.connection.url.hook | Name of the hook to use for retrieving the JDO connection URL. If empty, the value in javax.jdo.option.ConnectionURL is used as the connection URL. | |
| hive.metastore.ds.retry.attempts | The number of times to retry a call to the backing datastore if there were a connection error. | 1 |
| hive.metastore.ds.retry.interval | The number of milliseconds between datastore retry attempts. | 1000 |
| hive.metastore.server.min.threads | Minimum number of worker threads in the Thrift server's pool. | 200 |
| hive.metastore.server.max.threads | Maximum number of worker threads in the Thrift server's pool. | 10000 |

Default configuration sets up an embedded metastore which is used in unit tests and is described in the next section. More practical options are described in the subsequent sections.
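The rule noted for Hive 0.10 and later, where an empty hive.metastore.uris means a local metastore and a non-empty one means a remote metastore, can be written as a small check. The Python sketch below works on a plain dictionary of settings; the function name `metastore_mode` is hypothetical, and the host and port in the usage lines are placeholders.

```python
def metastore_mode(conf):
    """Return "remote" when hive.metastore.uris is set, otherwise "local"."""
    uris = (conf.get("hive.metastore.uris") or "").strip()
    return "remote" if uris else "local"

print(metastore_mode({}))
print(metastore_mode({"hive.metastore.uris": "thrift://metastore-host:9083"}))
```

Note that "local" here covers both the embedded Derby setup and a local metastore backed by an external database; the two differ only in the JDBC connection settings.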

Embedded Metastore
Mainly used for unit tests; only one process can connect to the metastore at a time. So it is not really a practical solution, but it works well for unit tests.

| Config Param | Config Value | Comment |
|---|---|---|
| javax.jdo.option.ConnectionURL | jdbc:derby:;databaseName=../build/test/junit_metastore_db;create=true | derby database located at hive/trunk/build... |
| javax.jdo.option.ConnectionDriverName | org.apache.derby.jdbc.EmbeddedDriver | Derby embedded JDBC driver class |
| hive.metastore.uris | not needed since this is a local metastore | |
| hive.metastore.local | true | embedded is local |
| hive.metastore.warehouse.dir | file://${user.dir}/../build/ql/test/data/warehouse | unit test data goes in here on your local filesystem |

If you want to run the metastore as a network server so it can be accessed from multiple nodes try HiveDerbyServerMode.

Local Metastore
In a local metastore setup, each Hive client will open a connection to the datastore and make SQL queries against it. The following config will set up a metastore in a MySQL server. Make sure that the server is accessible from the machines where Hive queries are executed, since this is a local store. Also make sure the JDBC client library is in the classpath of the Hive client.

| Config Param | Config Value | Comment |
|---|---|---|
| javax.jdo.option.ConnectionURL | jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true | metadata is stored in a MySQL server |
| javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver | MySQL JDBC driver class |
| javax.jdo.option.ConnectionUserName | <user name> | user name for connecting to mysql server |
| javax.jdo.option.ConnectionPassword | <password> | password for connecting to mysql server |
| hive.metastore.uris | not needed because this is local store | |
| hive.metastore.local | true | this is local store |
| hive.metastore.warehouse.dir | <base hdfs path> | default location for Hive tables. |

Remote Metastore
In a remote metastore setup, all Hive clients will make a connection to a metastore server, which in turn queries the datastore (MySQL in this example) for metadata. The metastore server and client communicate using the Thrift protocol. Starting with Hive 0.5.0, you can start a Thrift server by executing the following command:

    hive --service metastore

In versions of Hive earlier than 0.5.0, it's instead necessary to run the Thrift server via direct execution of Java:

    $JAVA_HOME/bin/java -Xmx1024m -Dlog4j.configuration=file://$HIVE_HOME/conf/hms-log4j.properties -Djava.library.path=$HADOOP_HOME/lib/native/Linux-amd64-64/ -cp $CLASSPATH org.apache.hadoop.hive.metastore.HiveMetaStore

If you execute Java directly, then JAVA_HOME, HIVE_HOME and HADOOP_HOME must be correctly set; CLASSPATH should contain Hadoop, Hive (lib and auxlib), and Java jars.

Server Configuration Parameters

| Config Param | Config Value | Comment |
|---|---|---|
| javax.jdo.option.ConnectionURL | jdbc:mysql://<host name>/<database name>?createDatabaseIfNotExist=true | metadata is stored in a MySQL server |
| javax.jdo.option.ConnectionDriverName | com.mysql.jdbc.Driver | MySQL JDBC driver class |
| javax.jdo.option.ConnectionUserName | <user name> | user name for connecting to mysql server |
| javax.jdo.option.ConnectionPassword | <password> | password for connecting to mysql server |
| hive.metastore.warehouse.dir | <base hdfs path> | default location for Hive tables. |

Client Configuration Parameters

| Config Param | Config Value | Comment |
|---|---|---|
| hive.metastore.uris | thrift://<host_name>:<port> | host and port for the thrift metastore server |
| hive.metastore.local | false | this is not a local store; the metastore is remote |
| hive.metastore.warehouse.dir | <base hdfs path> | default location for Hive tables. |

If you are using MySQL as the datastore for metadata, put MySQL client libraries in HIVE_HOME/lib before starting Hive Client or HiveMetastore Server.

Hive Web Interface


What is the Hive Web Interface
The Hive web interface is an alternative to using the Hive command line interface. Using the web interface is a great way to get started with Hive.

Features
Schema Browsing
An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web based schema browser. The Hive Meta Data is presented in a hierarchical manner allowing you to start at the database level and click to get information about tables including the SerDe, column names, and column types.

Detached query execution


A power user issuing multiple hive queries simultaneously would have multiple CLI windows open. The hive web interface manages the session on the web server, not from inside the CLI window. This allows a user to start multiple queries and return to the web interface later to check the status.

No local installation
Any user with a web browser can work with Hive. This has the usual web interface benefits. In particular, a user wishing to interact with hadoop or hive directly requires access to many ports, whereas a remote or VPN user would only require access to the Hive web interface, running by default on 0.0.0.0 tcp/9999.

Configuration
Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk you already have it. You should not need to edit the defaults for the Hive web interface. HWI uses:

    <property>
      <name>hive.hwi.listen.host</name>
      <value>0.0.0.0</value>
      <description>This is the host address the Hive Web Interface will listen on</description>
    </property>
    <property>
      <name>hive.hwi.listen.port</name>
      <value>9999</value>
      <description>This is the port the Hive Web Interface will listen on</description>
    </property>
    <property>
      <name>hive.hwi.war.file</name>
      <value>${HIVE_HOME}/lib/hive_hwi.war</value>
      <description>This is the WAR file with the jsp content for Hive Web Interface</description>
    </property>

You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.

Start up
When Hive is started with no arguments, the CLI is invoked. Hive has an extension architecture used to start other Hive daemons. Jetty requires Apache Ant to start HWI. You should define ANT_LIB as an environment variable or add that to the hive invocation.

    export ANT_LIB=/opt/ant/lib
    bin/hive --service hwi

Java has no direct way of daemonizing. In a production environment you should create a wrapper script.

    nohup bin/hive --service hwi > /dev/null 2> /dev/null &

If you want help on the service invocation or a list of parameters, you can add --help:

    bin/hive --service hwi --help
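A wrapper does not have to be a shell script. The Python sketch below starts the web interface in the background and discards its output, mirroring the nohup invocation; the helper names are hypothetical, the paths passed in are whatever applies to your installation, and `start_new_session` detaches the process on POSIX systems only.

```python
import os
import subprocess

def hwi_command(hive_home):
    """Command line for starting the web interface, as used in this section."""
    return [os.path.join(hive_home, "bin", "hive"), "--service", "hwi"]

def start_hwi(hive_home, ant_lib):
    """Start HWI in the background, discarding its output."""
    env = dict(os.environ, ANT_LIB=ant_lib)  # HWI needs ANT_LIB to be defined
    return subprocess.Popen(
        hwi_command(hive_home),
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the controlling terminal (POSIX only)
    )

print(hwi_command("/opt/hive"))
```

Calling `start_hwi("/opt/hive", "/opt/ant/lib")` returns a `Popen` object whose `pid` can be written to a file so that the service can be stopped later.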

Authentication
Hadoop currently uses environmental properties to determine user name and group vector. Thus Hive and the Hive Web Interface cannot enforce more stringent security than Hadoop can. When you first connect to the Hive Web Interface, you are prompted for a user name and groups. This feature was added to support installations using different schedulers. If you want to tighten up security, you are going to need to patch the source of the Hive Session Manager, or you may be able to tweak the JSP to accomplish this.

Accessing
In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi on your web browser.

Tips and tricks


Result file
The result file is local to the web server. A query that produces massive output should set the result file to /dev/null.

Debug Mode
Debug mode is used when the user wants the result file to contain not only the result of the Hive query but also the other messages.

Set Processor
In the CLI a command like 'SET x=5' is not processed by the Query Processor; it is processed by the Set Processor. Use the form 'x=5', not 'set x=5'.

Walk through

Authorize
[Screenshots not available: 1_hwi_authorize.png, 2_hwi_authorize.png]

Schema Browser
[Screenshots not available: 3_schema_table.png, 4_schema_browser.png]

Diagnostics
[Screenshot not available: 5_diagnostic.png]

Running a query
[Screenshots not available: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png]

Setting Up Hive Server


- Setting up HiveServer2
- Setting Up Thrift Hive Server
- Setting Up Hive JDBC Server
- Setting Up Hive ODBC Server


Hive and Amazon Web Services

Background
This document explores the different ways of leveraging Hive on Amazon Web Services, namely S3, EC2 and Elastic Map-Reduce. Hadoop already has a long tradition of being run on EC2 and S3. These are well documented in the links below, which are a must-read:

- Hadoop and S3
- Amazon and EC2

The second document also has pointers on how to get started using EC2 and S3. For people who are new to S3, there are a few helpful notes in the S3 for n00bs section below. The rest of the documentation below assumes that the reader can launch a hadoop cluster in EC2, copy files into and out of S3 and run some simple Hadoop jobs.

Introduction to Hive and AWS


There are three separate questions to consider when running Hive on AWS:

1. Where to run the Hive CLI from and store the metastore db (that contains table and schema definitions).
2. How to define Hive tables over existing datasets (potentially those that are already in S3).
3. How to dispatch Hive queries (which are all executed using one or more map-reduce programs) to a Hadoop cluster running in EC2.

We walk you through the choices involved here and show some practical case studies that contain detailed setup and configuration instructions.

Running the Hive CLI


The CLI takes in Hive queries, compiles them into a plan (commonly, but not always, consisting of map-reduce jobs), and then submits them to a Hadoop cluster. Although it depends on Hadoop libraries for this purpose, it is otherwise relatively independent of the Hadoop cluster itself. For this reason the CLI can be run from any node that has a Hive distribution, a Hadoop distribution, and a Java Runtime Environment. It can submit jobs to any compatible Hadoop cluster (one whose version matches that of the Hadoop libraries that Hive is using) that it can connect to. The Hive CLI also needs to access table metadata. By default Hive persists this through an embedded Derby database in a folder named metastore_db on the local file system (however, state can be persisted in any database, including remote MySQL instances).

There are two choices on where to run the Hive CLI from:

1. Run Hive CLI from within EC2, the Hadoop master node being the obvious choice. There are several problems with this approach:
 * Lack of comprehensive AMIs that bundle different versions of Hive and Hadoop distributions (and the difficulty in doing so considering the large number of such combinations). Cloudera provides some AMIs that bundle Hive with Hadoop, although the choice in terms of Hive and Hadoop versions may be restricted.
 * Any required map-reduce scripts may also need to be copied to the master/Hive node.
 * If the default Derby database is used, then one has to think about persisting state beyond the lifetime of one Hadoop cluster. S3 is an obvious choice, but the user must restore and back up Hive metadata at the launch and termination of the Hadoop cluster.
2. Run Hive CLI remotely from outside EC2. In this case, the user installs a Hive distribution on a personal workstation. The main trick with this option is connecting to the Hadoop cluster, both for submitting jobs and for reading and writing files to HDFS. The section on Running jobs from a remote machine details how this can be done. [Case Study 1] goes into the setup for this in more detail. This option solves the problems mentioned above:
 * Stock Hadoop AMIs can be used. The user can run any version of Hive on their workstation, launch a Hadoop cluster with the desired Hadoop version etc. on EC2, and start running queries.
 * Map-reduce scripts are automatically pushed by Hive into Hadoop's distributed cache at job submission time and do not need to be copied to the Hadoop machines.
 * Hive metadata can be stored on local disk painlessly.

However, the one downside of Option 2 is that jar files are copied over to the Hadoop cluster for each map-reduce job. This can cause high latency in job submission and incur some AWS network transmission costs. Option 1 seems suitable for advanced users who have figured out a stable Hadoop and Hive (and potentially external libraries) configuration that works for them and can create a new AMI with that configuration.

Loading Data into Hive Tables


It is useful to go over the main storage choices for a Hadoop/EC2 environment:

 * S3 is an excellent place to store data for the long term. There are a couple of choices on how S3 can be used:
   * Data can be stored as files within S3 using tools like aws and s3curl, as detailed in the S3 for n00bs section. This is subject to the 5 GB limit on file size in S3. The nice thing is that there are probably scores of tools that can help in copying/replicating data to S3 in this manner. Hadoop is able to read/write such files using the S3N filesystem.
   * Alternatively, Hadoop provides a block-based file system using S3 as a backing store. This does not suffer from the 5 GB maximum file size restriction. However, Hadoop utilities and libraries must be used for reading/writing such files.
 * An HDFS instance on the local drives of the machines in the Hadoop cluster. Its lifetime is restricted to that of the Hadoop instance, hence it is not suitable for long-lived data. However, it should provide much faster access to data and hence is a good choice for intermediate/tmp data.

Considering these factors, the following makes sense in terms of Hive tables:

1. For long-lived tables, use S3-based storage mechanisms.
2. For intermediate data and tmp tables, use HDFS.

[Case Study 1] shows you how to achieve such an arrangement using the S3N filesystem. If the user is running Hive CLI from their personal workstation, they can also use Hive's 'load data local' commands as a convenient alternative (to dfs commands) to copy data from their local filesystems (accessible from their workstation) into tables defined over either HDFS or S3.
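The arrangement above can be sketched as a small helper that picks a table location from the table's expected lifetime and renders matching HiveQL. This is only an illustration under assumptions: the bucket name, the HDFS scratch path, the table names, and the columns are all hypothetical, and the generated statements are ordinary CREATE EXTERNAL TABLE and LOAD DATA LOCAL statements rather than the exact setup from [Case Study 1].

```python
# Hypothetical locations; substitute your own bucket and HDFS scratch directory.
S3_ROOT = "s3n://example-bucket/warehouse"
HDFS_ROOT = "hdfs:///tmp/hive-scratch"


def table_location(table, long_lived):
    """Long-lived tables go to S3; intermediate/tmp tables go to HDFS."""
    root = S3_ROOT if long_lived else HDFS_ROOT
    return f"{root}/{table}/"


def create_table_stmt(table, columns, long_lived):
    """Render a CREATE EXTERNAL TABLE statement for tab-delimited text data."""
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        f"LOCATION '{table_location(table, long_lived)}';"
    )


def load_local_stmt(local_path, table):
    """Render a 'load data local' statement that copies a workstation file into a table."""
    return f"LOAD DATA LOCAL INPATH '{local_path}' INTO TABLE {table};"


if __name__ == "__main__":
    columns = [("ts", "STRING"), ("url", "STRING")]
    print(create_table_stmt("weblogs", columns, long_lived=True))
    print(create_table_stmt("weblogs_tmp", columns, long_lived=False))
    print(load_local_stmt("/home/me/weblogs.tsv", "weblogs"))
```

The printed statements can be pasted into a Hive CLI session; only the LOCATION clause differs between the long-lived and the temporary table.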

Submitting jobs to a Hadoop cluster


This applies particularly when Hive CLI is run remotely. A single Hive CLI session can switch across different Hadoop clusters (especially as clusters are brought up and terminated). Only two configuration variables need to be changed to point the CLI from one Hadoop cluster to another:

 * fs.default.name
 * mapred.job.tracker

Beware, though, that tables stored in the previous HDFS instance will not be accessible as the CLI switches from one cluster to another. Again, more details can be found in [Case Study 1].
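As a rough sketch of what re-pointing the CLI involves, the helper below builds the two settings for a given cluster and renders them either as -hiveconf arguments for a new CLI invocation or as SET statements for a running session. The host name and the NameNode/JobTracker port numbers are assumptions made for illustration; use whatever your cluster actually exposes.

```python
def cluster_overrides(master_host, namenode_port=9000, jobtracker_port=9001):
    """Return the two settings that point the CLI at a given Hadoop cluster.

    The default port numbers are placeholders, not values taken from any AMI.
    """
    return {
        "fs.default.name": f"hdfs://{master_host}:{namenode_port}",
        "mapred.job.tracker": f"{master_host}:{jobtracker_port}",
    }


def hive_cli_args(master_host, **ports):
    """Build an argument list that starts the Hive CLI against one cluster."""
    args = ["hive"]
    for key, value in cluster_overrides(master_host, **ports).items():
        args += ["-hiveconf", f"{key}={value}"]
    return args


def set_statements(master_host, **ports):
    """Render SET statements for re-pointing an already running CLI session."""
    overrides = cluster_overrides(master_host, **ports)
    return [f"SET {key}={value};" for key, value in overrides.items()]


if __name__ == "__main__":
    # Hypothetical EC2 public DNS name of the Hadoop master.
    host = "ec2-master.example.com"
    print(" ".join(hive_cli_args(host)))
    print("\n".join(set_statements(host)))
```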

Case Studies
1. [Querying files in S3 using EC2, Hive and Hadoop]

Appendix

S3 for n00bs
It is useful to understand how S3 is normally used as a file system. Each S3 bucket can be considered the root of a file system. Files within this file system become objects stored in S3, where the path name of the file (path components joined with '/') becomes the S3 key within the bucket and the file contents become the value. Tools like S3Fox (https://addons.mozilla.org/en-US/firefox/addon/3247) and the native S3 FileSystem in Hadoop (s3n) show a directory structure that is implied by the common prefixes found in the keys. Not all tools are able to create an empty directory. S3Fox can (by creating an empty key representing the directory). Other popular tools like aws, s3cmd, and s3curl provide convenient ways of accessing S3 from the command line, but do not have the capability of creating empty directories.
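The way a directory structure is implied by common key prefixes can be illustrated with a short, self-contained sketch. It does not talk to S3; it only works on a list of key strings. The key names are made up, and the representation of an empty directory as a key ending in '/' is an assumption for this example (tools differ in the marker they use).

```python
def list_directory(keys, directory=""):
    """List the immediate children of `directory` implied by S3 key prefixes.

    Returns a tuple (subdirectories, files), each sorted by name.
    """
    prefix = directory.strip("/")
    prefix = prefix + "/" if prefix else ""
    subdirs, files = set(), set()
    for key in keys:
        # Skip keys outside this directory and the directory's own marker key.
        if not key.startswith(prefix) or key == prefix:
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition("/")
        if sep:
            subdirs.add(head)  # a deeper key, or an empty-directory marker
        else:
            files.add(head)
    return sorted(subdirs), sorted(files)


if __name__ == "__main__":
    keys = [
        "logs/2009/01/a.gz",
        "logs/2009/02/b.gz",
        "logs/readme.txt",
        "logs/empty/",  # assumed marker for an empty directory
        "tmp/x",
    ]
    print(list_directory(keys))
    print(list_directory(keys, "logs"))
    print(list_directory(keys, "logs/empty"))
```

Note that "logs/2009" appears as a directory even though no key with exactly that name exists, while "logs/empty" is visible only because of its marker key.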

Amazon Elastic MapReduce and Hive


Amazon Elastic MapReduce is a web service that makes it easy to launch managed, resizable Hadoop clusters on the web-scale infrastructure of Amazon Web Services (AWS). Elastic MapReduce makes it easy for you to launch a Hive and Hadoop cluster, provides you with flexibility to choose different cluster sizes, and allows you to tear them down automatically when processing has completed. You pay only for the resources that you use, with no minimums or long-term commitments.

Amazon Elastic MapReduce simplifies the use of Hive clusters by:

1. Handling the provisioning of Hadoop clusters of up to thousands of EC2 instances
2. Installing Hadoop across the master and slave nodes of your cluster and configuring Hadoop based on your chosen hardware
3. Installing Hive on the master node of your cluster and configuring it for communication with the Hadoop JobTracker and NameNode
4. Providing a simple API, a web UI, and purpose-built tools for managing, monitoring, and debugging Hadoop tasks throughout the life of the cluster
5. Providing deep integration, and optimized performance, with AWS services such as S3 and EC2 and AWS features such as Spot Instances, Elastic IPs, and Identity and Access Management (IAM)

Please refer to the following link to view the Amazon Elastic MapReduce Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Amazon Elastic MapReduce provides you with multiple clients to run your Hive cluster. You can launch a Hive cluster using the AWS Management Console, the Amazon Elastic MapReduce Ruby Client, or the AWS Java SDK. You may also install and run multiple versions of Hive on the same cluster, allowing you to benchmark a newer Hive version alongside your previous version. You can also install a newer Hive version directly onto an existing Hive cluster.

Supported versions:
Hadoop Version    Hive Version
0.18              0.4
0.20              0.5, 0.7, 0.7.1

Hive Defaults
Thrift Communication port
Hive Version    Thrift port
0.4             10000
0.5             10000
0.7             10001
0.7.1           10002

Log File
Hive Version    Log location
0.4             /mnt/var/log/apps/hive.log
0.5             /mnt/var/log/apps/hive_05.log
0.7             /mnt/var/log/apps/hive_07.log
0.7.1           /mnt/var/log/apps/hive_07_1.log

MetaStore
By default, Amazon Elastic MapReduce uses MySQL, preinstalled on the master node, for its Hive metastore. Alternatively, you can use the Amazon Relational Database Service (Amazon RDS) to ensure the metastore is persisted beyond the life of your cluster. This also allows you to share the metastore between multiple Hive clusters. Simply override the default location of the MySQL database so that it points to the external persistent storage location.

Hive CLI
EMR configures the master node to allow SSH access. You can log onto the master node and execute Hive commands using the Hive CLI. If you have multiple versions of Hive installed on the cluster, you can access each one of them via a separate command:

Hive Version    Hive command
0.4             hive
0.5             hive-0.5
0.7             hive-0.7
0.7.1           hive-0.7.1

EMR sets up a separate Hive metastore and Hive warehouse for each installed Hive version on a given cluster. Hence, creating tables using one version does not interfere with the tables created using another installed version. Please note that if you point multiple Hive tables to the same location, updates to one table become visible to the other tables.

Hive Server

EMR runs a Thrift Hive server on the master node of the Hive cluster. It can be accessed using any JDBC client (for example, SQuirreL SQL) via Hive JDBC drivers. The JDBC drivers for different Hive versions can be downloaded via the following links:

Hive Version    Hive JDBC
0.5             http://aws.amazon.com/developertools/0196055244487017
0.7             http://aws.amazon.com/developertools/1818074809286277
0.7.1           http://aws.amazon.com/developertools/8084613472207189

Here is the process to connect to the Hive Server using a JDBC driver: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Hive.html#HiveJDBCDriver
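When several Hive versions are installed side by side, it can help to keep the per-version defaults listed in this section in one place. The sketch below gathers the CLI command, Thrift port, and log file for each version and derives a JDBC connection URL from the port. The table values are copied from this section; the URL form jdbc:hive://host:port/database and the use of localhost (for example, through an SSH tunnel to the master node) are assumptions, so check the linked developer guide for the exact connection details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HiveInstall:
    command: str      # CLI command on the master node
    thrift_port: int  # port of the Thrift Hive server
    log_file: str     # Hive application log


# Values copied from the tables in this section.
EMR_HIVE = {
    "0.4": HiveInstall("hive", 10000, "/mnt/var/log/apps/hive.log"),
    "0.5": HiveInstall("hive-0.5", 10000, "/mnt/var/log/apps/hive_05.log"),
    "0.7": HiveInstall("hive-0.7", 10001, "/mnt/var/log/apps/hive_07.log"),
    "0.7.1": HiveInstall("hive-0.7.1", 10002, "/mnt/var/log/apps/hive_07_1.log"),
}


def jdbc_url(version, host="localhost", database="default"):
    """Build a JDBC URL for one Hive version (URL form is an assumption)."""
    port = EMR_HIVE[version].thrift_port
    return f"jdbc:hive://{host}:{port}/{database}"


if __name__ == "__main__":
    for version, install in EMR_HIVE.items():
        print(version, install.command, jdbc_url(version), install.log_file)
```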

Running Batch Queries


You can also submit queries from the command line client remotely. Please note that currently there is a limit of 256 steps on each cluster. If you have more than 256 steps to execute, it is recommended that you run the queries directly using the Hive CLI or submit queries via a JDBC driver.

Hive S3 Tables
An Elastic MapReduce Hive cluster comes configured for communication with S3. You can create tables and point them to your S3 location and Hive and Hadoop will communicate with S3 automatically using your provided credentials. Once you have moved data to an S3 bucket, you simply point your table to that location in S3 in order to read or process data via Hive. You can also create partitioned tables in S3. Hive on Elastic MapReduce provides support for dynamic partitioning in S3.
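For partitioned tables in S3, each partition typically lives under its own key prefix. The sketch below maps a partition specification to a prefix in the column=value layout and renders an ALTER TABLE ... ADD PARTITION statement that registers it. The bucket, table, and column names are hypothetical, and the column=value layout is an assumption about how the data was written; a partition can point at any location.

```python
def partition_location(table_root, spec):
    """Map a partition spec to a column=value directory under the table root."""
    parts = "/".join(f"{col}={val}" for col, val in spec)
    return f"{table_root.rstrip('/')}/{parts}/"


def add_partition_stmt(table, table_root, spec):
    """Render ALTER TABLE ... ADD PARTITION for one partition stored in S3."""
    cols = ", ".join(f"{col}='{val}'" for col, val in spec)
    return (
        f"ALTER TABLE {table} ADD PARTITION ({cols}) "
        f"LOCATION '{partition_location(table_root, spec)}';"
    )


if __name__ == "__main__":
    # Hypothetical table rooted at s3n://example-bucket/weblogs/
    root = "s3n://example-bucket/weblogs/"
    for day in ("2009-01-01", "2009-01-02"):
        print(add_partition_stmt("weblogs", root, [("dt", day)]))
```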

Hive Logs
 * Hive application logs: All Hive application logs are redirected to the /mnt/var/log/apps/ directory.
 * Hadoop daemon logs: Hadoop daemon logs are available in the /mnt/var/log/hadoop/ folder. Hadoop task attempt logs are available in the /mnt/var/log/hadoop/userlogs/ folder on each slave node in the cluster.

Tutorials

The following Hive tutorials are available for you to get started with Hive on Elastic MapReduce:

1. Finding trending topics using Google Books n-grams data and Apache Hive on Elastic MapReduce: http://aws.amazon.com/articles/Elastic-MapReduce/5249664154115844
2. Contextual Advertising using Apache Hive and Amazon Elastic MapReduce with High Performance Computing instances: http://aws.amazon.com/articles/Elastic-MapReduce/2855
3. Operating a Data Warehouse with Hive, Amazon Elastic MapReduce and Amazon SimpleDB: http://aws.amazon.com/articles/Elastic-MapReduce/2854
4. Running Hive on Amazon Elastic MapReduce: http://aws.amazon.com/articles/2857

In addition, Amazon provides step-by-step video tutorials: http://aws.amazon.com/articles/2862

Support
You can ask questions related to Hive on Elastic MapReduce on the Elastic MapReduce forums at: https://forums.aws.amazon.com/forum.jspa?forumID=52

Please also refer to the EMR developer guide for more information: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/

Contributed by: Vaibhav Aggarwal