The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in
distributed storage. Built on top of Apache Hadoop™, it provides tools to enable easy data
extract/transform/load (ETL), a mechanism to impose structure on a variety of data formats, access to files
stored directly in HDFS or in other data storage systems such as Apache HBase, and query execution via
MapReduce.
Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data.
At the same time, this language also allows programmers who are familiar with the MapReduce framework to
plug in their custom mappers and reducers to perform more sophisticated analysis that may not be
supported by the built-in capabilities of the language. QL can also be extended with custom scalar functions
(UDFs), aggregations (UDAFs), and table functions (UDTFs).
Hive does not mandate that data be read or written in the "Hive format"---there is no such thing. Hive works
equally well on Thrift, control-delimited, or your own specialized data formats. Please see File Format and
SerDe in the Developer Guide for details.
Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used
for batch jobs over large sets of append-only data (like web logs). What Hive values most are scalability (scale out
with more machines added dynamically to the Hadoop cluster), extensibility (via the MapReduce framework and
UDF/UDAF/UDTFs), fault tolerance, and loose coupling with its input formats.
Hive Tutorial
Concepts
What is Hive
Hive is a data warehousing infrastructure built on top of Hadoop. Hadoop provides massive scale-out and fault-
tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on
commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It
provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with
SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows
traditional map/reduce programmers to plug in their custom mappers and reducers to do more
sophisticated analysis that may not be supported by the built-in capabilities of the language.
Hive is not designed for online transaction processing and does not offer real-time queries and row level updates.
It is best used for batch jobs over large sets of immutable data (like web logs).
In the following sections we provide a tutorial on the capabilities of the system. We start by describing the
concepts of data types, tables and partitions (which are very similar to what you would find in a traditional
relational DBMS) and then illustrate the capabilities of the QL language with the help of some examples.
Data Units
In order of granularity, Hive data is organized into:
Databases: Namespaces that separate tables and other data units from naming conflicts.
Tables: Homogeneous units of data which have the same schema. An example is the page_views table,
where each row comprises the following columns (schema):
    timestamp - of INT type, corresponding to the UNIX timestamp of when the page was viewed.
    userid - of BIGINT type, identifying the user who viewed the page.
    page_url - of STRING type, capturing the location of the page.
    referer_url - of STRING type, capturing the location of the page from which the user arrived at the
    current page.
    IP - of STRING type, capturing the IP address from which the page request was made.
Partitions: Each table can have one or more partition keys which determine how the data is stored.
Partitions - apart from being storage units - also allow the user to efficiently identify the rows that satisfy
certain criteria. For example, a date_partition of type STRING and a country_partition of type STRING.
Each unique value of the partition keys defines a partition of the table. For example, all "US" data from
"2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data
for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the
analysis significantly. Note, however, that just because a partition is named 2009-12-23 does not mean
that it contains all or only data from that date; partitions are named after dates for convenience, but it is
the user's job to guarantee the relationship between partition name and data content. Partition columns
are virtual columns; they are not part of the data itself but are derived on load.
Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a
hash function of some column of the table. For example, the page_views table may be bucketed by
userid, which is one of the columns of the table other than the partition columns. Buckets can
be used to efficiently sample the data.
Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to
prune large quantities of data during query processing, resulting in faster query execution.
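The data units above can be sketched in DDL. The following is a sketch only, following the page_views example; the exact partition names, bucket count, and storage format here are illustrative assumptions, not defaults mandated by the text:

```sql
-- Sketch: a partitioned, bucketed table for the page_views example.
-- The partition columns (dt, country) are virtual; they are not stored with the row data.
CREATE TABLE page_views(
  viewTime INT,        -- UNIX timestamp of when the page was viewed
  userid BIGINT,       -- user who viewed the page
  page_url STRING,
  referer_url STRING,
  ip STRING
)
PARTITIONED BY (dt STRING, country STRING)  -- one partition per (dt, country) value
CLUSTERED BY (userid) INTO 32 BUCKETS       -- hash(userid) determines the bucket
STORED AS SEQUENCEFILE;
```

Queries that filter on dt and country can then be answered from a single partition directory, which is the pruning the paragraph above describes.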
Type System
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:
Integers
    TINYINT - 1-byte integer
    SMALLINT - 2-byte integer
    INT - 4-byte integer
    BIGINT - 8-byte integer
Boolean type
    BOOLEAN - TRUE/FALSE
Floating point numbers
    FLOAT - single precision
    DOUBLE - double precision
String type
    STRING - sequence of characters in a specified character set
The types are organized in the following hierarchy (where the parent is a supertype of all the children instances):
Type
    Primitive Type
        Number
            DOUBLE
                BIGINT
                    INT
                        TINYINT
                FLOAT
                    INT
                        TINYINT
        STRING
    BOOLEAN
This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is
allowed from a child type to any of its ancestors. So when a query expression expects type1 and the data is of
type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Apart from this
fundamental rule for implicit conversion based on the type system, Hive also allows one special-case conversion:
STRING to DOUBLE
Explicit type conversion can be done using the cast operator as shown in the Built in functions section below.
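Both conversion paths can be sketched as follows; the table name src is an assumption for illustration, since older Hive versions require a FROM clause:

```sql
-- Implicit: a STRING holding a number is usable where DOUBLE is expected.
SELECT '3.14' + 1.0 FROM src;           -- '3.14' is implicitly converted to DOUBLE

-- Explicit: the cast operator; a value that cannot be converted yields NULL.
SELECT CAST('42' AS BIGINT) FROM src;   -- the string '42' as a BIGINT
SELECT CAST('abc' AS BIGINT) FROM src;  -- conversion fails, returns NULL
```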
Complex Types
Complex types can be built up from primitive types and other composite types using:
Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a
column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a
map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
Arrays (indexable lists): the elements in the array have to be of the same type. Elements can be
accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array
A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can
be created. For example, a type User may comprise the following fields:
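The field list for the User example is truncated in the source. A plausible sketch of such a nested type follows; the field names (gender, active, address, and so on) are assumptions for illustration:

```sql
-- Hypothetical table illustrating arbitrary nesting of complex types.
CREATE TABLE users(
  userid BIGINT,
  gender STRING,
  active BOOLEAN,
  address STRUCT<city: STRING, zip: STRING>,  -- struct nested in a row
  properties MAP<STRING, STRING>,             -- map of free-form attributes
  friends ARRAY<BIGINT>                       -- array of user ids
);
```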
Built in operators
Relational Operators - The following operators compare the passed operands and generate a TRUE or
FALSE value depending on whether the comparison between the operands holds or not.
A = B (all primitive types): TRUE if expression A is equivalent to expression B, otherwise FALSE.
A != B (all primitive types): TRUE if expression A is not equivalent to expression B, otherwise FALSE.
A < B (all primitive types): TRUE if expression A is less than expression B, otherwise FALSE.
A <= B (all primitive types): TRUE if expression A is less than or equal to expression B, otherwise FALSE.
A > B (all primitive types): TRUE if expression A is greater than expression B, otherwise FALSE.
A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
A IS NOT NULL (all types): FALSE if expression A evaluates to NULL, otherwise TRUE.
A LIKE B (strings): TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The
comparison is done character by character. The _ character in B matches any character in A (similar to . in
POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar
to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas
'foobar' LIKE 'foo___' evaluates to TRUE, and so does 'foobar' LIKE 'foo%'. To escape %, use \
(\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be
escaped: columnValue LIKE 'a\;b'.
A RLIKE B (strings): TRUE if string A matches the Java regular expression B (see the Java regular
expression syntax), otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to FALSE,
whereas 'foobar' RLIKE '^f.*r$' evaluates to TRUE.
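The two string-matching operators can be sketched in a query; the table and column names follow the page_views example and the patterns are illustrative assumptions:

```sql
-- LIKE uses SQL wildcards (_ and %); RLIKE uses Java regular expressions.
SELECT page_url
FROM page_views
WHERE page_url LIKE '%xyz.com%'                       -- substring match via SQL wildcards
   OR page_url RLIKE '^https?://.*\\.xyz\\.com/.*$';  -- anchored Java regex
```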
Arithmetic Operators - The following operators support various common arithmetic operations on the
operands. All of them return number types.
A + B (all number types): Gives the result of adding A and B. The type of the result is the common
parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float,
float is a containing type of integer, so the + operator on a float and an int results in a float.
A - B (all number types): Gives the result of subtracting B from A. The type of the result is the common
parent (in the type hierarchy) of the types of the operands.
A * B (all number types): Gives the result of multiplying A and B. The type of the result is the common
parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes an
overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): Gives the result of dividing A by B. The type of the result is the common
parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the
result is the quotient of the division.
A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is
the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the
common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the
common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the
common parent (in the type hierarchy) of the types of the operands.
~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the
type of A.
Logical Operators - The following operators provide support for creating logical expressions. All of them
return boolean TRUE or FALSE depending upon the boolean values of the operands.
A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE.
A && B (boolean): Same as A AND B.
A OR B (boolean): TRUE if either A or B or both are TRUE, otherwise FALSE.
A || B (boolean): Same as A OR B.
NOT A (boolean): TRUE if A is FALSE, otherwise FALSE.
!A (boolean): Same as NOT A.
Operators on Complex Types - The following operators provide mechanisms to access elements in
complex types.
A[n] (A is an array and n is an int): Returns the nth element in the array A. The first element has index 0.
For example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map.
For example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): Returns the x field of S. For example, for the struct foobar {int foo, int bar},
foobar.foo returns the integer stored in the foo field of the struct.
Built in functions
The following built-in functions are supported in Hive:
(Function list in source code: FunctionRegistry.java)
BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): Returns a random number (that changes from row to row). Specifying
the seed will make sure the generated random number sequence is deterministic.
string concat(string A, string B, ...): Returns the string resulting from concatenating B after A. For
example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments
and returns the concatenation of all of them.
string substr(string A, int start): Returns the substring of A starting from the start position until the end
of string A. For example, substr('foobar', 4) results in 'bar'.
string substr(string A, int start, int length): Returns the substring of A starting from the start position
with the given length. For example, substr('foobar', 4, 2) results in 'ba'.
string upper(string A): Returns the string resulting from converting all characters of A to upper case.
For example, upper('fOoBaR') results in 'FOOBAR'.
string lower(string A): Returns the string resulting from converting all characters of A to lower case.
For example, lower('fOoBaR') results in 'foobar'.
string trim(string A): Returns the string resulting from trimming spaces from both ends of A. For
example, trim(' foobar ') results in 'foobar'.
string ltrim(string A): Returns the string resulting from trimming spaces from the beginning (left-hand
side) of A. For example, ltrim(' foobar ') results in 'foobar '.
string rtrim(string A): Returns the string resulting from trimming spaces from the end (right-hand side)
of A. For example, rtrim(' foobar ') results in ' foobar'.
string regexp_replace(string A, string B, string C): Returns the string resulting from replacing all
substrings in A that match the Java regular expression B (see the Java regular expression syntax) with
C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'.
value of <type> cast(<expr> as <type>): Converts the result of the expression expr to <type>. For
example, cast('1' as BIGINT) will convert the string '1' to its integral representation. NULL is returned if
the conversion does not succeed.
string from_unixtime(int unixtime): Converts the number of seconds from the UNIX epoch (1970-01-01
00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone,
in the format "1970-01-01 00:00:00".
string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01
00:00:00") = "1970-01-01".
int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01
00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01
00:00:00") = 11, month("1970-11-01") = 11.
int day(string date): Returns the day part of a date or a timestamp string: day("1970-11-01
00:00:00") = 1, day("1970-11-01") = 1.
string get_json_object(string json_string, string path): Extracts the JSON object from a JSON string
based on the JSON path specified, and returns the JSON string of the extracted JSON object. Returns
NULL if the input JSON string is invalid.
The following built-in aggregate functions are supported:
BIGINT count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) returns the total number of
retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for
which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) returns the number of rows
for which the supplied expression(s) are unique and non-NULL.
DOUBLE sum(col), sum(DISTINCT col): Returns the sum of the elements in the group, or the sum of
the distinct values of the column in the group.
DOUBLE avg(col), avg(DISTINCT col): Returns the average of the elements in the group, or the
average of the distinct values of the column in the group.
DOUBLE min(col): Returns the minimum value of the column in the group.
DOUBLE max(col): Returns the maximum value of the column in the group.
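Several of these functions can be combined in one query. A sketch over the page_views example (the column names are assumptions carried over from the earlier schema):

```sql
-- Distinct users and total views per day (illustrative query, not from the source).
SELECT to_date(from_unixtime(viewTime)),  -- convert UNIX timestamp to a date string
       count(DISTINCT userid),            -- unique viewers that day
       count(*)                           -- total page views that day
FROM page_views
GROUP BY to_date(from_unixtime(viewTime));
```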
Language capabilities
The Hive query language provides the basic SQL-like operations. These operations work on tables or partitions,
and are described in the following sections.
Creating Tables
An example statement that would create the page_view table mentioned above would be like:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
In this example the columns of the table are specified with the corresponding types. Comments can be attached
both at the column level and at the table level. Additionally, the PARTITIONED BY clause defines the partitioning
columns, which are different from the data columns and are actually not stored with the data. When specified in
this way, the data in the files is assumed to be delimited with ASCII 001 (ctrl-A) as the field delimiter and newline
as the row delimiter.
The field delimiter can be parametrized if the data is not in the above format, as illustrated in the following
example:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
The row delimiter currently cannot be changed, since it is determined not by Hive but by Hadoop.
It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed
against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as
the query has to scan all the data. The following example illustrates the case of the page_view table that is
bucketed on the userid column:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the
data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on
the clustered column - in this case userid. The sorting property allows internal operators to take advantage of the
better-known data structure while evaluating queries with greater efficiency.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
In this example the columns that comprise the table row are specified in a similar way as the definition of types.
Comments can be attached both at the column level and at the table level. Additionally, the PARTITIONED BY
clause defines the partitioning columns, which are different from the data columns and are actually not stored with
the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets
to create. The delimited row format specifies how the rows are stored in the Hive table. In the case of the delimited
format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are
terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored
in a binary format (using Hadoop SequenceFiles) on HDFS. The values shown for the ROW FORMAT and
STORED AS clauses in the above example represent the system defaults.
SHOW TABLES;
To list existing tables in the warehouse; there are many of these, likely more than you want to browse.
SHOW PARTITIONS page_view;
To list the partitions of a table. If the table is not a partitioned table then an error is thrown.
DESCRIBE page_view;
To list the columns and column types of a table.
DESCRIBE EXTENDED page_view;
To list the columns and all other properties of a table. This prints a lot of information, and not in a pretty format
either. Usually used for debugging.
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
To list the columns and all other properties of a partition. This also prints a lot of information, which is usually
used for debugging.
Loading Data
There are multiple ways to load data into Hive tables. The user can create an external table that points to a
specified location within HDFS. In this particular usage, the user can copy a file into the specified location using
the HDFS put or copy commands and create a table pointing to this location with all the relevant row format
information. Once this is done, the user can transform the data and insert them into any other Hive table. For
example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this
needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can
achieve this:
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054' LINES TERMINATED BY '\012'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
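After copying the raw file into the staging location (e.g. with hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view), the transform-and-load step can be sketched as follows; the two NULLs stand in for target columns the staging data does not carry, an assumption for illustration:

```sql
-- Filter the staging table to US rows and load them into the US partition.
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
       null, null, pvs.ip   -- columns absent from the staging data
WHERE pvs.country = 'US';
```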
This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so
that the data can be queried and manipulated using Hive.
Additionally, the system also supports syntax that can load the data from a file in the local file system directly into
a Hive table, where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already
contains the data for US, then we do not need any additional filtering as shown in the previous example. The load
in this case can be done using the following syntax:
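A sketch of that syntax; the partition names follow the earlier DDL examples:

```sql
-- LOCAL: the path refers to the local file system, and the file is copied into the table.
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
```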
The path argument can take a directory (in which case all the files in the directory are loaded), a single file name,
or a wildcard (in which case all the matching files are uploaded). If the argument is a directory - it cannot contain
subdirectories. Similarly - the wildcard must match file names only.
In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of
the data (using tools that are external to Hive). Once the file is in HDFS - the following syntax can be used to load
the data into a Hive table:
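Once the file is in HDFS, the load can be sketched as below; note the absence of the LOCAL keyword, since the path now refers to HDFS (the exact path is an assumption):

```sql
-- Without LOCAL, the path is an HDFS path and the file is moved into the table.
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt'
INTO TABLE page_view PARTITION(dt='2008-06-08', country='US');
```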
It is assumed that the array and map fields in the input.txt files are null fields for these examples.
Simple Query
For all the active users, one can use a query of the following form:
INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;
Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect
these results and even dump them to a local file. You can also run the following query on Hive CLI:
SELECT user.*
FROM user
WHERE user.active = 1;
The result will be internally written to a temporary file and displayed on the Hive client side.
Partition Based Query
For example, to select the page views in March 2008 referred from xyz.com:
INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';
Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date
DATETIME, country STRING); if you name your partition something different, don't expect .date to do what you
think!
Joins
In order to get a demographic breakdown (by gender) of the page views on 2008-03-03, one would need to join
the page_view table and the user table on the userid column. This can be accomplished with a join as shown in
the following query:
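A sketch of that join; the column names follow the schemas introduced earlier, and the pv_users target table is assumed to exist with matching columns:

```sql
-- Join page views to user demographics on userid for a single day.
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
```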
In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER
keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For
example, in order to do a full outer join in the query above, the corresponding syntax would look like the following
query:
INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.*
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN, as illustrated by the
following example:
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
In order to join more than one table, the user can use the following syntax:
INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';
Aggregations
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT
columns. E.g., while the following is possible:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
the following query is not allowed, because it uses two different DISTINCT columns:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
Multi Table/File Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to Hadoop DFS
files. For example, the following query computes the gender breakdown and the age breakdown in a single scan
of pv_users:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;
The first insert clause sends the results of the first group by to a Hive table while the second one sends the results
to a Hadoop DFS file.
Dynamic-partition Insert
In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted
in one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement, as
illustrated below:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';
In order to load data into all country partitions in a particular day, you have to add an insert statement for each
country in the input data. This is very inconvenient since you have to have prior knowledge of the list of
countries that exist in the input data and create the partitions beforehand. If the list changes for another day, you
have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert
statement may be turned into a MapReduce job.
Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically
determining which partitions should be created and populated while scanning the input table. This is a newly
added feature that is only available from version 0.6.0. In a dynamic partition insert, the input column values are
evaluated to determine which partition each row should be inserted into. If that partition has not been created, it
will be created automatically. Using this feature you need only one insert statement to create and populate all
necessary partitions. In addition, since there is only one insert statement, there is only one corresponding
MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared
with the multiple-insert case.
Below is an example of loading data into all country partitions using one insert statement:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
country appears in the PARTITION specification, but with no value associated. In this case, country is a
dynamic partition column. On the other hand, dt has a value associated with it, which means it is a
static partition column. If a column is a dynamic partition column, its value will come from the input
column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause,
because the partition column order indicates their hierarchical order (meaning dt is the root partition, and
country is the child partition). You cannot specify a partition clause with (dt, country='US'), because that
would mean you need to update all partitions, for any date, whose country sub-partition is 'US'.
An additional pvs.country column is added in the select statement. This is the corresponding input column
for the dynamic partition column. Note that you do not need to add an input column for the static partition
column because its value is already known in the PARTITION clause. Note that the dynamic partition
values are selected by ordering, not by name, and are taken as the last columns of the select clause.
When non-empty partitions already exist for the dynamic partition columns (e.g., country='CA' exists
under some dt root partition), a partition will be overwritten if the dynamic partition insert saw the same
value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the
partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.
Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS
path format (a URI in Java). Any character having a special meaning in a URI (e.g., '%', ':', '/', '#') will be
escaped with '%' followed by the 2-digit hexadecimal representation of its ASCII value.
If the input column is of a type different from STRING, its value will first be converted to STRING to be
used to construct the HDFS path.
If the input column value is NULL or an empty string, the row will be put into a special partition, whose name
is controlled by the Hive parameter hive.exec.default.partition.name. The default value is
__HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose values are not
valid partition names. The caveat of this approach is that the bad value is lost and replaced by
__HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a solution to let the user
specify a "bad file" to retain the input partition column values as well.
Dynamic partition insert could potentially be a resource hog, in that it could generate a large number of
partitions in a short time. To guard against this, we define three parameters:
hive.exec.max.dynamic.partitions.pernode (default value 100) is the maximum number of dynamic
partitions that can be created by each mapper or reducer. If one mapper or reducer creates more
than this threshold, a fatal error will be raised from the mapper/reducer (through a counter) and
the whole job will be killed.
hive.exec.max.dynamic.partitions (default value 1000) is the total number of dynamic
partitions that can be created by one DML. If no individual mapper/reducer exceeded its limit but the
total number of dynamic partitions does, then an exception is raised at the end of the job before
the intermediate data are moved to the final destination.
hive.exec.max.created.files (default value 100000) is the maximum total number of files
created by all mappers and reducers. This is implemented by each mapper/reducer updating a
Hadoop counter whenever a new file is created. If the total number exceeds
hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.
Another situation we want to protect against is a user accidentally specifying all partitions to be
dynamic partitions without specifying one static partition, when the original intention is just to overwrite
the sub-partitions of one root partition. We define another parameter,
hive.exec.dynamic.partition.mode=strict, to prevent the all-dynamic-partition case. In strict mode, you
have to specify at least one static partition. The default mode is strict. In addition, we have a parameter
hive.exec.dynamic.partition=true/false to control whether to allow dynamic partitions at all. The default
value is false.
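Before an insert that uses only dynamic partition columns, the settings above have to be relaxed; a sketch of the session setup (the per-node limit shown is just the default restated):

```sql
set hive.exec.dynamic.partition=true;            -- allow dynamic partitions at all
set hive.exec.dynamic.partition.mode=nonstrict;  -- permit inserts with no static partition
set hive.exec.max.dynamic.partitions.pernode=100; -- per-mapper/reducer cap (default)
```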
In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or
hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic
partition inserts is supported in Hive 0.7 (see JIRA HIVE-1307 for details).
As stated above, if too many dynamic partitions are created by a particular mapper/reducer, a fatal
error could be raised and the job will be killed. The error message looks something like:
hive> set hive.exec.dynamic.partition.mode=nonstrict;
...
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
...
The problem here is that one mapper will take a random set of rows, and it is very likely that the number of
distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way
around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the
reducers where the dynamic partitions will be created. In this case the number of distinct dynamic
partitions will be significantly reduced. The above example query could be rewritten to:
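The rewrite routes rows to reducers by their target partition with a DISTRIBUTE BY clause; a sketch, where the expression deriving dt from the view time is an assumption for illustration:

```sql
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url,
              null, null, pvs.ip,
              from_unixtime(pvs.viewTime, 'yyyy-MM-dd') dt,  -- derive the partition date
              pvs.country
       DISTRIBUTE BY dt, country;  -- each reducer sees few distinct (dt, country) pairs
```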
This query will generate a MapReduce job rather than a map-only job. The SELECT clause will be
converted to a plan for the mappers, and the output will be distributed to the reducers based on the value
of the (dt, country) pairs. The INSERT clause will be converted to the plan in the reducer, which writes to the
dynamic partitions.
To write the query output to a local directory instead of a Hive table, one can use:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;
Sampling
The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently
the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE
statement. In the following example we choose 3rd bucket out of the 32 buckets of the pv_gender_sum table:
INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
y has to be a multiple or divisor of the number of buckets in that table as specified at table creation time. The
buckets chosen are determined by whether bucket_number modulo y is equal to x. So in the above example the
following TABLESAMPLE clause
TABLESAMPLE(BUCKET 3 OUT OF 16)
would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.
Union all
The language also supports union all, e.g. if we suppose there are two different tables that track which user has
published a video and which user has published a comment, the following query joins the results of a union all
with the user table to create a single annotated stream for all the video publishing and comment publishing
events:
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.uid
FROM (
    SELECT av.uid AS uid FROM action_video av
    UNION ALL
    SELECT ac.uid AS uid FROM action_comment ac
) actions JOIN users u ON (u.id = actions.uid);
Array Operations
Array columns in tables can currently only be created programmatically. We will be extending this soon to be
available as part of the CREATE TABLE statement. For the purpose of the current example assume that pv.friends is
of the type array<INT>, i.e. it is an array of integers. The user can get a specific element in the array by its index as
shown in the following command:
SELECT pv.friends[2]
FROM page_views pv;
The select expression gets the third item in the pv.friends array.
The user can also get the length of the array using the size function as shown below:
SELECT pv.userid, size(pv.friends)
FROM page_views pv;
Map (Associative Array) Operations
Maps provide collections similar to associative arrays. Such structures can also only be created
programmatically at present. Assuming pv.properties is of the type map<String, String>, the expression
pv.properties['page_type']
can be used to select the 'page_type' property from the page_views table.
Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the
following query:
SELECT size(pv.properties)
FROM page_views pv;
Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the
standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug
information to standard error, which will be shown on the task detail page in Hadoop.
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
Sample map script (map_script), which reads a TAB-separated userid and unixtime from standard input and
writes out the userid and the day of week, TAB-separated:

import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, str(weekday)]))
Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query
could also have been written as such:
Schema-less map/reduce: if there is no "AS" clause after "USING map_script", Hive assumes the output of the
script contains two parts: a key, which is before the first tab, and a value, which is the rest after the first tab. Note
that this is different from specifying "AS key, value", because in that case value will only contain the portion
between the first tab and the second tab if there are multiple tabs.
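The difference can be sketched in Python (the input line here is a hypothetical record with three tab-separated fields):

```python
line = "k\tv1\tv2"

# Schema-less: the key is everything before the first tab,
# the value is everything after it (including further tabs).
key, value = line.split("\t", 1)

# "AS key, value": each named column consumes one tab-separated field,
# so value only gets the portion between the first and second tabs.
fields = line.split("\t")
as_key, as_value = fields[0], fields[1]

print(value)     # v1<TAB>v2  (the whole remainder)
print(as_value)  # v1         (only the second field)
```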
In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output.
The user still needs to know the reduce output schema, because it has to match what is in the table that we are
inserting into.
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.key, map_output.value
  USING 'reduce_script'
  AS date, count;
Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so
the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix
of sort columns, but that is not required.
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS c1, c2, c3
  DISTRIBUTE BY c2
  SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.c1, map_output.c2, map_output.c3
  USING 'reduce_script'
  AS date, count;
Co-Groups
Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from
multiple tables is sent to a custom reducer such that the rows are grouped by the values of certain columns on
the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive
query language in the following way. Suppose we wanted to cogroup the rows from the action_video and
action_comment tables on the uid column and send them to the 'reduce_script' custom reducer; the following
syntax can be used:
FROM (
  FROM (
    FROM action_video av
    SELECT av.uid AS uid, av.id AS id, av.date AS date
    UNION ALL
    FROM action_comment ac
    SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
  ) union_actions
  SELECT union_actions.uid, union_actions.id, union_actions.date
  CLUSTER BY union_actions.uid) map
INSERT OVERWRITE TABLE actions_reduced
  SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);
Altering Tables
To rename an existing table to a new name. If a table with the new name already exists, an error is returned:
ALTER TABLE old_table_name RENAME TO new_table_name;
To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for
each preexisting column:
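The statement itself is missing from this copy; a sketch using REPLACE COLUMNS (the table and column names here are hypothetical) would be:

```sql
-- Rename columns a, b, c to a1, b1, c1, keeping the same types
-- and listing an entry for every preexisting column.
ALTER TABLE test_change REPLACE COLUMNS (a1 INT, b1 INT, c1 INT);
```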
Note that a change in the schema (such as adding columns) preserves the schema for the old partitions of the
table, in case it is a partitioned table. All queries that access these columns and run over the old partitions
implicitly return a null value or the specified default values for these columns.
In later versions, we can make the behavior of assuming certain values (as opposed to throwing an error when
the column is not found in a particular partition) configurable.
Note that any data for this table or partitions will be dropped and may not be recoverable.
LanguageManual
Hive CLI
$HIVE_HOME/bin/hive is a shell utility which can be used to run Hive queries in either interactive or batch mode.
Hive Command Line Options
Usage:
However, -i can be used with any other options. Multiple instances of -i can be
used to execute multiple init scripts.
Example of dumping data out from a query into a file using silent mode:
$HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt
Example of running a script non-interactively:
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
Use ";" (semicolon) to terminate commands. Comments in scripts can be specified using the "--" prefix.
Command              Description
quit or exit         Use quit or exit to leave the interactive shell.
set <key>=<value>    Sets the value of a particular configuration variable. Note that if you
                     misspell the variable name, the CLI will not show an error.
set                  Prints the list of configuration variables overridden by the user or Hive.
list FILE <value>*   Checks whether the given resources have already been added.
dfs <dfs command>    Executes a dfs command from the Hive shell.
! <command>          Executes a shell command from the Hive shell.
Sample Usage:
hive> set mapred.reduce.tasks=32;
hive> set;
hive> !ls;
Logging
Hive uses log4j for logging. These logs are not emitted to the standard output by default but are instead captured
to a log file specified by Hive's log4j properties file. By default Hive will use hive-log4j.default in the conf/
directory of the hive installation which writes out logs to /tmp/<userid>/hive.log and uses the WARN level.
It is often desirable to emit the logs to the standard output and/or change the logging level for debugging
purposes. This can be done from the command line as follows:
$HIVE_HOME/bin/hive -hiveconf hive.root.logger=INFO,console
hive.root.logger specifies the logging level as well as the log destination. Specifying console as the target
sends the logs to the standard error (instead of the log file).
Hive Resources
Hive can manage the addition of resources to a session where those resources need to be made available at
query execution time. Any locally accessible file can be added to the session. Once a file is added to a session,
a Hive query can refer to this file by its name (in map/reduce/transform clauses), and the file is available locally at
execution time on the entire Hadoop cluster. Hive uses Hadoop's Distributed Cache to distribute the added files to
all the machines in the cluster at query execution time.
Usage:
ADD { FILE[S] | JAR[S] | ARCHIVE[S] } <filepath1> [<filepath2>]*
FILE resources are just added to the distributed cache. Typically, this might be something like a transform
script to be executed.
JAR resources are also added to the Java classpath. This is required in order to reference objects they
contain such as UDF's.
ARCHIVE resources are automatically unarchived as part of distributing them.
Example:
hive> add FILE /tmp/tt.py;
hive> list FILES;
/tmp/tt.py
hive> from networks a MAP a.networkid USING 'python tt.py' as nn where a.ds =
'2009-01-04' limit 10;
It is not necessary to add files to the session if the files used in a transform script are already available on all
machines in the Hadoop cluster under the same path name. For example:
... MAP a.networkid USING 'wc -l' ...: here wc is an executable available on all machines.
... MAP a.networkid USING '/home/nfsserv1/hadoopscripts/tt.py' ...: here tt.py may be accessible via an NFS
mount point that is configured identically on all the cluster nodes.
Literals
Integral types
Integral literals are assumed to be INT by default, unless the number exceeds the range of INT, in which case it is
interpreted as a BIGINT, or unless one of the following postfixes is present on the number:

Type       Postfix   Example
TINYINT    Y         100Y
SMALLINT   S         100S
BIGINT     L         100L
String types
String literals can be expressed with either single quotes (') or double quotes ("). Hive uses C-style escaping
within the strings.
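For instance (the table name here is hypothetical), both quoting styles and C-style escapes for a tab and an embedded quote:

```sql
SELECT 'col1\tcol2', "it\'s a string" FROM some_table LIMIT 1;
```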
Create/Drop Database
Create Database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
The uses of SCHEMA and DATABASE are interchangeable; they mean the same thing.
Drop Database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
The uses of SCHEMA and DATABASE are interchangeable; they mean the same thing.
Create/Drop Table
Create Table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[COMMENT table_comment]
[LOCATION hdfs_path]
[AS select_statement] (Note: this feature is only available starting with 0.5.0,
and is not supported when creating external tables.)
LIKE existing_table_or_view_name
[LOCATION hdfs_path]
data_type
: primitive_type
| array_type
| map_type
| struct_type
primitive_type
: TINYINT
| SMALLINT
| INT
| BIGINT
| BOOLEAN
| FLOAT
| DOUBLE
| STRING
array_type
: ARRAY < data_type >
map_type
: MAP < primitive_type, data_type >
struct_type
: STRUCT < col_name : data_type [COMMENT col_comment], ... >
row_format
: DELIMITED [FIELDS TERMINATED BY char] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
| SERDE serde_name [WITH SERDEPROPERTIES (property_name = property_value, ...)]
file_format
: SEQUENCEFILE
| TEXTFILE
| RCFILE (Note: only available starting with 0.6.0)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
CREATE TABLE creates a table with the given name. An error is thrown if a table or view with the same name
already exists. You can use IF NOT EXISTS to skip the error.
The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default
location for this table. This comes in handy if you already have data generated. When dropping an EXTERNAL
table, data in the table is NOT deleted from the file system.
The LIKE form of CREATE TABLE allows you to copy an existing table definition exactly (without copying its
data). Before 0.8.0, CREATE TABLE LIKE view_name would make a copy of the view. From 0.8.0, CREATE
TABLE LIKE view_name creates a table by adopting the schema of view_name (fields and partition columns)
using defaults for serde and file formats.
You can create tables with custom SerDe or using native SerDe. A native SerDe is used if ROW FORMAT is not
specified or ROW FORMAT DELIMITED is specified. You can use the DELIMITED clause to read delimited files.
Use the SERDE clause to create a table with custom SerDe. Refer to SerDe section of the User Guide for more
information on SerDe.
You must specify a list of columns for tables that use a native SerDe. Refer to the Types part of the User Guide
for the allowable column types. A list of columns for tables that use a custom SerDe may be specified, but Hive
will query the SerDe to determine the actual list of columns for this table.
Use STORED AS TEXTFILE if the data needs to be stored as plain text files. Use STORED AS SEQUENCEFILE
if the data needs to be compressed. Please read more about CompressedStorage if you are planning to keep
data compressed in your Hive tables. Use INPUTFORMAT and OUTPUTFORMAT to specify the name of a
corresponding InputFormat and OutputFormat class as a string literal, e.g.
'org.apache.hadoop.hive.contrib.fileformat.base64.Base64TextInputFormat'.
Use STORED BY to create a non-native table, for example in HBase. See StorageHandlers for more information
on this option.
Partitioned tables can be created using the PARTITIONED BY clause. A table can have one or more partition
columns and a separate data directory is created for each distinct value combination in the partition columns.
Further, tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that
bucket via SORT BY columns. This can improve performance on certain kinds of queries.
If, when creating a partitioned table, you get this error: "FAILED: Error in semantic analysis: Column repeated in
partitioning columns," it means you are trying to include the partitioned column in the data of the table itself. You
probably really do have the column defined. However, the partition you create makes a pseudocolumn on which
you can query, so you must rename your table column to something else (that users should not query on!).
For example, suppose your original unpartitioned table was:
create table table_name (
  id int,
  date date,
  name varchar
);
Now you want to partition on date. Your Hive definition would be this:
create table table_name (
  id int,
  dtDontQuery string,
  name string
)
partitioned by (date string);
Now your users will still query on "where date = '...'", but the second column (dtDontQuery) will hold the original
values.
Table names and column names are case insensitive but SerDe and property names are case sensitive. Table
and column comments are string literals (single-quoted). The TBLPROPERTIES clause allows you to tag the
table definition with your own metadata key/value pairs.
Tables can also be created and populated by the results of a query in one create-table-as-select (CTAS)
statement. The table created by CTAS is atomic, meaning that the table is not seen by other users until all the
query results are populated. So other users will either see the table with the complete results of the query or will
not see the table at all.
There are two parts to CTAS: the SELECT part can be any SELECT statement supported by HiveQL, while the
CREATE part takes the resulting schema from the SELECT part and creates the target table with
other table properties such as the SerDe and storage format. The only restriction in CTAS is that the target table
cannot be a partitioned table (nor can it be an external table).
Examples:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
The statement above creates the page_view table with viewTime, userid, page_url, referrer_url, and ip columns
(including comments). The table is also partitioned and data is stored in sequence files. The data format in the
files is assumed to be field-delimited by ctrl-A and row-delimited by newline.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\001'
STORED AS SEQUENCEFILE;
The above statement lets you create the same table as the previous table, this time specifying the field delimiter
explicitly.
CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
In the example above, the page_view table is bucketed (clustered by) userid, and within each bucket the data is
sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the
clustered column, in this case userid. The sorting property allows internal operators to take advantage of the
better-known data structure while evaluating queries, also increasing efficiency. MAP KEYS and COLLECTION
ITEMS keywords can be used if any of the columns are lists or maps.
In all the examples up to now, the data is stored in <hive.metastore.warehouse.dir>/page_view. The warehouse
root directory is specified by the key hive.metastore.warehouse.dir in the Hive config file hive-site.xml.
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
You can use the above statement to create a page_view table which points to any HDFS location for its storage.
But you still have to make sure that the data is delimited as specified in the query above.
CREATE TABLE new_key_value_store
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
  STORED AS RCFile AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;
The above CTAS statement creates the target table new_key_value_store with the schema (new_key DOUBLE,
key_value_pair STRING) derived from the results of the SELECT statement. If the SELECT statement does not
specify column aliases, the column names will be automatically assigned to _col0, _col1, _col2, and so on. In
addition, the new target table is created using a specific SerDe and a storage format independent of the source
tables in the SELECT statement.
CREATE TABLE empty_key_value_store
LIKE key_value_store;
In contrast, the statement above creates a new empty_key_value_store table whose definition exactly matches
the existing key_value_store in all particulars other than table name. The new table contains no rows.
Drop Table
DROP TABLE [IF EXISTS] table_name
DROP TABLE removes metadata and data for this table. The data is actually moved to the .Trash/Current
directory if Trash is configured. The metadata is completely lost.
When dropping an EXTERNAL table, data in the table will NOT be deleted from the file system.
When dropping a table referenced by views, no warning is given (the views are left dangling as invalid and must
be dropped or recreated by the user).
See the next section on ALTER TABLE for how to drop partitions.
Otherwise, the table information is removed from the metastore and the raw data is removed as if by 'hadoop dfs -
rm'. In many cases, this results in the table data being moved into the user's .Trash folder in their home directory;
users who mistakenly DROP TABLEs may thus be able to recover their lost data by re-creating a table
with the same schema, re-creating any necessary partitions, and then moving the data back into place manually
using Hadoop. This solution is subject to change over time or across installations as it relies on the underlying
implementation; users are strongly encouraged not to drop tables capriciously.
In Hive 0.7.0 or later, DROP returns an error if the table doesn't exist, unless IF EXISTS is specified or the
configuration variable hive.exec.drop.ignorenonexistent is set to true.
Add Partitions
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION
'location1'] partition_spec [LOCATION 'location2'] ...
partition_spec
: PARTITION (partition_col = partition_col_value, partition_col = partition_col_value, ...)
You can use ALTER TABLE ADD PARTITION to add partitions to a table. Partition values should be quoted only
if they are strings. The location must be a directory inside of which data files reside.
Note that it is proper syntax to have multiple partition_spec in a single ALTER TABLE, but if you do this in version
0.7, your partitioning scheme will fail. That is, every query specifying a partition will always use only the first
partition. Instead, you should use the following form if you want to add many partitions:
...
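The recommended form is elided in this copy; a sketch with one partition per statement (the table name and paths are hypothetical) would be:

```sql
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us')
  LOCATION '/path/to/us/part080808';
ALTER TABLE page_view ADD PARTITION (dt='2008-08-09', country='us')
  LOCATION '/path/to/us/part080809';
```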
Specifically, the following example (which was the default example before) will FAIL silently and without error, and
all queries will go only to the dt='2008-08-08' partition, no matter which partition you specify.
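The failing example is also missing here; based on the surrounding description, it would be a single statement adding both partitions at once (again with hypothetical paths):

```sql
-- In version 0.7, every query will silently use only the dt='2008-08-08' partition:
ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us')
                          LOCATION '/path/to/us/part080808'
                      PARTITION (dt='2008-08-09', country='us')
                          LOCATION '/path/to/us/part080809';
```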
An error is thrown if the partition_spec for the table already exists. You can use IF NOT EXISTS to skip the error.
Drop Partitions
ALTER TABLE table_name DROP [IF EXISTS] partition_spec, partition_spec,...
You can use ALTER TABLE DROP PARTITION to drop a partition for a table. This removes the data and
metadata for this partition.
In Hive 0.7.0 or later, DROP returns an error if the partition doesn't exist, unless IF EXISTS is specified or the
configuration variable hive.exec.drop.ignorenonexistent is set to true.
Rename Table
ALTER TABLE table_name RENAME TO new_table_name
This statement lets you change the name of a table to a different name.
As of version 0.6, a rename on a managed table moves its HDFS location as well. (Older Hive versions just
renamed the table in the metastore without moving the HDFS location.)
Change Column Name/Type/Position/Comment
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type
[COMMENT col_comment] [FIRST|AFTER column_name]
This command will allow users to change a column's name, data type, comment, or position, or an arbitrary
combination of them. For example, suppose the table test_change has columns (a int, b int, c int):
"ALTER TABLE test_change CHANGE a a1 INT;" will change column a's name to a1.
"ALTER TABLE test_change CHANGE a a1 STRING AFTER b;" will change column a's name to a1, a's data type
to string, and put it after column b. The new table's structure is: b int, a1 string, c int.
"ALTER TABLE test_change CHANGE b b1 INT FIRST;" will change column b's name to b1, and put it as the first
column. The new table's structure is: b1 int, a string, c int.
NOTE: The column change command will only modify Hive's metadata, and will NOT touch data. Users should
make sure the actual data layout conforms with the metadata definition.
Add/Replace Columns
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT
col_comment], ...)
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for
tables with native serde (DynamicSerDe or MetadataTypedColumnsetSerDe). Refer to SerDe section of User
Guide for more information. REPLACE COLUMNS can also be used to drop columns. For example:
"ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column `c' from test_change's
schema. Note that this does not delete underlying data, it just changes the schema.
Alter Table Properties
ALTER TABLE table_name SET TBLPROPERTIES table_properties
table_properties
: (property_name = property_value, property_name = property_value, ... )
You can use this statement to add your own metadata to the tables. Currently last_modified_user,
last_modified_time properties are automatically added and managed by Hive. Users can add their own properties
to this list. You can do DESCRIBE EXTENDED TABLE to get this information.
Add SerDe Properties
ALTER TABLE table_name SET SERDE serde_class_name [WITH SERDEPROPERTIES serde_properties]
ALTER TABLE table_name SET SERDEPROPERTIES serde_properties
serde_properties
: (property_name = property_value, property_name = property_value, ... )
This statement enables you to add user defined metadata to table SerDe object. The serde properties are passed
to the table's SerDe when it is being initialized by Hive to serialize and deserialize data. So users can store any
information required for their custom serde here. Refer to SerDe section of Users Guide for more information.
Alter Table (or Partition) File Format
ALTER TABLE table_name [PARTITION partition_spec] SET FILEFORMAT file_format
This statement changes the table's (or partition's) file format. For available file_format options, see the section
above on CREATE TABLE.
NOTE: These commands will only modify Hive's metadata, and will NOT reorganize or reformat existing data.
Users should make sure the actual data layout conforms with the metadata definition.
Touch
ALTER TABLE table_name TOUCH [PARTITION partition_spec];
TOUCH reads the metadata and writes it back. This has the effect of causing the pre/post execute hooks to fire.
An example use case is if you have a hook that logs all the tables/partitions that were modified, along with an
external script that alters the files on HDFS directly. Since the script modifies files outside of hive, the modification
wouldn't be logged by the hook. The external script could call TOUCH to fire the hook and mark the said table or
partition as modified.
Also, it may be useful later if we incorporate reliable last modified times. Then touch would update that time as
well.
Note that TOUCH doesn't create a table or partition if it doesn't already exist. (See Create Table)
Archiving is a feature that moves a partition's files into a Hadoop Archive (HAR). Note that only the file count will
be reduced; HAR does not provide any compression. See LanguageManual Archiving for more information.
Protection on data can be set at either the table or partition level. Enabling NO_DROP prevents a table or
partition from being dropped. Enabling OFFLINE prevents the data in a table or partition from being queried, but
the metadata can still be accessed. Note that if any partition in a table has NO_DROP enabled, the table cannot
be dropped either.
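The statements themselves are not shown in this copy; assuming the standard ALTER TABLE protection syntax (the table name and partition spec are hypothetical), they look like:

```sql
ALTER TABLE table_name ENABLE NO_DROP;
ALTER TABLE table_name DISABLE NO_DROP;
ALTER TABLE table_name ENABLE OFFLINE;
ALTER TABLE table_name PARTITION (ds='2008-08-08') ENABLE NO_DROP;
```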
Create/Drop View
Note: View support is only available starting in Hive 0.6.
Create View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)
]
[COMMENT view_comment]
AS SELECT ...
CREATE VIEW creates a view with the given name. An error is thrown if a table or view with the same name
already exists. You can use IF NOT EXISTS to skip the error.
If no column names are supplied, the names of the view's columns will be derived automatically from the defining
SELECT expression. (If the SELECT contains unaliased scalar expressions such as x+y, the resulting view
column names will be generated in the form _C0, _C1, etc.) When renaming columns, column comments can also
optionally be supplied. (Comments are not automatically inherited from underlying columns.)
A CREATE VIEW statement will fail if the view's defining SELECT expression is invalid.
Note that a view is a purely logical object with no associated storage. (No support for materialized views is
currently available in Hive.) When a query references a view, the view's definition is evaluated in order to produce
a set of rows for further processing by the query. (This is a conceptual description; in fact, as part of query
optimization, Hive may combine the view's definition with the query's, e.g. pushing filters from the query down into
the view.)
A view's schema is frozen at the time the view is created; subsequent changes to underlying tables (e.g. adding a
column) will not be reflected in the view's schema. If an underlying table is dropped or changed in an incompatible
fashion, subsequent attempts to query the invalid view will fail.
Views are read-only and may not be used as the target of LOAD/INSERT/ALTER. For changing metadata, see
ALTER VIEW.
A view may contain ORDER BY and LIMIT clauses. If a referencing query also contains these clauses, the query-
level clauses are evaluated after the view clauses (and after any other operations in the query). For example, if a
view specifies LIMIT 5, and a referencing query is executed as (select * from v LIMIT 10), then at most 5 rows will
be returned.
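For example (the view v and table t here are hypothetical):

```sql
CREATE VIEW v AS SELECT * FROM t LIMIT 5;
-- The view's LIMIT 5 is applied before the query's LIMIT 10,
-- so at most 5 rows come back:
SELECT * FROM v LIMIT 10;
```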
Example:
CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')
COMMENT 'Referrers to The Onion website'
AS
SELECT DISTINCT referrer_url
FROM page_view
WHERE page_url='http://www.theonion.com';
Drop View
DROP VIEW [IF EXISTS] view_name
DROP VIEW removes metadata for the specified view. (It is illegal to use DROP TABLE on a view.)
When dropping a view referenced by other views, no warning is given (the dependent views are left dangling as
invalid and must be dropped or recreated by the user).
In Hive 0.7.0 or later, DROP returns an error if the view doesn't exist, unless IF EXISTS is specified or the
configuration variable hive.exec.drop.ignorenonexistent is set to true.
Example:
DROP VIEW onion_referrers;
Alter View Properties
ALTER VIEW view_name SET TBLPROPERTIES table_properties
table_properties
: (property_name = property_value, property_name = property_value, ... )
As with ALTER TABLE, you can use this statement to add your own metadata to a view.
Create/Drop Function
Create Function
CREATE TEMPORARY FUNCTION function_name AS class_name
This statement lets you create a function that is implemented by the class_name. You can use this function in
Hive queries for as long as the session lasts. You can use any class that is in the class path of Hive. You can add
jars to the class path by executing 'ADD JAR' statements. Please refer to the CLI section in the User Guide for
more information on how to add/delete files from the Hive classpath. Using this, you can register User Defined
Functions (UDF's).
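As an illustration (the jar path and class name here are hypothetical):

```sql
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.Lower';
SELECT my_lower(page_url) FROM page_view LIMIT 10;
```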
Drop Function
You can unregister a UDF as follows:
DROP TEMPORARY FUNCTION [IF EXISTS] function_name
In Hive 0.7.0 or later, DROP returns an error if the function doesn't exist, unless IF EXISTS is specified or the
configuration variable hive.exec.drop.ignorenonexistent is set to true.
Create/Drop Index
Not available until 0.7 release
Create Index
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name = property_value, ...)]
[IN TABLE index_table_name]
[[ROW FORMAT ...] STORED AS ...
| STORED BY ...]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
CREATE INDEX creates an index on a table using the given list of columns as keys. See
http://wiki.apache.org/hadoop/Hive/IndexDev#CREATE_INDEX
Drop Index
DROP INDEX [IF EXISTS] index_name ON table_name
DROP INDEX drops the index, as well as deleting the index table.
In Hive 0.7.0 or later, DROP returns an error if the index doesn't exist, unless IF EXISTS is specified or the
configuration variable hive.exec.drop.ignorenonexistent is set to true.
Show/Describe Statements
These statements provide a way to query the Hive metastore for existing data and metadata accessible to this
Hive system.
Show Databases
SHOW (DATABASES|SCHEMAS) [LIKE identifier_with_wildcards];
SHOW DATABASES lists all of the databases defined in the metastore. The optional LIKE clause allows the list of
databases to be filtered using a regular expression. The regular expression may only contain '*' for any
character(s) or '|' for a choice. Examples are 'employees', 'emp*', 'emp*|*ees', all of which will match the
database named 'employees'.
Show Tables
SHOW TABLES identifier_with_wildcards
SHOW TABLES lists all the base tables and views with names matching the given regular expression. The regular
expression can only contain '*' for any character(s) or '|' for a choice. Examples are 'page_view', 'page_v*',
'*view|page*', all of which will match the 'page_view' table. Matching tables are listed in alphabetical order. It is
not an error if there are no matching tables found in the metastore.
Show Partitions
SHOW PARTITIONS table_name
SHOW PARTITIONS lists all the existing partitions for a given base table. Partitions are listed in alphabetical
order.
It is also possible to specify parts of a partition specification to filter the resulting list. Note: This feature is only
available starting in version 0.6.
SHOW PARTITIONS table_name PARTITION(ds='2010-03-03');
Show Functions
SHOW FUNCTIONS "a.*"
SHOW FUNCTIONS lists all the user defined and builtin functions matching the regular expression. To get all
functions use ".*"
Show Indexes
SHOW [FORMATTED] (INDEX|INDEXES) ON table_with_index [(FROM|IN) db_name]
SHOW INDEXES shows all of the indexes on a certain column, as well as information about them: index name,
table name, names of the columns used as keys, index table name, index type, and comment. If the
FORMATTED keyword is used, then column titles are printed for each column. Not available until 0.7 release.
Describe Database
DESCRIBE DATABASE db_name
DESCRIBE DATABASE will show the name of the database, its comment (if one has been set), and its root
location on the filesystem. Not available until the 0.7 release.
Describe Table/Column
DESCRIBE [EXTENDED|FORMATTED] table_name[DOT col_name ( [DOT field_name] | [DOT
'$elem$'] | [DOT '$key$'] | [DOT '$value$'] )* ]
DESCRIBE TABLE shows the list of columns including partition columns for the given table. If the EXTENDED
keyword is specified then it will show all the metadata for the table in Thrift serialized form. This is generally only
useful for debugging and not for general use. If the FORMATTED keyword is specified, then it will show the
metadata in a tabular format.
If a table has a complex column, then you can examine the attributes of this column by specifying
table_name.complex_col_name (and '$elem$' for array element, '$key$' for map key, and '$value$' for map
value). You can specify this recursively to explore the complex column type.
For a view, DESCRIBE TABLE EXTENDED can be used to retrieve the view's definition. Two relevant attributes
are provided: both the original view definition as specified by the user, and an expanded definition used internally
by Hive.
DESCRIBE [EXTENDED] table_name partition_spec
This statement lists metadata for a given partition. The output is similar to that of DESCRIBE TABLE. Presently,
the column information associated with a particular partition is not used while preparing plans.
Example:
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
Hive Data Manipulation Language
There are two primary ways of modifying data in Hive: loading files into tables, and inserting data into tables via
queries.
Loading files into tables
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION
(partcol1=val1, partcol2=val2 ...)]
Synopsis
Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive
tables.
filepath can be a
relative path, eg: project/data1
absolute path, eg: /user/hive/project/data1
a full URI with scheme and (optionally) an authority, eg:
hdfs://namenode:9000/user/hive/project/data1
The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a
specific partition of the table by specifying values for all of the partitioning columns.
filepath can refer to a file (in which case hive will move the file into the table) or it can be a directory (in
which case hive will move all the files within that directory into the table). In either case filepath
addresses a set of files.
If the keyword LOCAL is specified, then:
the load command will look for filepath in the local file system. If a relative path is specified - it
will be interpreted relative to the current directory of the user. User can specify a full URI for local
files as well - for example: file:///user/hive/project/data1
the load command will try to copy all the files addressed by filepath to the target filesystem. The
target file system is inferred by looking at the location attribute of the table. The copied data files
will then be moved to the table.
If the keyword LOCAL is not specified, then Hive will use the full URI of filepath if one is specified.
Otherwise the following rules are applied:
If scheme or authority are not specified, Hive will use the scheme and authority from hadoop
configuration variable fs.default.name that specifies the Namenode URI.
If the path is not absolute - then Hive will interpret it relative to /user/<username>
Hive will move the files addressed by filepath into the table (or partition)
if the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and
replaced with the files referred to by filepath. Otherwise the files referred by filepath will be added to
the table.
Note that if the target table (or partition) already has a file whose name collides with any of the
filenames contained in filepath - then the existing file will be replaced with the new file.
Notes
filepath cannot contain subdirectories.
If we are not using the keyword LOCAL, filepath must refer to files within the same filesystem as the
table's (or partition's) location.
Hive does some minimal checks to make sure that the files being loaded match the target table. Currently
it checks that if the table is stored in sequencefile format, the files being loaded are also sequencefiles,
and vice versa.
Please read CompressedStorage if your datafile is compressed
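As an illustrative sketch (the table name, partition columns, and file path below are hypothetical, not taken from this manual), a load into a specific partition of a partitioned table might look like:

```sql
-- Load a local file into one partition of a partitioned table,
-- replacing whatever data that partition currently holds.
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt'
OVERWRITE INTO TABLE page_view
PARTITION (ds='2008-06-08', country='US');
```

Since LOCAL is given, the file is first copied from the local filesystem to the table's filesystem and then moved into place, as described above.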
Syntax
Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...;
Synopsis
INSERT OVERWRITE will overwrite any existing data in the table or partition
INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO
syntax is only available starting in version 0.8.)
Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific
partition of the table by specifying values for all of the partitioning columns.
Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query
The output of each of the select statements is written to the chosen table (or partition). Currently the
OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are
replaced with the output of corresponding select statement.
The output format and serialization class is determined by the table's metadata (as specified via DDL
commands on the table)
In the dynamic partition inserts, users can give partial partition specification, which means you just specify
the list of partition column names in the PARTITION clause. The column values are optional. If a partition
column value is given, we call this static partition, otherwise dynamic partition. Each dynamic partition
column has a corresponding input column from the select statement. This means that the dynamic
partition creation is determined by the value of the input column. The dynamic partition columns must be
specified last among the columns in the SELECT statement and in the same order in which they
appear in the PARTITION() clause.
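As a sketch of a dynamic partition insert (the tables and columns here are illustrative assumptions, not from this manual): ds is a static partition column because it is given a value, while country is dynamic and is resolved from the last column of the SELECT:

```sql
-- The dynamic partition column (country) comes last in the SELECT,
-- matching its position in the PARTITION() clause.
INSERT OVERWRITE TABLE page_view PARTITION (ds='2008-06-08', country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.country
FROM raw_page_views pvs;
```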
Notes
Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables
by scanning the input data just once (and applying different query operators) to the input data.
Syntax
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1 FROM from_statement
Synopsis
directory can be full URI. If scheme or authority are not specified, Hive will use the scheme and authority
from hadoop configuration variable fs.default.name that specifies the Namenode URI.
if LOCAL keyword is used - then Hive will write data to the directory on the local file system.
Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by
newlines. If any of the columns are not of primitive type - then those columns are serialized to JSON
format.
Notes
INSERT OVERWRITE statements to directories, local directories and tables (or partitions) can all be used
together within the same query.
INSERT OVERWRITE statements to HDFS filesystem directories are the best way to extract large amounts
of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job.
The directory is, as you would expect, OVERWRITten, in other words, if the specified path exists, it is
clobbered and replaced with the output.
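A hedged sketch combining directory and table targets in one multi-insert (the table names and path are hypothetical):

```sql
-- One scan of page_views feeds both targets.
FROM page_views pv
INSERT OVERWRITE TABLE pv_us
  SELECT pv.* WHERE pv.country = 'US'
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_extract'
  SELECT pv.userid, pv.page_url;
```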
Select
Select Syntax
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT number]
SELECT * FROM t1
WHERE Clause
The where condition is a boolean [expression]. For example, the following query returns only those sales records
which have an amount greater than 10 from the US region. Hive does not support IN, EXISTS or subqueries in
the WHERE clause.
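The sales query the paragraph refers to did not survive extraction; a minimal sketch, assuming a sales table with amount and region columns, would be:

```sql
SELECT * FROM sales WHERE amount > 10 AND region = "US";
```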
ALL and DISTINCT Clauses
The ALL and DISTINCT options specify whether duplicate rows should be returned. ALL is the default (all
matching rows are returned); DISTINCT specifies removal of duplicate rows from the result set.
SELECT col1, col2 FROM t1
1 3
1 3
1 4
2 5
SELECT DISTINCT col1, col2 FROM t1
1 3
1 4
2 5
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'
If a table page_views is joined with another table dim_users, you can specify a range of partitions in the ON
clause as follows:
SELECT page_views.*
FROM page_views JOIN dim_users
ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')
HAVING Clause
Hive added support for the HAVING clause in version 0.7.0. In older versions of Hive it is possible to achieve the
same effect by using a subquery, e.g:
SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE
t2.col2sum > 10
LIMIT Clause
Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query
returns 5 rows from t1 at random.
SELECT * FROM t1 LIMIT 5
Top k queries. The following query returns the top 5 sales records wrt amount.
SET mapred.reduce.tasks = 1
SELECT * FROM sales SORT BY amount DESC LIMIT 5
Group By Syntax
groupByClause: GROUP BY groupByExpression (, groupByExpression)*
groupByExpression: expression
Simple Examples
In order to count the number of rows in a table:
SELECT COUNT(*) FROM table2;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
In order to count the number of distinct users by gender one could write the following query:
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT
columns. E.g. the following is possible:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
However, the following query is not allowed. We don't allow multiple DISTINCT expressions in the same query.
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
Advanced Features
Multi-Group-By Inserts
The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs
files (which can then be manipulated using hdfs utilities). E.g. if along with the gender breakdown, one needed to
find the breakdown of unique page views by age, one could accomplish that with the following query:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
Syntax of Order By
The ORDER BY syntax in Hive QL is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
Syntax of Sort By
The SORT BY syntax is similar to the syntax of ORDER BY in SQL language.
colOrder: ( ASC | DESC )
Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be
dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If
the column is of string type, then the sort order will be lexicographical order.
Basically, the data in each reducer will be sorted according to the order that the user specified. The following
example shows:
SELECT key, value FROM src SORT BY key ASC, value DESC
The query had 2 reducers. The output of reducer 1 is:
0 5
0 3
3 6
9 1
The output of reducer 2 is:
0 4
0 3
1 1
2 5
SELECT TRANSFORM(value)
USING 'mapper'
USING 'reducer'
AS whatever
For example, we are Distributing By x on the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1
Reducer 1 got
x1
x2
x1
Reducer 2 got
x4
x3
Note that all rows with the same key x1 are guaranteed to be distributed to the same reducer (reducer 1 in this
case), but they are not guaranteed to be clustered in adjacent positions.
In contrast, if we use Cluster By x, the two reducers will further sort rows on x:
Reducer 1 got
x1
x1
x2
Reducer 2 got
x3
x4
Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns
and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but
that is not required.
SELECT col1, col2 FROM t1 CLUSTER BY col1
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script'
  AS c1, c2, c3
  DISTRIBUTE BY c2
  SORT BY c2, c1) map_output
SELECT TRANSFORM(map_output.c1, map_output.c2, map_output.c3)
USING 'reduce_script'
AS date, count;
Transform/Map-Reduce Syntax
Users can also plug in their own custom mappers and reducers in the data stream by using features natively
supported in the Hive language. e.g. in order to run a custom mapper script - map_script - and a custom
reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to
embed the mapper and the reducer scripts.
By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script;
similarly, all NULL values will be converted to the literal string \N in order to differentiate NULL values from empty
strings. The standard output of the user script will be treated as TAB-separated STRING columns, any cell
containing only \N will be re-interpreted as a NULL, and then the resulting STRING column will be cast to the data
type specified in the table declaration in the usual way. User scripts can output debug information to standard
error which will be shown on the task detail page on hadoop. These defaults can be overridden with ROW
FORMAT ....
NOTE: It is your responsibility to sanitize any STRING columns prior to transformation. If your STRING column
contains tabs, an identity transformer will not give you back what you started with! To help with this, see
REGEXP_REPLACE and replace the tabs with some other character on their way into the TRANSFORM() call.
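A hedged sketch of that sanitizing advice (the column name, table, and pass-through script here are hypothetical):

```sql
-- Strip embedded tabs before the value crosses the tab-delimited
-- TRANSFORM boundary, so the column cannot be split unexpectedly.
SELECT TRANSFORM(regexp_replace(description, '\t', ' '))
USING '/bin/cat'
AS cleaned_description
FROM products;
```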
Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In
other words, they serve as comments or notes to the reader of the query. BEWARE: Use of these keywords may
be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase to occur and typing "MAP" does not
force a new map phase!
Please also see Sort By / Cluster By / Distribute By and Larry Ogrodnek's blog post.
clusterBy: CLUSTER BY colName (',' colName)*
distributeBy: DISTRIBUTE BY colName (',' colName)*
sortBy: SORT BY colName (ASC | DESC)? (',' colName (ASC | DESC)?)*
rowFormat
  : ROW FORMAT
    (DELIMITED [FIELDS TERMINATED BY char]
               [COLLECTION ITEMS TERMINATED BY char]
               [MAP KEYS TERMINATED BY char]
               [ESCAPED BY char]
               [LINES SEPARATED BY char]
     |
     SERDE serde_name [WITH SERDEPROPERTIES
                        property_name=property_value,
                        property_name=property_value, ...])
outRowFormat : rowFormat
inRowFormat : rowFormat
outRecordReader : RECORDREADER className
query:
  FROM (
    FROM src
    MAP expression (',' expression)*
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  REDUCE expression (',' expression)*
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?
Equivalently, using SELECT TRANSFORM:
  FROM (
    FROM src
    SELECT TRANSFORM ( expression (',' expression)* )
    (inRowFormat)?
    USING 'my_map_script'
    ( AS colName (',' colName)* )?
    (outRowFormat)? (outRecordReader)?
    ( clusterBy? | distributeBy? sortBy? ) src_alias
  )
  SELECT TRANSFORM ( expression (',' expression)* )
  (inRowFormat)?
  USING 'my_reduce_script'
  ( AS colName (',' colName)* )?
  (outRowFormat)? (outRecordReader)?
Example #1:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.dt, map_output.uid
  USING 'reduce_script'
  AS date, count;
The same query using SELECT TRANSFORM:
FROM (
  FROM pv_users
  SELECT TRANSFORM(pv_users.userid, pv_users.date)
  USING 'map_script'
  AS dt, uid
  CLUSTER BY dt) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  SELECT TRANSFORM(map_output.dt, map_output.uid)
  USING 'reduce_script'
  AS date, count;
Example #2
FROM (
  FROM src
  SELECT TRANSFORM(src.key, src.value) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  USING '/bin/cat'
  AS (tkey, tvalue) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.TypedBytesSerDe'
  RECORDREADER 'org.apache.hadoop.hive.ql.exec.TypedBytesRecordReader'
) tmap
INSERT OVERWRITE TABLE dest1 SELECT tkey, tvalue
Note that we can directly do CLUSTER BY key without specifying the output schema of the scripts.
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script'
  CLUSTER BY key) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.key, map_output.value
  USING 'reduce_script'
  AS date, count;
By default, the output columns of a script are untyped strings:
SELECT TRANSFORM(stuff)
USING 'script'
AS thing1, thing2
The output columns can also be given types, in which case the values are cast accordingly:
SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 INT)
Built-in Operators
Relational Operators
The following operators compare the passed operands and generate a TRUE or FALSE value depending on
whether the comparison between the operands holds.
A <> B (all primitive types): NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B, otherwise FALSE.
A < B (all primitive types): NULL if A or B is NULL, TRUE if expression A is less than expression B, otherwise FALSE.
A <= B (all primitive types): NULL if A or B is NULL, TRUE if expression A is less than or equal to expression B, otherwise FALSE.
A > B (all primitive types): NULL if A or B is NULL, TRUE if expression A is greater than expression B, otherwise FALSE.
A >= B (all primitive types): NULL if A or B is NULL, TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
A LIKE B (strings): NULL if A or B is NULL, TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), while the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). E.g. 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo_ _ _' evaluates to TRUE and so does 'foobar' LIKE 'foo%'.
A RLIKE B (strings): NULL if A or B is NULL, TRUE if string A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. E.g. 'foobar' RLIKE 'foo' evaluates to FALSE, whereas 'foobar' RLIKE '^f.*r$' evaluates to TRUE.
Arithmetic Operators
The following operators support various common arithmetic operations on the operands. All return number types;
if any of the operands are NULL, then the result is also NULL.
A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. E.g. since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): Gives the result of dividing A by B. The result is a double type.
A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.
Logical Operators
The following operators provide support for creating logical expressions. All of them return boolean TRUE,
FALSE, or NULL depending upon the boolean values of the operands. NULL behaves as an "unknown" flag, so if
the result depends on the state of an unknown, the result itself is unknown.
A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or B is NULL.
A OR B (boolean): TRUE if either A or B or both are TRUE; FALSE OR NULL is NULL; otherwise FALSE.
A || B (boolean): Same as A OR B.
struct(val1, val2, val3, ...): Creates a struct with the given field values. Struct field names will be col1, col2, ....
named_struct(name1, val1, name2, val2, ...): Creates a struct with the given field names and values.
array(val1, val2, ...): Creates an array with the given elements.
A[n] (A is an array, n is an int): Returns the nth element in the array A. The first element has index 0. E.g. if A is an array comprising of ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map. E.g. if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): Returns the x field of S. E.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.
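A short sketch exercising these constructors and operators in one query (src is assumed to be any non-empty table, as in the other examples; map() is assumed here as the analogous map constructor):

```sql
SELECT named_struct('foo', 1, 'bar', 2).foo,   -- struct field access
       array('a', 'b', 'c')[1],                -- array indexing (0-based)
       map('f', 'foo', 'b', 'bar')['b']        -- map lookup by key
FROM src LIMIT 1;
```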
Built-in Functions
Mathematical Functions
The following built-in mathematical functions are supported in Hive; most return NULL when the argument(s) are
NULL:
BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a), ceiling(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed will make sure the generated random number sequence is deterministic.
double exp(double a): Returns e^a, where e is the base of the natural logarithm.
double log(double base, double a): Returns the base-"base" logarithm of the argument.
double pow(double a, double p), power(double a, double p): Returns a^p.
string hex(BIGINT a), hex(string a): If the argument is an int, hex returns the number as a string in hex format. Otherwise if the number is a string, it converts each character into its hex representation and returns the resulting string. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex.)
string unhex(string a): Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the character represented by the number.
string conv(BIGINT num, int from_base, int to_base): Converts a number from a given base to another. (See http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv.)
Collection Functions
The following built-in collection functions are supported in hive:
array<K> map_keys(Map<K,V>): Returns an unordered array containing the keys of the input map.
array<V> map_values(Map<K,V>): Returns an unordered array containing the values of the input map.
Type Conversion Functions
<type> cast(expr as <type>): Converts the result of the expression expr to <type>. E.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed.
Date Functions
The following built-in date functions are supported in hive:
string from_unixtime(bigint unixtime[, string format]): Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00".
bigint unix_timestamp(): Gets the current time stamp using the default time zone.
bigint unix_timestamp(string date): Converts a time string in format yyyy-MM-dd HH:mm:ss to a Unix time stamp, returns 0 on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801.
string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date), dayofmonth(date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
int hour(string date): Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12.
int datediff(string enddate, string startdate): Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
Conditional Functions
Return type, name (signature), and description:
T COALESCE(T v1, T v2, ...): Returns the first v that is not NULL, or NULL if all v's are NULL.
T CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END: When a = b, returns c; when a = d, returns e; else returns f.
T CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END: When a = true, returns b; when c = true, returns d; else returns e.
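A brief sketch of both CASE forms (the orders table and its columns are hypothetical):

```sql
SELECT CASE status WHEN 'A' THEN 'active' WHEN 'D' THEN 'deleted' ELSE 'unknown' END,
       CASE WHEN amount > 100 THEN 'large' ELSE 'small' END
FROM orders;
```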
String Functions
The following built-in String functions are supported in Hive:
string concat(string A, string B...): Returns the string resulting from concatenating the strings passed in as parameters in order. E.g. concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.
string concat_ws(string SEP, string A, string B...): Like concat() above, but with custom separator SEP.
string substr(string A, int start), substring(string A, int start): Returns the substring of A starting from start position till the end of string A. E.g. substr('foobar', 4) results in 'bar'. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr.)
string substr(string A, int start, int len), substring(string A, int start, int len): Returns the substring of A starting from start position with length len. E.g. substr('foobar', 4, 1) results in 'b'. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr.)
string upper(string A), ucase(string A): Returns the string resulting from converting all characters of A to upper case. E.g. upper('fOoBaR') results in 'FOOBAR'.
string lower(string A), lcase(string A): Returns the string resulting from converting all characters of A to lower case. E.g. lower('fOoBaR') results in 'foobar'.
string regexp_extract(string subject, string pattern, int index): Returns the string extracted using the pattern. E.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
string parse_url(string urlString, string partToExtract [, string keyToExtract]): Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. E.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. Also the value of a particular key in QUERY can be extracted by providing the key as the third argument, e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.
string get_json_object(string json_string, string path): Extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It will return null if the input json string is invalid. NOTE: The json path can only have the characters [0-9a-z_], i.e., no upper-case or special characters. Also, the keys *cannot start with numbers.* This is due to restrictions on Hive column names.
int ascii(string str): Returns the numeric value of the first character of str.
string lpad(string str, int len, string pad): Returns str, left-padded with pad to a length of len.
string rpad(string str, int len, string pad): Returns str, right-padded with pad to a length of len.
array split(string str, string pat): Splits str around pat (pat is a regular expression).
int find_in_set(string str, string strList): Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. E.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3.
int locate(string substr, string str[, int pos]): Returns the position of the first occurrence of substr in str after position pos.
int instr(string str, string substr): Returns the position of the first occurrence of substr in str.
map<string,string> str_to_map(text[, delimiter1, delimiter2]): Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2.
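A hedged usage sketch of str_to_map with the default delimiters (src as the usual dummy input table):

```sql
-- Parse 'a=1,b=2,c=3' into a map and look up key 'b'.
SELECT str_to_map('a=1,b=2,c=3')['b'] FROM src LIMIT 1;
```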
array<array<string>> sentences(string str, string lang, string locale): Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' are optional arguments. E.g. sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).
boolean in_file(string str, string filename): Returns true if the string str appears as an entire line in filename.
Misc. Functions
xpath
The following functions are described in LanguageManual XPathUDF:
get_json_object
A limited version of JSONPath is supported:
$ : Root object
. : Child operator
[] : Subscript operator for array
* : Wildcard for []
json
{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
The fields of the json object can be extracted using these queries:
hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy
hive> SELECT get_json_object(src_json.json, '$.store.fruit[0]') FROM src_json;
{"weight":8,"type":"apple"}
hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL
Built-in Aggregate Functions (UDAF)
bigint count(*), count(expr), count(DISTINCT expr[, expr_.]): count(*) - Returns the total number of retrieved rows, including rows containing NULL values; count(expr) - Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
double sum(col), sum(DISTINCT col): Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
double avg(col), avg(DISTINCT col): Returns the average of the elements in the group or the average of the distinct values of the column in the group.
double max(col): Returns the maximum value of the column in the group.
double variance(col), var_pop(col): Returns the variance of a numeric column in the group.
double covar_pop(col1, col2): Returns the population covariance of a pair of numeric columns in the group.
double covar_samp(col1, col2): Returns the sample covariance of a pair of numeric columns in the group.
double percentile(BIGINT col, p): Returns the exact p-th percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
array<double> percentile(BIGINT col, array(p1 [, p2]...)): Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
double percentile_approx(DOUBLE col, p [, B]): Returns an approximate p-th percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
array<double> percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B]): Same as above, but accepts and returns an array of percentile values instead of a single one.
array<struct {'x','y'}> histogram_numeric(col, b): Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights.
explode
explode() takes in an array as an input and outputs the elements of the array as separate rows. UDTF's can be
used in the SELECT expression list and as a part of LATERAL VIEW.
Consider a table named myTable that has a single column (myCol) and two rows:
Array<int> myCol
[1,2,3]
[4,5,6]
Then the query:
SELECT explode(myCol) AS myNewCol FROM myTable;
will produce:
(int) myNewCol
1
2
3
4
5
6
Please see LanguageManual LateralView for an alternative syntax that does not have these limitations.
Array Type explode(array<TYPE> a): For each element in a, explode() generates a row containing that element.
stack(INT n, v_1, v_2, ..., v_k): Breaks up v_1, ..., v_k into n rows. Each row will have k/n columns. n must be constant.
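A small sketch of stack (the values are arbitrary; src is the usual dummy input table):

```sql
-- Produces 2 rows of 2 columns each: ('a', 1) and ('b', 2).
SELECT stack(2, 'a', 1, 'b', 2) AS (col0, col1) FROM src;
```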
json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns
a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve
more than one key from a single JSON string. In any case where a single JSON string would be parsed more
than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As
JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevent, '$.eventid'), get_json_object(a.appevent, '$.eventname') from log a;
should be changed to
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;
parse_url_tuple
The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the
data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the
partToExtract argument, e.g. parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1',
'QUERY:k2') returns a tuple with values of 'v1','v2'. This is more efficient than calling parse_url() multiple times. All
the input parameters and output column types are string.
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b AS host, path, query, query_id LIMIT 1;
This is because you are not able to GROUP BY or SORT BY a column alias on which a function has been
applied; a query such as select f(col) as fc, count(*) from table_name group by fc therefore fails.
There are two workarounds. First, you can reformulate this query with subqueries, which is somewhat
complicated:
select sq.fc,col1,col2,...,colN,count(*)
from (select f(col) as fc,col1,col2,...,colN from table_name) sq
group by sq.fc,col1,col2,...,colN
Or you can make sure not to use a column alias, which is simpler:
select f(col) as fc, count(*) from table_name group by f(col)
Contact Tim Ellis (tellis) at RiotGames dot com if you would like to discuss this in further detail.
XPathUDF
UDFs
Overview
The xpath family of UDFs are wrappers around the Java XPath library javax.xml.xpath provided by the JDK.
The library is based on the XPath 1.0 specification. Please refer to
http://java.sun.com/javase/6/docs/api/javax/xml/xpath/package-summary.html for detailed information on the Java
XPath library.
All functions follow the form: xpath_*(xml_string, xpath_expression_string). The XPath expression
string is compiled and cached. It is reused if the expression in the next input row matches the previous.
Otherwise, it is recompiled. So, the xml string is always parsed for every input row, but the xpath expression is
precompiled and reused for the vast majority of use cases.
For example:
> SELECT xpath ('<a><b>1</b><b>2</b></a>', 'a/*/text()') FROM src LIMIT 1 ;
["1","2"]
Each function returns a specific Hive type given the XPath expression:
The UDFs are schema agnostic - no XML validation is performed. However, malformed xml (e.g.,
<a><b>1</b></aa>) will result in a runtime exception being thrown.
xpath
The xpath() function always returns a hive array of strings. If the expression results in a non-text value (e.g.,
another xml node) the function will return an empty array. There are 2 primary uses for this function: to get a list of
node text values or to get a list of attribute values.
Examples:
Get the list of text values directly under 'a' (there are none, so the result is empty):
> SELECT xpath ('<a><b id="foo">b1</b><b id="bar">b2</b></a>', 'a/text()') FROM src LIMIT 1 ;
[]
Get a list of node text values:
> SELECT xpath ('<a><b id="foo">b1</b><b id="bar">b2</b></a>', 'a/*/text()') FROM src LIMIT 1 ;
["b1","b2"]
Get a list of attribute values:
> SELECT xpath ('<a><b id="foo">b1</b><b id="bar">b2</b></a>', 'a/*/@id') FROM src LIMIT 1 ;
["foo","bar"]
Get a list of node texts for nodes where the 'class' attribute equals 'bb':
> SELECT xpath ('<a><b class="bb">b1</b><b>b2</b><b>b3</b><c class="bb">c1</c><c>c2</c></a>', 'a/*[@class="bb"]/text()') FROM src LIMIT 1 ;
["b1","c1"]
xpath_string
The xpath_string() function returns the text of the first matching node.
Get the text for node 'a/b':
> SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a/b') FROM src LIMIT 1 ;
bb
Get the text for node 'a'. Because 'a' has children nodes with text, the result is a composite of text from the
children:
> SELECT xpath_string ('<a><b>bb</b><c>cc</c></a>', 'a') FROM src LIMIT 1 ;
bbcc
When multiple nodes match, the text of the first match is returned:
> SELECT xpath_string ('<a><b>b1</b><b>b2</b></a>', 'a/b') FROM src LIMIT 1 ;
b1
> SELECT xpath_string ('<a><b>b1</b><b>b2</b></a>', 'a/b[2]') FROM src LIMIT 1 ;
b2
Gets the text from the first node that has an attribute 'id' with value 'b_2':
> SELECT xpath_string ('<a><b>b1</b><b id="b_2">b2</b></a>', 'a/b[@id="b_2"]') FROM src LIMIT 1 ;
b2
xpath_boolean
Returns true if the XPath expression evaluates to true, or if a matching node is found.
Match found:
> SELECT xpath_boolean ('<a><b>b</b></a>', 'a/b') FROM src LIMIT 1 ;
true
No match found:
> SELECT xpath_boolean ('<a><b>b</b></a>', 'a/c') FROM src LIMIT 1 ;
false
Match found:
> SELECT xpath_boolean ('<a><b>b</b></a>', 'a/b = "b"') FROM src LIMIT 1 ;
true
No match found:
> SELECT xpath_boolean ('<a><b>10</b></a>', 'a/b < 10') FROM src LIMIT 1 ;
false
xpath_short, xpath_int, xpath_long
These functions return an integral value from the evaluated XPath expression, or zero if there is no match or the
match is not numeric.
No match:
> SELECT xpath_int ('<a>b</a>', 'a = 10') FROM src LIMIT 1 ;
0
Non-numeric match:
> SELECT xpath_int ('<a>this is not a number</a>', 'a') FROM src LIMIT 1 ;
0
> SELECT xpath_int ('<a>this 2 is not a number</a>', 'a') FROM src LIMIT 1 ;
0
Adding values:
> SELECT xpath_int ('<a><b class="odd">1</b><b class="even">2</b><b class="odd">4</b><c>8</c></a>', 'sum(a/*)') FROM src LIMIT 1 ;
15
Overflow:
> SELECT xpath_int ('<a><b>2000000000</b><c>40000000000</c></a>', 'a/b * a/c') FROM src LIMIT 1 ;
2147483647
xpath_float, xpath_double
These functions return a floating-point value, 0.0 if no match is found, or NaN if the matched value is non-numeric.
No match:
> SELECT xpath_double ('<a>b</a>', 'a = 10') FROM src LIMIT 1 ;
0.0
Non-numeric match:
> SELECT xpath_double ('<a>this is not a number</a>', 'a') FROM src LIMIT 1 ;
NaN
8.0E19
UDAFs
UDTFs
Joins
Join Syntax
Hive supports the following syntax for joining tables:
join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
table_reference:
    table_factor
  | join_table
table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )
join_condition:
    ON equality_expression ( AND equality_expression )*
equality_expression:
    expression = expression
Only equality joins, outer joins, and left semi joins are supported in Hive. Hive does not support join conditions
that are not equality conditions as it is very difficult to express such conditions as a map/reduce job. Also, more
than two tables can be joined in Hive.
Some salient points to consider when writing join queries are as follows:
Only equality conditions are allowed in the ON clause; a non-equality condition such as
SELECT a.val, b.val FROM a JOIN b ON (a.key < b.key)
is NOT allowed
More than 2 tables can be joined in the same query e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key
= b.key2)
is a valid join.
Hive converts joins over multiple tables into a single map/reduce job if for every table the same column is
used in the join clauses e.g.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key
= b.key1)
is converted into a single map/reduce job as only key1 column for b is involved in the join. On the other
hand
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON
(c.key = b.key2)
is converted into two map/reduce jobs because key1 column from b is used in the first join condition and
key2 column from b is used in the second one. The first map/reduce job joins a with b and the results are
then joined with c in the second map/reduce job.
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers
whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for
buffering the rows for a particular value of the join key by organizing the tables such that the largest
tables appear last in the sequence. e.g. in
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key
= b.key1)
all the three tables are joined in a single map/reduce job and the values for a particular value of the key
for tables a and b are buffered in the memory in the reducers. Then for each row retrieved from c, the join
is computed with the buffered rows. Similarly for
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON
(c.key = b.key2)
there are two map/reduce jobs involved in computing the join. The first of these joins a with b and buffers
the values of a while streaming the values of b in the reducers. The second of these jobs buffers
the results of the first join while streaming the values of c through the reducers.
In every map/reduce stage of the join, the table to be streamed can be specified via a hint. e.g. in
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key =
b.key1) JOIN c ON (c.key = b.key1)
all the three tables are joined in a single map/reduce job and the values for a particular value of the key
for tables b and c are buffered in the memory in the reducers. Then for each row retrieved from a, the join
is computed with the buffered rows.
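The buffer-all-but-the-last behavior described above can be sketched outside Hive. This is an illustrative Python model, not Hive internals: for one join key, all rows of every table except the last are held in memory, and rows of the last table are streamed one at a time.

```python
# Illustrative sketch of a reduce-side equi-join: for a single join key,
# all tables except the last are buffered; the last table is streamed.
from itertools import product

def reduce_join(key_groups):
    """key_groups: one list of rows per table, all sharing one join key."""
    buffered = key_groups[:-1]           # held in reducer memory
    for streamed_row in key_groups[-1]:  # streamed: one row at a time
        for combo in product(*buffered):
            yield combo + (streamed_row,)

# Rows of a, b, c for join key 1 -- placing the largest table (c) last
# keeps the buffered side small.
a_rows = [('a1',)]
b_rows = [('b1',), ('b2',)]
c_rows = [('c1',), ('c2',), ('c3',)]
rows = list(reduce_join([a_rows, b_rows, c_rows]))
print(len(rows))  # 6 joined rows (1 * 2 * 3)
```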
LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control over ON clauses for which
there is no match. For example, this query:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
will return a row for every row in a. This output row will be a.val,b.val when there is a b.key that equals
a.key, and the output row will be a.val,NULL when there is no corresponding b.key. Rows from b which
have no corresponding a.key will be dropped. The syntax "FROM a LEFT OUTER JOIN b" must be
written on one line in order to understand how it works--a is to the LEFT of b in this query, and so all rows
from a are kept; a RIGHT OUTER JOIN will keep all rows from b, and a FULL OUTER JOIN will keep all
rows from a and all rows from b. OUTER JOIN semantics should conform to standard SQL specs.
Joins occur BEFORE WHERE CLAUSES. So, if you want to restrict the OUTPUT of a join, a requirement
should be in the WHERE clause, otherwise it should be in the JOIN clause. A big point of confusion for
this issue is partitioned tables:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'
will join a on b, producing a list of a.val and b.val. The WHERE clause, however, can also reference other
columns of a and b that are in the output of the join, and then filter them out. However, whenever a row
from the JOIN has found a key for a and no key for b, all of the columns of b will be NULL, including the
ds column. This is to say, you will filter out all rows of join output for which there was no valid b.key, and
thus you have outsmarted your LEFT OUTER requirement. In other words, the LEFT OUTER part of the
join is irrelevant if you reference any column of b in the WHERE clause. Instead, when OUTER JOINing,
use this syntax:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b
ON (a.key=b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')
...the result is that the output of the join is pre-filtered, and you won't get post-filtering trouble for rows that
have a valid a.key but no matching b.key. The same logic applies to RIGHT and FULL joins.
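The pre-filter versus post-filter difference can be demonstrated with a small Python model (not Hive code; the row values and predicate are made up for illustration): a predicate on b's columns placed in the WHERE clause silently discards the NULL-extended rows that the outer join was supposed to keep.

```python
# Illustrative sketch: filtering b's columns in WHERE defeats a LEFT OUTER
# JOIN, while the same predicate applied as part of the join does not.
a = [{'key': 1, 'ds': 'd1'}, {'key': 2, 'ds': 'd1'}]
b = [{'key': 1, 'ds': 'd1'}]  # no row for key 2

def left_outer(a_rows, b_rows, b_pred=lambda r: True):
    out = []
    for ra in a_rows:
        matches = [rb for rb in b_rows
                   if rb['key'] == ra['key'] and b_pred(rb)]
        if matches:
            out.extend((ra, rb) for rb in matches)
        else:
            out.append((ra, None))  # NULL-extended row for unmatched a
    return out

# Predicate in the join (pre-filter): unmatched a-rows survive with NULL b.
joined = left_outer(a, b, b_pred=lambda r: r['ds'] == 'd1')
# Predicate in WHERE (post-filter): NULL b rows fail the predicate and
# vanish, silently turning the outer join into an inner join.
post = [(ra, rb) for ra, rb in left_outer(a, b)
        if rb is not None and rb['ds'] == 'd1']
print(len(joined), len(post))  # 2 1
```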
Joins are NOT commutative! Joins are left-associative regardless of whether they are LEFT or RIGHT
joins.
SELECT a.val1, a.val2, b.val, c.val
FROM a
JOIN b ON (a.key = b.key)
LEFT OUTER JOIN c ON (a.key = c.key)
...first joins a on b, throwing away everything in a or b that does not have a corresponding key in the other
table. The reduced table is then joined on c. This provides unintuitive results if there is a key that exists in
both a and c but not b: The whole row (including a.val1, a.val2, and a.key) is dropped in the "a JOIN b"
step because it is not in b. The result does not have a.key in it, so when it is LEFT OUTER JOINed with c,
c.val does not make it in because there is no c.key that matches an a.key (because that row from a was
removed). Similarly, if this were a RIGHT OUTER JOIN (instead of LEFT), we would end up with an even
weirder effect: NULL, NULL, NULL, c.val, because even though we specified a.key=c.key as the join key,
we dropped all rows of a that did not match the first JOIN.
To achieve the more intuitive effect, we should instead do FROM c LEFT OUTER JOIN a ON (c.key =
a.key) LEFT OUTER JOIN b ON (c.key = b.key).
LEFT SEMI JOIN implements the correlated IN/EXISTS subquery semantics in an efficient way. Since
Hive currently does not support IN/EXISTS subqueries, you can rewrite your queries using LEFT SEMI
JOIN. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be
referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.
SELECT a.key, a.value
FROM a
WHERE a.key in
(SELECT b.key
FROM b);
can be rewritten to:
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key)
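The LEFT SEMI JOIN semantics reduce to a membership test on the right table's join keys. A minimal Python sketch (the sample rows are made up for illustration):

```python
# Illustrative sketch: LEFT SEMI JOIN keeps each left row at most once when
# any matching key exists on the right -- i.e. IN/EXISTS semantics.
a = [(1, 'x'), (2, 'y'), (3, 'z')]
b = [(1, 'p'), (3, 'q'), (3, 'r')]  # only the join keys of b matter

b_keys = {key for key, _ in b}
semi = [row for row in a if row[0] in b_keys]
print(semi)  # [(1, 'x'), (3, 'z')]
```

Note that even though key 3 appears twice in b, the left row (3, 'z') is emitted only once, unlike a regular inner join.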
If all but one of the tables being joined are small, the join can be performed as a map only job. The query
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
does not need a reducer. For every mapper of A, B is read completely. The restriction is that a
FULL/RIGHT OUTER JOIN b cannot be performed.
If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a
multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A
has 4 buckets and table B has 4 buckets, the following join
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. Instead of fetching B completely for each mapper of A, only the required
buckets are fetched. For the query above, the mapper processing bucket 1 for A will only fetch bucket 1
of B. It is not the default behavior, and is governed by the following parameter
set hive.optimize.bucketmapjoin = true
If the tables being joined are sorted and bucketized on the join columns, and they have the same number
of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at
the mapper. If both A and B have 4 buckets,
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key
can be done on the mapper only. The mapper for the bucket for A will traverse the corresponding bucket
for B. This is not the default behavior, and the following parameters need to be set:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
Lateral View
Description
Lateral view is used in conjunction with user-defined table generating functions such as explode(). As mentioned
in Built-in Table-Generating Functions, a UDTF generates one or more output rows for each input row. A lateral
view first applies the UDTF to each row of base table and then joins resulting output rows to the input rows to
form a virtual table having the supplied table alias.
Important note
Currently, lateral view does not support the predicate push-down optimization. If you use a WHERE clause, your
query may not compile. Look for the fix to come out at https://issues.apache.org/jira/browse/HIVE-1056
Until then, try adding set hive.optimize.ppd=false; before your query.
Example
Consider the following base table named pageAds. It has two columns: pageid (name of the page) and adid_list
(an array of ads appearing on the page):
"front_page" [1, 2, 3]
"contact_page" [3, 4, 5]
and the user would like to count the total number of times an ad appears across all pages.
A lateral view with explode() can be used to convert adid_list into separate rows using the query:
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;
The resulting rows are:
"front_page" 1
"front_page" 2
"front_page" 3
"contact_page" 3
"contact_page" 4
"contact_page" 5
Then in order to count the number of times a particular ad appears, count/group by can be used:
SELECT adid, count(1)
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid
GROUP BY adid;
1 1
2 1
3 2
4 1
5 1
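The explode-and-count pipeline above can be reproduced with a small Python sketch (illustrative only, not Hive code): flatten each (pageid, adid_list) pair into one row per ad, then count per adid.

```python
# Illustrative sketch of LATERAL VIEW explode() followed by GROUP BY count.
from collections import Counter

page_ads = [("front_page", [1, 2, 3]), ("contact_page", [3, 4, 5])]

# LATERAL VIEW explode(adid_list): one (pageid, adid) row per array element.
exploded = [(pageid, adid)
            for pageid, adid_list in page_ads
            for adid in adid_list]

# SELECT adid, count(1) ... GROUP BY adid
counts = Counter(adid for _, adid in exploded)
print(sorted(counts.items()))  # [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1)]
```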
LATERAL VIEW clauses are applied in the order that they appear. For example, consider a base table named
baseTable with two array columns, col1 (an array of ints) and col2 (an array of strings):
[1, 2] ["a", "b", "c"]
[3, 4] ["d", "e", "f"]
The query:
SELECT myCol1, col2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1;
Will produce:
1 ["a", "b", "c"]
2 ["a", "b", "c"]
3 ["d", "e", "f"]
4 ["d", "e", "f"]
A query that adds an additional LATERAL VIEW:
SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;
Will produce:
1 "a"
1 "b"
1 "c"
2 "a"
2 "b"
2 "c"
3 "d"
3 "e"
3 "f"
4 "d"
4 "e"
4 "f"
Union Syntax
select_statement UNION ALL select_statement UNION ALL select_statement ...
UNION is used to combine the result from multiple SELECT statements into a single result set. We currently only
support UNION ALL (bag union), i.e. duplicates are not eliminated. The number and names of the columns returned
by each select_statement have to be the same. Otherwise, a schema error is thrown.
If some additional processing has to be done on the result of the UNION, the entire statement expression can be
embedded in a FROM clause like below:
SELECT *
FROM (
select_statement
UNION ALL
select_statement
) unionResult
For example, if we suppose there are two different tables that track which user has published a video and which
user has published a comment, the following query joins the results of a union all with the user table to create a
single annotated stream for all the video publishing and comment publishing events:
INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid, av.date AS date
    FROM action_video av
    UNION ALL
    SELECT ac.uid AS uid, ac.date AS date
    FROM action_comment ac
) actions JOIN users u ON (u.id = actions.uid)
Subqueries
Hive supports subqueries only in the FROM clause. The subquery has to be given a name because every table in
a FROM clause must have a name. Columns in the subquery select list must have unique names. The columns in
the subquery select list are available in the outer query just like columns of a table. The subquery can also be a
query expression with UNION. Hive supports arbitrary levels of sub-queries.
SELECT col
FROM (
  SELECT a+b AS col
  FROM t1
) t2
Example with a UNION ALL:
SELECT t3.col
FROM (
  SELECT a+b AS col
  FROM t1
  UNION ALL
  SELECT c+d AS col
  FROM t2
) t3
Sampling Syntax
The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table.
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
The TABLESAMPLE clause can be added to any table in the FROM clause. The buckets are numbered starting
from 1. colname indicates the column on which to sample each row in the table. colname can be one of the non-
partition columns in the table or rand() indicating sampling on the entire row instead of an individual column. The
rows of the table are 'bucketed' on the colname randomly into y buckets numbered 1 through y. Rows which
belong to bucket x are returned.
In the following example we choose the 3rd bucket out of the 32 buckets of the table source. 's' is the table alias.
SELECT *
FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
Example:
So in the above example, if table 'source' was created with 'CLUSTERED BY id INTO 32 BUCKETS', then
TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)
would pick out the 3rd and 19th clusters as each bucket would be composed of (32/16)=2 clusters.
TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)
would pick out half of the 3rd cluster as each bucket would be composed of (32/64)=1/2 of a cluster.
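The bucket-to-cluster arithmetic above can be checked with a few lines of Python (illustrative only; the fractional case where y exceeds the cluster count is not modeled here):

```python
# Illustrative arithmetic: which of a table's clusters a
# TABLESAMPLE(BUCKET x OUT OF y) clause reads, when y divides the
# cluster count evenly.
def clusters_for_bucket(x, y, num_clusters=32):
    """Each sample bucket spans num_clusters / y clusters; bucket x maps
    to clusters x, x + y, x + 2y, ... up to num_clusters."""
    return list(range(x, num_clusters + 1, y))

print(clusters_for_bucket(3, 16))  # [3, 19] -> 2 clusters per bucket
print(clusters_for_bucket(3, 32))  # [3]     -> exactly the 3rd cluster
```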
Block Sampling
It is a feature that is still on trunk and is not yet in any release version.
block_sample: TABLESAMPLE (n PERCENT)
This will allow Hive to pick up at least n% data size (notice it doesn't necessarily mean number of rows) as inputs.
Only CombineHiveInputFormat is supported and some special compression formats are not handled. If we fail to
sample it, the input of MapReduce job will be the whole table/partition. We do it in HDFS block level so that the
sampling granularity is block size. For example, if block size is 256MB, even if n% of input size is only 100MB,
you get 256MB of data.
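The block-granularity rounding in that example can be expressed as a small calculation (an illustrative sketch, not Hive code; the helper name is made up):

```python
# Illustrative arithmetic: HDFS-block-granularity sampling means the bytes
# actually read are rounded up to whole blocks.
import math

def sampled_bytes(target_bytes, block_bytes):
    """Round the requested sample size up to at least one whole block."""
    blocks = max(1, math.ceil(target_bytes / block_bytes))
    return blocks * block_bytes

MB = 1024 * 1024
# n% of the input works out to 100MB, but the block size is 256MB:
# a full 256MB block is still read.
print(sampled_bytes(100 * MB, 256 * MB) // MB)  # 256
```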
In the following example, 0.1% or more of the input size will be used for the query.
SELECT *
FROM source TABLESAMPLE(0.1 PERCENT) s;
Sometimes you may want to sample the same data with different blocks; in that case, you can change the seed number:
set hive.sample.seednumber=<INTEGER>;
EXPLAIN Syntax
Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is
as follows:
EXPLAIN [EXTENDED] query
The use of EXTENDED in the EXPLAIN statement produces extra information about the operators in the plan.
This is typically physical information like file names.
A Hive query gets converted into a sequence (more accurately, a directed acyclic graph) of stages. These stages may
be map/reduce stages or they may even be stages that do metastore or file system operations like move and
rename. The explain output comprises three parts:
The Abstract Syntax Tree for the query
The dependencies between the different stages of the plan
The plans of each stage
For example, the dependency section may look like:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 depends on stages: Stage-2
This shows that Stage-1 is the root stage, Stage-2 is executed after Stage-1 is done and Stage-0 is
executed after Stage-2 is done.
The plans of each Stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
src
key expressions:
expr: key
type: string
sort order: +
expr: rand()
type: double
tag: -1
value expressions:
expr: substr(value, 4)
type: string
Group By Operator
aggregations:
expr: sum(UDFToDouble(VALUE.0))
keys:
expr: KEY.0
type: string
mode: partial1
compressed: false
table:
name: binary_table
Stage: Stage-2
Map Reduce
/tmp/hive-zshao/67494501/106593589.10001
key expressions:
expr: 0
type: string
sort order: +
expr: 0
type: string
tag: -1
value expressions:
expr: 1
type: double
Group By Operator
aggregations:
expr: sum(VALUE.0)
keys:
expr: KEY.0
type: string
mode: final
Select Operator
expressions:
expr: 0
type: string
expr: 1
type: double
Select Operator
expressions:
expr: UDFToInteger(0)
type: int
expr: 1
type: double
compressed: false
table:
output format:
org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
name: dest_g1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
output format:
org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
name: dest_g1
In this example there are 2 map/reduce stages (Stage-1 and Stage-2) and 1 File System related stage
(Stage-0). Stage-0 basically moves the results from a temporary directory to the directory corresponding
to the table dest_g1.
A mapping from table alias to Map Operator Tree - This mapping tells the mappers which operator tree to
call in order to process the rows from a particular table or result of a previous map/reduce stage. In
Stage-1 in the above example, the rows from src table are processed by the operator tree rooted at a
Reduce Output Operator. Similarly, in Stage-2 the rows of the results of Stage-1 are processed by
another operator tree rooted at another Reduce Output Operator. Each of these Reduce Output
Operators partitions the data to the reducers according to the criteria shown in the metadata.
A Reduce Operator Tree - This is the operator tree which processes all the rows on the reducer of the
map/reduce job. In Stage-1 for example, the Reducer Operator Tree is carrying out a partial aggregation,
whereas the Reducer Operator Tree in Stage-2 computes the final aggregation from the partial
aggregates computed in Stage-1.
Virtual Columns
Hive 0.8.0 provides support for two virtual columns:
One is INPUT__FILE__NAME, which is the input file's name for a mapper task.
The other is BLOCK__OFFSET__INSIDE__FILE, which is the current global file position.
For a block compressed file, it is the current block's file offset, which is the current block's first byte's file offset.
Simple Examples
select INPUT__FILE__NAME, key, BLOCK__OFFSET__INSIDE__FILE from src;
Locks
Right now hive can support the following locks:
SHARED
EXCLUSIVE
Import/Export
The IMPORT and EXPORT commands were added in Hive 0.8.0.
Import Syntax
IMPORT [[EXTERNAL] TABLE tableOrPartitionReference] FROM path
Export Syntax
EXPORT TABLE tableOrPartitionReference TO path
Configuration Properties
Query Execution
mapred.reduce.tasks
Default Value: -1
Added In: 0.1
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts.
Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default
value. By setting this property to -1, Hive will automatically figure out the number of reducers.
hive.exec.reducers.bytes.per.reducer
Default Value: 1000000000
Added In:
Size per reducer. The default is 1G, i.e. if the input size is 10G, it will use 10 reducers.
hive.exec.reducers.max
Default Value: 999
Added In:
Maximum number of reducers that will be used. If the value specified in the configuration parameter mapred.reduce.tasks
is negative, Hive will use this as the maximum number of reducers when automatically determining the number of reducers.
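The interaction of these two properties can be sketched as arithmetic (an illustrative model, not Hive source; the function name is made up):

```python
# Illustrative sketch of automatic reducer-count estimation when
# mapred.reduce.tasks = -1: one reducer per bytes-per-reducer chunk,
# capped by the configured maximum.
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,  # default 1G
                      max_reducers=999):                # default cap
    return min(max_reducers, max(1, math.ceil(input_bytes / bytes_per_reducer)))

print(estimate_reducers(10 * 10**9))  # 10 reducers for ~10GB of input
print(estimate_reducers(5 * 10**12))  # capped at 999 for ~5TB
```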
hive.exec.scratchdir
Default Value: /tmp/hive-${user.name}
Added In:
hive.default.fileformat
Default Value: TextFile
Added In:
Default file format for CREATE TABLE statement. Options are TextFile and SequenceFile. Users can explicitly
say CREATE TABLE ... STORED AS TEXTFILE|SEQUENCEFILE to override.
hive.fileformat.check
Default Value: true
Added In:
hive.map.aggr
Default Value: true
Added In:
hive.groupby.skewindata
Default Value: false
Added In:
hive.groupby.mapaggr.checkinterval
Default Value: 100000
Added In:
Number of rows after which the size of the in-memory grouping keys/aggregation hash table is checked, to decide
whether map-side aggregation should remain enabled.
hive.mapred.local.mem
Default Value: 0
Added In:
hive.mapjoin.followby.map.aggr.hash.percentmemory
Default Value: 0.3
Added In:
Portion of total memory to be used by the map-side group aggregation hash table, when this group by is followed by
a map join.
hive.map.aggr.hash.force.flush.memory.threshold
Default Value: 0.9
Added In:
The maximum memory to be used by the map-side group aggregation hash table. If the memory usage is higher than
this number, flushing of the data is forced.
hive.map.aggr.hash.percentmemory
Default Value: 0.5
Added In:
hive.map.aggr.hash.min.reduction
Default Value: 0.5
Added In:
Hash aggregation will be turned off if the ratio between hash table size and input rows is bigger than this number.
Set to 1 to make sure hash aggregation is never turned off.
hive.optimize.groupby
Default Value: true
Added In:
hive.multigroupby.singlemr
Default Value: false
Added In:
Whether to optimize multi group by query to generate single M/R job plan. If the multi group by query has
common group by keys, it will be optimized to generate single M/R job.
hive.optimize.cp
Default Value: true
Added In:
hive.optimize.index.filter
Default Value: false
Added In:
hive.optimize.index.groupby
Default Value: false
Added In:
Whether to enable optimization of group-by queries using Aggregate indexes.
hive.optimize.ppd
Default Value: true
Added In:
hive.optimize.ppd.storage
Default Value: true
Added In:
Whether to push predicates down into storage handlers. Ignored when hive.optimize.ppd is false.
hive.ppd.recognizetransivity
Default Value: true
Added In:
hive.join.emit.interval
Default Value: 1000
Added In:
How many rows in the right-most join operand Hive should buffer before
emitting the join result.
hive.join.cache.size
Default Value: 25000
Added In:
How many rows in the joining tables (except the streaming table)
should be cached in memory.
hive.mapjoin.bucket.cache.size
Default Value: 100
Added In:
How many values for each key in the map-joined table should be cached
in memory.
hive.mapjoin.cache.numrows
Default Value: 25000
Added In:
hive.optimize.skewjoin
Default Value: false
Added In:
Determines whether we get a skew key in a join. If we see more than the specified number of rows with the same key
in the join operator, we consider the key a skew join key.
hive.skewjoin.mapjoin.map.tasks
Default Value: 10000
Added In:
Determines the number of map tasks used in the follow-up map join job for a skew join. It should be used together
with hive.skewjoin.mapjoin.min.split to perform fine-grained control.
hive.skewjoin.mapjoin.min.split
Default Value: 33554432
Added In:
Determines the maximum number of map tasks used in the follow-up map join job for a skew join by specifying the
minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform fine-grained
control.
hive.mapred.mode
Default Value: nonstrict
Added In:
The mode in which the hive operations are being performed. In strict mode, some risky queries are not allowed to
run.
hive.exec.script.maxerrsize
Default Value: 100000
Added In:
Maximum number of bytes a script is allowed to emit to standard error (per map-reduce task). This prevents
runaway scripts from filling logs partitions to capacity.
hive.exec.script.allow.partial.consumption
Default Value: false
Added In:
When enabled, this option allows a user script to exit successfully without consuming all the data from the
standard input.
hive.script.operator.id.env.var
Default Value: HIVE_SCRIPT_OPERATOR_ID
Added In:
Name of the environment variable that holds the unique script operator ID in the user's transform function (the
custom mapper/reducer that the user has specified in the query).
hive.exec.compress.output
Default Value: false
Added In:
This controls whether the final outputs of a query (to a local/HDFS file or a Hive table) are compressed. The
compression codec and other options are determined from the Hadoop config variables mapred.output.compress*.
hive.exec.compress.intermediate
Default Value: false
Added In:
This controls whether intermediate files produced by hive between multiple map-reduce jobs are compressed.
The compression codec and other options are determined from hadoop config variables
mapred.output.compress*
hive.exec.parallel
Default Value: false
Added In:
hive.exec.parallel.thread.number
Default Value: 8
Added In:
hive.exec.rowoffset
Default Value: false
Added In:
hive.task.progress
Default Value: false
Added In:
Whether Hive should periodically update task progress counters during execution. Enabling this allows task
progress to be monitored more closely in the job tracker, but may impose a performance penalty. This flag is
automatically set to true for jobs with hive.exec.dynamic.partition set to true.
hive.exec.pre.hooks
Default Value: (empty)
Added In:
Comma-separated list of pre-execution hooks to be invoked for each statement. A pre-execution hook is specified
as the name of a Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext
interface.
hive.exec.post.hooks
Default Value: (empty)
Added In:
Comma-separated list of post-execution hooks to be invoked for each statement. A post-execution hook is
specified as the name of a Java class which implements the
org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
hive.exec.failure.hooks
Default Value: (empty)
Added In:
Comma-separated list of on-failure hooks to be invoked for each statement. An on-failure hook is specified as the
name of Java class which implements the org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext interface.
hive.merge.mapfiles
Default Value: true
Added In:
hive.merge.mapredfiles
Default Value: false
Added In:
hive.mergejob.maponly
Default Value: true
Added In:
hive.merge.size.per.task
Default Value: 256000000
Added In:
hive.merge.smallfiles.avgsize
Default Value: 16000000
Added In:
When the average output file size of a job is less than this number, Hive will start an additional map-reduce job to
merge the output files into bigger files. This is only done for map-only jobs if hive.merge.mapfiles is true, and for
map-reduce jobs if hive.merge.mapredfiles is true.
hive.mapjoin.smalltable.filesize
Default Value: 25000000
Added In:
The threshold for the input file size of the small tables; if the file size is smaller than this threshold, Hive will try to
convert the common join into a map join.
hive.mapjoin.localtask.max.memory.usage
Default Value: 0.90
Added In:
This number specifies how much memory the local task can use to hold the key/value pairs in the in-memory hash
table. If the local task's memory usage exceeds this number, the local task will abort itself, which means the data of
the small table is too large to be held in memory.
hive.mapjoin.followby.gby.localtask.max.memory.usage
Default Value: 0.55
Added In:
This number specifies how much memory the local task can use to hold the key/value pairs in the in-memory hash
table when this map join is followed by a group by. If the local task's memory usage exceeds this number, the local
task will abort itself, which means the data of the small table is too large to be held in memory.
hive.mapjoin.check.memory.rows
Default Value: 100000
Added In:
Specifies after how many processed rows the memory usage should be checked.
hive.heartbeat.interval
Default Value: 1000
Added In:
Send a heartbeat after this interval - used by mapjoin and filter operators.
hive.auto.convert.join
Default Value: false
Added In:
Whether Hive enables the optimization of converting a common join into a mapjoin based on the input file size.
hive.script.auto.progress
Default Value: false
Added In:
Whether the Hive Transform/Map/Reduce Clause should automatically send progress information to the TaskTracker to
avoid the task getting killed because of inactivity. Hive sends progress information when the script is outputting to
stderr. This option removes the need to periodically produce stderr messages, but users should be cautious
because this may prevent the TaskTracker from killing scripts that are stuck in infinite loops.
hive.script.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Added In:
The default SerDe for transmitting input data to and reading output data from the user scripts.
hive.script.recordreader
Default Value: org.apache.hadoop.hive.ql.exec.TextRecordReader
Added In:
The default record reader for reading data from the user scripts.
hive.script.recordwriter
Default Value: org.apache.hadoop.hive.ql.exec.TextRecordWriter
Added In:
The default record writer for writing data to the user scripts.
hive.input.format
Default Value: org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
Added In:
The default input format. Set this to HiveInputFormat if you encounter problems with CombineHiveInputFormat.
hive.udtf.auto.progress
Default Value: false
Added In:
Whether Hive should automatically send progress information to the TaskTracker when using UDTFs to prevent the
task getting killed because of inactivity. Users should be cautious because this may prevent the TaskTracker from
killing tasks with infinite loops.
hive.mapred.reduce.tasks.speculative.execution
Default Value: true
Added In:
hive.exec.counters.pull.interval
Default Value: 1000
Added In:
The interval with which to poll the JobTracker for the counters of the running job. The smaller it is, the more load
there will be on the JobTracker; the higher it is, the less granular the data caught will be.
hive.enforce.bucketing
Default Value: false
Added In:
Whether bucketing is enforced. If true, while inserting into the table, bucketing is enforced.
hive.enforce.sorting
Default Value: false
Added In:
Whether sorting is enforced. If true, while inserting into the table, sorting is enforced.
hive.optimize.reducededuplication
Default Value: true
Added In:
Remove extra map-reduce jobs if the data is already clustered by the same key which needs to be used again.
This should always be set to true. Since it is a new feature, it has been made configurable.
hive.exec.dynamic.partition
Default Value: false
Added In:
hive.exec.dynamic.partition.mode
Default Value: strict
Added In:
In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all
partitions.
hive.exec.max.dynamic.partitions
Default Value: 1000
Added In:
hive.exec.max.dynamic.partitions.pernode
Default Value: 100
Added In:
hive.exec.max.created.files
Default Value: 100000
Added In:
hive.exec.default.partition.name
Default Value: __HIVE_DEFAULT_PARTITION__
Added In:
The default partition name in case the dynamic partition column value is null/empty string or any other value that
cannot be escaped. This value must not contain any special character used in HDFS URIs (e.g., ':', '%', '/' etc.). The
user has to be aware that the dynamic partition value should not contain this value, to avoid confusion.
hive.fetch.output.serde
Default Value: org.apache.hadoop.hive.serde2.DelimitedJSONSerDe
Added In:
hive.exec.mode.local.auto
Default Value: false
Added In:
Lets Hive determine whether to run in local mode automatically.
hive.exec.drop.ignorenonexistent
Default Value: true
Added In:
hive.exec.show.job.failure.debug.info
Default Value: true
Added In:
If a job fails, whether to provide a link in the CLI to the task with the most failures, along with debugging hints if
applicable.
hive.auto.progress.timeout
Default Value: 0
Added In:
How long to run the autoprogressor for the script/UDTF operators (in seconds). Set to 0 to run it forever.
hive.table.parameters.default
Default Value: (empty)
Added In:
hive.variable.substitute
Default Value: true
Added In:
This enables substitution using syntax like ${var} ${system:var} and ${env:var}.
hive.error.on.empty.partition
Default Value: false
Added In:
hive.exim.uri.scheme.whitelist
Default Value: hdfs,pfile
Added In:
A comma separated list of acceptable URI schemes for import and export.
hive.limit.row.max.size
Default Value: 100000
Added In:
When trying a smaller subset of data for simple LIMIT, the minimum size (in bytes) that each row is assumed to
have.
hive.limit.optimize.limit.file
Default Value: 10
Added In:
When trying a smaller subset of data for simple LIMIT, maximum number of files we can sample.
hive.limit.optimize.enable
Default Value: false
Added In:
Whether to enable the optimization of trying a smaller subset of data for simple LIMIT first.
hive.limit.optimize.fetch.max
Default Value: 50000
Added In:
Maximum number of rows allowed for a smaller subset of data for simple LIMIT, if it is a fetch query. Insert
queries are not restricted by this limit.
hive.rework.mapredwork
Default Value: false
Added In:
Whether to rework the mapred work or not. This was first introduced by SymlinkTextInputFormat to replace symlink
files with real paths at compile time.
hive.sample.seednumber
Default Value: 0
Added In:
A number used for percentage sampling. By changing this number, the user will change the subsets of data sampled.
hive.io.exception.handlers
Default Value: (empty)
Added In:
A list of io exception handler class names. This is used to construct a list of exception handlers to handle
exceptions thrown by record readers
hive.autogen.columnalias.prefix.label
Default Value: _c
Added In:
String used as a prefix when auto generating column alias. By default the prefix label will be appended with a
column position number to form the column alias. Auto generation would happen if an aggregate function is used
in a select clause without an explicit alias.
hive.autogen.columnalias.prefix.includefuncname
Default Value: false
Added In:
Whether to include function name in the column alias auto generated by hive.
hive.exec.perf.logger
Default Value: org.apache.hadoop.hive.ql.log.PerfLogger
Added In:
The class responsible for logging client-side performance metrics. Must be a subclass of org.apache.hadoop.hive.ql.log.PerfLogger.
hive.start.cleanup.scratchdir
Default Value: false
Added In:
hive.output.file.extension
Default Value: (empty)
Added In:
String used as a file extension for output files. If not set, defaults to the codec extension for text files (e.g. ".gz"),
or no extension otherwise.
hive.insert.into.multilevel.dirs
Default Value: false
Added In:
Whether to insert into multilevel directories, as in "insert directory '/HIVEFT25686/chinna/' from table".
hive.files.umask.value
Default Value: 0002
Added In:
The dfs.umask value for the hive created folders.
MetaStore
hive.metastore.local
Default Value: true
Added In:
Controls whether to connect to a remote metastore server or open a new metastore server in the Hive Client JVM.
javax.jdo.option.ConnectionURL
Default Value: jdbc:derby:;databaseName=metastore_db;create=true
Added In:
javax.jdo.option.ConnectionDriverName
Default Value: org.apache.derby.jdbc.EmbeddedDriver
Added In:
javax.jdo.option.DetachAllOnCommit
Default Value: true
Added In:
Detaches all objects from session so that they can be used after transaction is committed.
javax.jdo.option.NonTransactionalRead
Default Value: true
Added In:
javax.jdo.option.ConnectionUserName
Default Value: APP
Added In:
javax.jdo.option.ConnectionPassword
Default Value: mine
Added In:
javax.jdo.option.Multithreaded
Default Value: true
Added In:
Set this to true if multiple threads access metastore through JDO concurrently.
datanucleus.connectionPoolingType
Default Value: DBCP
Added In:
datanucleus.validateTables
Default Value: false
Added In:
Validates existing schema against code. Turn this on if you want to verify existing schema
datanucleus.validateColumns
Default Value: false
Added In:
Validates existing schema against code. Turn this on if you want to verify existing schema.
datanucleus.validateConstraints
Default Value: false
Added In:
Validates existing schema against code. Turn this on if you want to verify existing schema.
datanucleus.storeManagerType
Default Value: rdbms
Added In:
datanucleus.autoCreateSchema
Default Value: true
Added In:
Creates the necessary schema on startup if one doesn't exist. Set this to false after creating it once.
datanucleus.autoStartMechanismMode
Default Value: checked
Added In:
datanucleus.transactionIsolation
Default Value: read-committed
Added In:
datanucleus.cache.level2
Default Value: false
Added In:
Use a level 2 cache. Turn this off if metadata is changed independently of hive metastore server
datanucleus.cache.level2.type
Default Value: SOFT
Added In:
datanucleus.identifierFactory
Default Value: datanucleus
Added In:
Name of the identifier factory to use when generating table/column names etc. 'datanucleus' is used for backward
compatibility.
datanucleus.plugin.pluginRegistryBundleCheck
Default Value: LOG
Added In:
Defines what happens when plugin bundles are found and are duplicated [EXCEPTION].
hive.metastore.warehouse.dir
Default Value: /user/hive/warehouse
Added In:
hive.metastore.execute.setugi
Default Value: false
Added In:
In unsecure mode, setting this property to true will cause the metastore to execute DFS operations using the client's reported user and group permissions. Note that this property must be set on both the client and server sides. Further note that it is best effort: if the client sets it to true and the server sets it to false, the client setting will be ignored.
hive.metastore.event.listeners
Default Value: (empty)
Added In:
hive.metastore.partition.inherit.table.properties
Default Value: (empty)
Added In:
List of comma-separated keys occurring in table properties which will be inherited by newly created partitions. * implies all keys will be inherited.
hive.metastore.end.function.listeners
Default Value: (empty)
Added In:
hive.metastore.event.expiry.duration
Default Value: 0
Added In:
Duration after which events expire from events table (in seconds).
hive.metastore.event.clean.freq
Default Value: 0
Added In:
hive.metastore.client.connect.retry.delay
Default Value: 1
Added In:
hive.metastore.client.socket.timeout
Default Value: 20
Added In:
hive.metastore.rawstore.impl
Default Value: org.apache.hadoop.hive.metastore.ObjectStore
Added In:
Name of the class that implements the org.apache.hadoop.hive.metastore.rawstore interface. This class is used to store and retrieve raw metadata objects such as tables and databases.
hive.metastore.batch.retrieve.max
Default Value: 300
Added In:
Maximum number of objects (tables/partitions) that can be retrieved from the metastore in one batch. The higher the number, the fewer round trips are needed to the Hive metastore server, but it may also cause higher memory requirements on the client side.
hive.metastore.ds.connection.url.hook
Default Value: (empty)
Added In:
Name of the hook to use for retrieving the JDO connection URL. If empty, the value in javax.jdo.option.ConnectionURL is used.
hive.metastore.ds.retry.attempts
Default Value: 1
Added In:
The number of times to retry a metastore call if there is a connection error.
hive.metastore.ds.retry.interval
Default Value: 1000
Added In:
hive.metastore.server.max.threads
Default Value: 100000
Added In:
hive.metastore.server.tcp.keepalive
Default Value: true
Added In:
Whether to enable TCP keepalive for the metastore server. Keepalive will prevent accumulation of half-open
connections.
hive.metastore.sasl.enabled
Default Value: false
Added In:
If true, the metastore thrift interface will be secured with SASL. Clients must authenticate with Kerberos.
hive.metastore.kerberos.keytab.file
Default Value: (empty)
Added In:
The path to the Kerberos Keytab file containing the metastore thrift server's service principal.
hive.metastore.kerberos.principal
Default Value: hive-metastore/_HOST@EXAMPLE.COM
Added In:
The service principal for the metastore thrift server. The special string _HOST will be replaced automatically with
the correct host name.
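A hive-site.xml sketch that ties these three properties together, assuming a hypothetical keytab path (the principal uses the _HOST placeholder described above):

```xml
<property>
  <name>hive.metastore.sasl.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hive.metastore.kerberos.keytab.file</name>
  <!-- hypothetical path; point this at the metastore server's keytab -->
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.metastore.kerberos.principal</name>
  <value>hive-metastore/_HOST@EXAMPLE.COM</value>
</property>
```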
hive.metastore.cache.pinobjtypes
Default Value:
Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order
Added In:
List of comma separated metastore object types that should be pinned in the cache.
hive.metastore.authorization.storage.checks
Default Value: false
Added In:
Should the metastore do authorization checks against the underlying storage for operations like drop-partition
(disallow the drop-partition if the user in question doesn't have permissions to delete the corresponding directory
on the storage).
Indexing
hive.index.compact.file.ignore.hdfs
Default Value: false
Added In:
If true, the HDFS location stored in the index file will be ignored at runtime. If the data got moved or the name of the cluster changed, the index data should still be usable.
hive.optimize.index.filter.compact.minsize
Default Value: 5368709120
Added In:
Minimum size (in bytes) of the inputs on which a compact index is automatically used.
hive.optimize.index.filter.compact.maxsize
Default Value: -1
Added In:
Maximum size (in bytes) of the inputs on which a compact index is automatically used. A negative number is
equivalent to infinity.
hive.index.compact.query.max.size
Default Value: 10737418240
Added In:
The maximum number of bytes that a query using the compact index can read. Negative value is equivalent to
infinity.
hive.index.compact.query.max.entries
Default Value: 10000000
Added In:
The maximum number of index entries to read during a query that uses the compact index. Negative value is
equivalent to infinity.
hive.index.compact.binary.search
Default Value: true
Added In:
Whether or not to use a binary search to find the entries in an index table that match the filter, where possible.
hive.exec.concatenate.check.index
Default Value: true
Added In:
If set to true, Hive will throw an error when doing ALTER TABLE tbl_name [partSpec] CONCATENATE on a table/partition that has indexes on it. The reason to set this to true is that it helps the user avoid handling all the index drop, recreation, and rebuild work. This is very helpful for tables with thousands of partitions.
Statistics
hive.stats.dbclass
Default Value: jdbc:derby
Added In:
hive.stats.autogather
Default Value: true
Added In:
hive.stats.jdbcdriver
Default Value: org.apache.derby.jdbc.EmbeddedDriver
Added In:
The JDBC driver for the database that stores temporary Hive statistics.
hive.stats.dbconnectionstring
Default Value: jdbc:derby:;databaseName=TempStatsStore;create=true
Added In:
The default connection string for the database that stores temporary hive statistics.
hive.stats.default.publisher
Default Value: (empty)
Added In:
The Java class (implementing the StatsPublisher interface) that is used by default if hive.stats.dbclass is not
JDBC or HBase.
hive.stats.default.aggregator
Default Value: (empty)
Added In:
The Java class (implementing the StatsAggregator interface) that is used by default if hive.stats.dbclass is not
JDBC or HBase.
hive.stats.jdbc.timeout
Default Value: 30
Added In:
hive.stats.retries.max
Default Value: 0
Added In:
Maximum number of retries when the stats publisher/aggregator gets an exception updating the intermediate database. The default is no retries on failures.
hive.stats.retries.wait
Default Value: 3000
Added In:
The base waiting window (in milliseconds) before the next retry. The actual wait time is calculated by baseWindow * failures + baseWindow * (failures + 1) * (random number between [0.0,1.0]).
hive.client.stats.publishers
Default Value: (empty)
Added In:
Comma-separated list of statistics publishers to be invoked on counters on each job. A client stats publisher is
specified as the name of a Java class which implements the org.apache.hadoop.hive.ql.stats.ClientStatsPublisher
interface.
hive.client.stats.counters
Default Value: (empty)
Added In:
Subset of counters that should be of interest for hive.client.stats.publishers (when one wants to limit their
publishing). Non-display names should be used.
Authentication/Authorization
hive.security.authorization.enabled
Default Value: false
Added In:
hive.security.authorization.manager
Default Value:
org.apache.hadoop.hive.ql.security.authorization.DefaultHiveAuthorizationProv
ider
Added In:
The hive client authorization manager class name. The user defined authorization class should implement
interface org.apache.hadoop.hive.ql.security.authorization.HiveAuthorizationProvider.
hive.security.authenticator.manager
Default Value: org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator
Added In:
Hive client authenticator manager class name. The user defined authenticator should implement interface
org.apache.hadoop.hive.ql.security.HiveAuthenticationProvider.
hive.security.authorization.createtable.user.grants
Default Value: (empty)
Added In:
The privileges automatically granted to some users whenever a table gets created. An example like "userX,userY:select;userZ:create" will grant select privilege to userX and userY, and grant create privilege to userZ whenever a new table is created.
hive.security.authorization.createtable.group.grants
Default Value: (empty)
Added In:
The privileges automatically granted to some groups whenever a table gets created. An example like "groupX,groupY:select;groupZ:create" will grant select privilege to groupX and groupY, and grant create privilege to groupZ whenever a new table is created.
hive.security.authorization.createtable.role.grants
Default Value: (empty)
Added In:
The privileges automatically granted to some roles whenever a table gets created. An example like "roleX,roleY:select;roleZ:create" will grant select privilege to roleX and roleY, and grant create privilege to roleZ whenever a new table is created.
hive.security.authorization.createtable.owner.grants
Default Value: (empty)
Added In:
The privileges automatically granted to the owner whenever a table gets created. An example like "select,drop"
will grant select and drop privilege to the owner of the table.
Archiving
fs.har.impl
Default Value: org.apache.hadoop.hive.shims.HiveHarFileSystem
Added In:
The implementation for accessing Hadoop Archives. Note that this won't be applicable to Hadoop versions less
than 0.20.
hive.archive.enabled
Default Value: false
Added In:
hive.archive.har.parentdir.settable
Default Value: false
Added In:
In new Hadoop versions, the parent directory must be set while creating a HAR. Because this functionality is hard
to detect with just version numbers, this conf var needs to be set manually.
Locking
hive.support.concurrency
Default Value: false
Added In:
Whether hive supports concurrency or not. A zookeeper instance must be up and running for the default hive lock
manager to support read-write locks.
hive.lock.mapred.only.operation
Default Value: false
Added In:
Controls whether to acquire a lock only for queries that need to execute at least one MapReduce job.
hive.lock.numretries
Default Value: 100
Added In:
The number of times you want to try to get all the locks.
hive.unlock.numretries
Default Value: 10
Added In:
hive.lock.sleep.between.retries
Default Value: 60
Added In:
hive.zookeeper.quorum
Default Value: (empty)
Added In:
The list of zookeeper servers to talk to. This is only needed for read/write locks.
hive.zookeeper.client.port
Default Value: 2181
Added In:
The port of zookeeper servers to talk to. This is only needed for read/write locks.
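Putting the locking settings together, a client session might enable the ZooKeeper-backed lock manager along these lines (hypothetical server names):

```sql
set hive.support.concurrency=true;
set hive.zookeeper.quorum=zk1.example.com,zk2.example.com,zk3.example.com;
set hive.zookeeper.client.port=2181;
```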
hive.zookeeper.session.timeout
Default Value: 600000
Added In:
ZooKeeper client's session timeout. If a heartbeat is not sent within the timeout, the client is disconnected and, as a result, all locks are released.
hive.zookeeper.namespace
Default Value: hive_zookeeper_namespace
Added In:
The parent node under which all zookeeper nodes are created.
hive.zookeeper.clean.extra.nodes
Default Value: false
Added In:
Clustering
hive.cluster.delegation.token.store.class
Default Value: org.apache.hadoop.hive.thrift.MemoryTokenStore
Added In:
hive.cluster.delegation.token.store.zookeeper.connectString
Default Value: localhost:2181
Added In:
hive.cluster.delegation.token.store.zookeeper.znode
Default Value: /hive/cluster/delegation
Added In:
hive.cluster.delegation.token.store.zookeeper.acl
Default Value: sasl:hive/host1@EXAMPLE.COM:cdrwa,sasl:hive/host2@EXAMPLE.COM:cdrwa
Added In:
ACL for token store entries. List all server principals for the cluster, comma separated.
Regions
hive.use.input.primary.region
Default Value: true
Added In:
When creating a table from an input table, create the table in the input table's primary region.
hive.default.region.name
Default Value: default
Added In:
hive.region.properties
Default Value: (empty)
Added In:
hive.cli.print.current.db
Default Value: false
Added In:
HBase StorageHandler
hive.hbase.wal.enabled
Default Value: true
Added In:
Whether writes to HBase should be forced to the write-ahead log. Disabling this improves HBase write
performance at the risk of lost writes in case of a crash.
hive.hwi.listen.host
Default Value: 0.0.0.0
Added In:
This is the host address the Hive Web Interface will listen on.
hive.hwi.listen.port
Default Value: 9999
Added In:
This is the port the Hive Web Interface will listen on.
Test Properties
hive.test.mode
Default Value: false
Added In:
Whether hive is running in test mode. If yes, it turns on sampling and prefixes the output tablename.
hive.test.mode.prefix
Default Value: test_
Added In:
If hive is running in test mode, prefixes the output table by this string.
hive.test.mode.samplefreq
Default Value: 32
Added In:
If hive is running in test mode and table is not bucketed, sampling frequency.
hive.test.mode.nosamplelist
Default Value: (empty)
Added In:
If Hive is running in test mode, don't sample the tables in this comma-separated list.
HivePlugins
package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}

(Note that there's already a built-in function for this, it's just an easy example).
After compiling your code to a jar, you need to add this to the hive classpath. See the section below on deploying
jars.
Once hive is started up with your jars in the classpath, the final step is to register your function:
create temporary function my_lower as 'com.example.hive.udf.Lower';
...
OK
cmo 13.0
vp 7.0
By default, it will look in the current directory. You can also specify a full path:
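A sketch of the two forms, with a hypothetical jar name and path:

```sql
hive> add jar my_jar.jar;
hive> add jar /tmp/my_jar.jar;
```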
Your jar will then be on the classpath for all jobs initiated from that session. To see which jars have been added to
the classpath you can use:
hive> list jars;
my_jar.jar
LanguageManual UDF
Built-in Operators
Relational Operators
The following operators compare the passed operands and generate a TRUE or FALSE value depending on
whether the comparison between the operands holds.
A <> B (all primitive types): NULL if A or B is NULL; TRUE if expression A is not equal to expression B; otherwise FALSE.
A < B (all primitive types): NULL if A or B is NULL; TRUE if expression A is less than expression B; otherwise FALSE.
A <= B (all primitive types): NULL if A or B is NULL; TRUE if expression A is less than or equal to expression B; otherwise FALSE.
A > B (all primitive types): NULL if A or B is NULL; TRUE if expression A is greater than expression B; otherwise FALSE.
A >= B (all primitive types): NULL if A or B is NULL; TRUE if expression A is greater than or equal to expression B; otherwise FALSE.
A RLIKE B (strings): NULL if A or B is NULL; TRUE if string A matches the Java regular expression B (see Java regular expressions syntax); otherwise FALSE. E.g. 'foobar' rlike 'foo' evaluates to FALSE, whereas 'foobar' rlike '^f.*r$' evaluates to TRUE.
Arithmetic Operators
The following operators support various common arithmetic operations on the operands. All return number types;
if any of the operands are NULL, then the result is also NULL.
A + B (all number types): Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. E.g. since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
A - B (all number types): Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B (all number types): Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes an overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B (all number types): Gives the result of dividing A by B. The result is a double type.
A % B (all number types): Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B (all number types): Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B (all number types): Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B (all number types): Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A (all number types): Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.
Logical Operators
The following operators provide support for creating logical expressions. All of them return boolean TRUE,
FALSE, or NULL depending upon the boolean values of the operands. NULL behaves as an "unknown" flag, so if
the result depends on the state of an unknown, the result itself is unknown.
A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE; NULL if A or B is NULL.
A OR B (boolean): TRUE if either A or B or both are TRUE; FALSE OR NULL is NULL; otherwise FALSE.
A || B (boolean): Same as A OR B.
Complex Type Constructors
map(key1, value1, key2, value2, ...): Creates a map with the given key/value pairs.
struct(val1, val2, val3, ...): Creates a struct with the given field values. Struct field names will be col1, col2, ....
named_struct(name1, val1, name2, val2, ...): Creates a struct with the given field names and values.
array(val1, val2, ...): Creates an array with the given elements.
Operators on Complex Types
The following operators provide mechanisms to access elements in Complex Types
A[n] (A is an array and n is an int): Returns the nth element in the array A. The first element has index 0. E.g. if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map. E.g. if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): Returns the x field of S. E.g. for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.
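As a sketch, assuming a hypothetical table t with an array column arr, a map column m, and a struct column s with a field foo:

```sql
select arr[0], m['all'], s.foo from t;
```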
Built-in Functions
Mathematical Functions
The following built-in mathematical functions are supported in Hive; most return NULL when the argument(s) are NULL:
BIGINT floor(double a): Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT ceil(double a), ceiling(double a): Returns the minimum BIGINT value that is equal to or greater than the double.
double rand(), rand(int seed): Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed will make sure the generated random number sequence is deterministic.
double exp(double a): Returns e^a, where e is the base of the natural logarithm.
double pow(double a, double p), power(double a, double p): Returns a^p.
string hex(BIGINT a), hex(string a): If the argument is an int, hex returns the number as a string in hex format. Otherwise, if the argument is a string, it converts each character into its hex representation and returns the resulting string. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_hex.)
string unhex(string a): Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts it to the character represented by the number.
string conv(BIGINT num, int from_base, int to_base): Converts a number from a given base to another. (See http://dev.mysql.com/doc/refman/5.0/en/mathematical-functions.html#function_conv.)
Collection Functions
The following built-in collection functions are supported in Hive:
array<K> map_keys(Map<K, V>): Returns an unordered array containing the keys of the input map.
array<V> map_values(Map<K, V>): Returns an unordered array containing the values of the input map.
Type Conversion Functions
<type> cast(expr as <type>): Converts the result of the expression expr to <type>. E.g. cast('1' as BIGINT) will convert the string '1' to its integral representation. NULL is returned if the conversion does not succeed.
Date Functions
The following built-in date functions are supported in hive:
string from_unixtime(bigint unixtime[, string format]): Converts the number of seconds from Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00".
bigint unix_timestamp(): Gets the current timestamp using the default time zone.
bigint unix_timestamp(string date): Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp; returns 0 on failure: unix_timestamp('2009-03-20 11:30:01') = 1237573801.
string to_date(string timestamp): Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
int year(string date): Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970.
int month(string date): Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11.
int day(string date), dayofmonth(date): Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1.
int hour(string date): Returns the hour of the timestamp: hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12.
int datediff(string enddate, string startdate): Returns the number of days from startdate to enddate: datediff('2009-03-01', '2009-02-27') = 2.
Conditional Functions
T COALESCE(T v1, T v2, ...): Returns the first v that is not NULL, or NULL if all v's are NULL.
T CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END: When a = b, returns c; when a = d, returns e; else returns f.
T CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END: When a = true, returns b; when c = true, returns d; else returns e.
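A sketch combining COALESCE and CASE, assuming a hypothetical users table with nickname, name, and status columns:

```sql
select id,
       coalesce(nickname, name, 'unknown') as display_name,
       case status when 1 then 'active' when 0 then 'inactive'
            else 'unknown' end as status_label
from users;
```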
String Functions
The following built-in string functions are supported in Hive:
string concat(string A, string B, ...): Returns the string resulting from concatenating the strings passed in as parameters, in order. E.g. concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings.
string concat_ws(string SEP, string A, string B, ...): Like concat() above, but with a custom separator SEP.
string substr(string A, int start), substring(string A, int start): Returns the substring of A starting from start position till the end of string A. E.g. substr('foobar', 4) results in 'bar'. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr.)
string substr(string A, int start, int len), substring(string A, int start, int len): Returns the substring of A starting from start position with length len. E.g. substr('foobar', 4, 1) results in 'b'. (See http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr.)
string upper(string A), ucase(string A): Returns the string resulting from converting all characters of A to upper case. E.g. upper('fOoBaR') results in 'FOOBAR'.
string lower(string A), lcase(string A): Returns the string resulting from converting all characters of A to lower case. E.g. lower('fOoBaR') results in 'foobar'.
string regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT. E.g. regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
string regexp_extract(string subject, string pattern, int index): Returns the string extracted using the pattern. E.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
string parse_url(string urlString, string partToExtract [, string keyToExtract]): Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. E.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. The value of a particular key in QUERY can also be extracted by providing the key as the third argument, e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.
string get_json_object(string json_string, string path): Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted JSON object. Returns NULL if the input JSON string is invalid. NOTE: the JSON path can only have the characters [0-9a-z_], i.e., no upper-case or special characters. Also, the keys *cannot start with numbers.* This is due to restrictions on Hive column names.
int ascii(string str): Returns the numeric value of the first character of str.
string lpad(string str, int len, string pad): Returns str, left-padded with pad to a length of len.
string rpad(string str, int len, string pad): Returns str, right-padded with pad to a length of len.
array split(string str, string pat): Splits str around pat (pat is a regular expression).
int find_in_set(string str, string strList): Returns the position of the first occurrence of str in strList, where strList is a comma-delimited string. Returns NULL if either argument is NULL. Returns 0 if the first argument contains any commas. E.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3.
int locate(string substr, string str[, int pos]): Returns the position of the first occurrence of substr in str after position pos.
int instr(string str, string substr): Returns the position of the first occurrence of substr in str.
map<string,string> str_to_map(text[, delimiter1, delimiter2]): Splits text into key-value pairs using two delimiters. Delimiter1 separates the text into K-V pairs, and delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2.
array<array<string>> sentences(string str, string lang, string locale): Tokenizes a string of natural-language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' arguments are optional. E.g. sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).
array<struct<string,double>> ngrams(array<array<string>>, int N, int K, int pf): Returns the top-k N-grams from a set of tokenized sentences, such as those returned by the sentences() UDAF. See StatisticsAndDataMining for more information.
boolean in_file(string str, string filename): Returns true if the string str appears as an entire line in filename.
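For example (the find_in_set result is given above; the str_to_map result assumes the default ',' and '=' delimiters, and t is a hypothetical one-row table):

```sql
select find_in_set('ab', 'abc,b,ab,c,def') from t;  -- 3
select str_to_map('a=1,b=2') from t;                -- {"a":"1","b":"2"}
```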
Misc. Functions
xpath
The following functions are described in LanguageManual XPathUDF:
get_json_object
A limited version of JSONPath is supported:
$ : Root object
. : Child operator
[] : Subscript operator for array
* : Wildcard for []
json
{"store":
  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
The fields of the json object can be extracted using these queries:
amy
{"weight":8,"type":"apple"}
NULL
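The queries behind these results might look like the following, assuming the JSON above is stored in a column named json of a hypothetical table src_json:

```sql
select get_json_object(src_json.json, '$.owner') from src_json;           -- amy
select get_json_object(src_json.json, '$.store.fruit[0]') from src_json;  -- {"weight":8,"type":"apple"}
select get_json_object(src_json.json, '$.non_exist_key') from src_json;   -- NULL
```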
bigint | count(*), count(expr), count(DISTINCT expr[, expr...]) | count(*) returns the total number of retrieved rows, including rows containing NULL values; count(expr) returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) returns the number of rows for which the supplied expression(s) are unique and non-NULL.
double | sum(col), sum(DISTINCT col) | Returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
double | avg(col), avg(DISTINCT col) | Returns the average of the elements in the group or the average of the distinct values of the column in the group.
double | max(col) | Returns the maximum value of the column in the group.
double | variance(col), var_pop(col) | Returns the variance of a numeric column in the group.
double | covar_samp(col1, col2) | Returns the sample covariance of a pair of numeric columns in the group.
double | percentile(BIGINT col, p) | Returns the exact p-th percentile of a column in the group (does not work with floating point types). p must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
array<double> | percentile(BIGINT col, array(p1 [, p2]...)) | Returns the exact percentiles p1, p2, ... of a column in the group (does not work with floating point types). Each pi must be between 0 and 1. NOTE: A true percentile can only be computed for integer values. Use PERCENTILE_APPROX if your input is non-integral.
double | percentile_approx(DOUBLE col, p [, B]) | Returns an approximate p-th percentile of a numeric column (including floating point types) in the group. The B parameter controls approximation accuracy at the cost of memory. Higher values yield better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, this gives an exact percentile value.
array<double> | percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B]) | Same as above, but accepts and returns an array of percentile values instead of a single one.
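As an illustration of the percentile() semantics above, here is a minimal Python sketch. It is not Hive's UDAF code; it assumes exact computation over integer inputs with linear interpolation between adjacent ranks:

```python
def percentile(values, p):
    """Exact p-th percentile over a group of integer values,
    interpolating linearly between the two nearest ranks."""
    if not 0 <= p <= 1:
        raise ValueError('p must be between 0 and 1')
    xs = sorted(values)
    pos = p * (len(xs) - 1)      # fractional rank of the percentile
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

print(percentile([1, 2, 3, 4], 0.5))   # 2.5
```

percentile_approx() differs in that it bounds memory by keeping at most B summary points, trading exactness for scalability on high-cardinality columns.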
explode
explode() takes in an array as an input and outputs the elements of the array as separate rows. UDTF's can be
used in the SELECT expression list and as a part of LATERAL VIEW.
An example use of explode() in the SELECT expression list is as follows:
Consider a table named myTable that has a single column (myCol) of type array<int> and two rows:
Array<int> myCol
[1,2,3]
[4,5,6]
Then the query:
SELECT explode(myCol) AS myNewCol FROM myTable;
will produce:
(int) myNewCol
1
2
3
4
5
6
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
- No other expressions are allowed in SELECT (SELECT pageid, explode(adid_list) AS myCol... is not supported)
- UDTF's can't be nested (SELECT explode(explode(adid_list)) AS myCol... is not supported)
- GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported (SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported)
Please see LanguageManual LateralView for an alternative syntax that does not have these limitations.
explode(array<TYPE> a) | For each element in a, explode() generates a row containing that element.
stack(INT n, v_1, v_2, ..., v_k) | Breaks up v_1, ..., v_k into n rows. Each row will have k/n columns. n must be constant.
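The row-generating behavior of explode() and stack() can be sketched in Python; the two functions below are hypothetical stand-ins, not Hive's implementation:

```python
def explode(arr):
    """Like Hive's explode(): one output row per array element."""
    for element in arr:
        yield element

def stack(n, *values):
    """Like Hive's stack(): break k values into n rows of k/n columns."""
    k = len(values)
    width = k // n
    for i in range(n):
        yield tuple(values[i * width:(i + 1) * width])

# myTable from the example above: rows [1,2,3] and [4,5,6]
rows = [[1, 2, 3], [4, 5, 6]]
print([x for row in rows for x in explode(row)])   # [1, 2, 3, 4, 5, 6]
print(list(stack(2, 'a', 1, 'b', 2)))              # [('a', 1), ('b', 2)]
```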
json_tuple
A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns
a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve
more than one key from a single JSON string. In any case where a single JSON string would be parsed more
than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As
JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.
For example,
select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
should be changed to
select a.timestamp, b.*
from log a lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
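The efficiency argument above amounts to parsing each JSON string once. A minimal Python sketch of json_tuple()'s semantics (the function name is a stand-in, and unlike Hive it returns parsed values rather than strings):

```python
import json

def json_tuple(json_str, *keys):
    """Like Hive's json_tuple(): parse the JSON string once and
    return one value per requested key (None for missing keys),
    instead of re-parsing once per get_json_object() call."""
    try:
        obj = json.loads(json_str)
    except ValueError:
        return tuple(None for _ in keys)
    return tuple(obj.get(k) for k in keys)

row = '{"eventid": 42, "eventname": "login", "extra": "ignored"}'
print(json_tuple(row, 'eventid', 'eventname'))   # (42, 'login')
```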
parse_url_tuple
The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the
data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the
partToExtract argument, e.g. parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1',
'QUERY:k2') returns a tuple with values of 'v1','v2'. This is more efficient than calling parse_url() multiple times. All
the input parameters and output column types are string.
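The behavior described above can be sketched with Python's standard urllib.parse. The helper below is an illustrative stand-in, not Hive's code, and supports only HOST, PATH, QUERY and QUERY:<key>:

```python
from urllib.parse import urlparse, parse_qs

def parse_url_tuple(url, *parts):
    """Extract several URL parts in one pass, like parse_url_tuple().
    QUERY:<key> pulls a single value out of the query string."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    out = []
    for part in parts:
        if part == 'HOST':
            out.append(parsed.hostname)
        elif part == 'PATH':
            out.append(parsed.path)
        elif part == 'QUERY':
            out.append(parsed.query)
        elif part.startswith('QUERY:'):
            values = query.get(part[len('QUERY:'):])
            out.append(values[0] if values else None)
        else:
            out.append(None)
    return tuple(out)

url = 'http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1'
print(parse_url_tuple(url, 'QUERY:k1', 'QUERY:k2'))  # ('v1', 'v2')
```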
SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
A related pitfall: a query such as
select f(col) as fc, count(*) from table_name group by fc
fails with:
FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column
Reference fc
because you are not able to GROUP BY or SORT BY a column alias on which a function has been applied.
There are two workarounds. First, you can reformulate this query with subqueries, which is somewhat
complicated:
select sq.fc,col1,col2,...,colN,count(*) from
  (select f(col) as fc,col1,col2,...,colN from table_name) sq
group by sq.fc,col1,col2,...,colN
Or you can make sure not to use a column alias, which is simpler:
select f(col) as fc, count(*) from table_name group by f(col)
Contact Tim Ellis (tellis) at RiotGames dot com if you would like to discuss this in further detail.
StatisticsAndDataMining
Contextual n-grams are similar to n-grams, but allow you to specify a 'context' string around which n-grams are to
be estimated. For example, you can specify that you're only interested in finding the most common two-word
phrases in text that follow the context "I love". You could achieve the same result by manually stripping sentences
of non-contextual content and then passing them to ngrams(), but context_ngrams() makes it much easier.
Use Cases
1. (ngrams) Find important topics in text in conjunction with a stopword list.
2. (ngrams) Find trending topics in text.
3. (context_ngrams) Extract marketing intelligence around certain words (e.g., "Twitter is ___").
4. (ngrams) Find frequently accessed URL sequences.
5. (context_ngrams) Find frequently accessed URL sequences that start or end at a particular URL.
6. (context_ngrams) Pre-compute common search lookaheads.
Usage
SELECT ngrams(sentences(lower(tweet)), 2, 100 [, 1000]) FROM twitter;
The command above will return the top-100 bigrams (2-grams) from a hypothetical table called twitter. The
tweet column is assumed to contain a string with arbitrary, possibly meaningless, text. The lower() UDF first
converts the text to lowercase for standardization, and then sentences() splits up the text into arrays of words.
The optional fourth argument is the precision factor that controls the tradeoff between memory usage and
accuracy in frequency estimation. Higher values will be more accurate, but could potentially crash the JVM with
an OutOfMemory error. If omitted, sensible defaults are used.
Note that the following two queries are identical, but ngrams() will be slightly faster in practice:
SELECT ngrams(sentences(lower(tweet)), 2, 100 [, 1000]) FROM twitter;
SELECT context_ngrams(sentences(lower(tweet)), array(null,null), 100 [, 1000]) FROM twitter;
Example
SELECT explode(ngrams(sentences(lower(val)), 2, 10)) AS x FROM kafka;
{"ngram":["of","the"],"estfrequency":23.0}
{"ngram":["on","the"],"estfrequency":20.0}
{"ngram":["in","the"],"estfrequency":18.0}
{"ngram":["he","was"],"estfrequency":17.0}
{"ngram":["at","the"],"estfrequency":17.0}
{"ngram":["that","he"],"estfrequency":16.0}
{"ngram":["to","the"],"estfrequency":16.0}
{"ngram":["out","of"],"estfrequency":16.0}
{"ngram":["he","had"],"estfrequency":16.0}
{"ngram":["it","was"],"estfrequency":15.0}
A context_ngrams() example, finding the words that most frequently follow "he":
SELECT explode(context_ngrams(sentences(lower(val)), array("he", null), 10)) AS x FROM kafka;
{"ngram":["was"],"estfrequency":17.0}
{"ngram":["had"],"estfrequency":16.0}
{"ngram":["thought"],"estfrequency":13.0}
{"ngram":["could"],"estfrequency":9.0}
{"ngram":["would"],"estfrequency":7.0}
{"ngram":["lay"],"estfrequency":5.0}
{"ngram":["s"],"estfrequency":4.0}
{"ngram":["wanted"],"estfrequency":4.0}
{"ngram":["did"],"estfrequency":4.0}
{"ngram":["felt"],"estfrequency":4.0}
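To make the estimation concrete, the following Python sketch computes exact top-k n-gram counts shaped like the rows above. Hive's UDAF approximates these under a memory bound; the function and sample data here are illustrative only:

```python
from collections import Counter

def ngrams(sentences, n, k):
    """Count n-grams across tokenized sentences and return the top-k
    as {'ngram': ..., 'estfrequency': ...} rows (exact counts here;
    Hive estimates them under a memory bound)."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return [{'ngram': list(g), 'estfrequency': float(c)}
            for g, c in counts.most_common(k)]

text = [['out', 'of', 'the', 'frying', 'pan'],
        ['into', 'the', 'fire'],
        ['out', 'of', 'sight']]
top = ngrams(text, 2, 2)
print(top)  # top entry is {'ngram': ['out', 'of'], 'estfrequency': 2.0}
```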
Use Cases
1. Estimating the frequency distribution of a column, possibly grouped by other attributes.
2. Choosing discretization points in a continuous valued column.
Usage
SELECT histogram_numeric(age, 10) FROM users GROUP BY gender;
The command above is self-explanatory. Converting the output into a graphical display is a bit more involved.
Assuming that you've parsed the output from histogram_numeric() into a text file of (x,y) pairs called data.txt, a
Gnuplot command along the lines of "plot 'data.txt' u 1:2 w impulses" should do it.
Example
SELECT explode(histogram_numeric(val, 10)) AS x FROM normal;
{"x":-3.6505464999999995,"y":20.0}
{"x":-2.7514727901960785,"y":510.0}
{"x":-1.7956678951954481,"y":8263.0}
{"x":-0.9878507685761995,"y":19167.0}
{"x":-0.2625338380837097,"y":31737.0}
{"x":0.5057392319427763,"y":31502.0}
{"x":1.2774146480311135,"y":14526.0}
{"x":2.083955560712489,"y":3986.0}
{"x":2.9209550254545484,"y":275.0}
{"x":3.674835214285715,"y":14.0}
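The bin placement above comes from a heuristic that merges nearby values. The toy Python sketch below is loosely modeled on the same centroid-merging idea (in the spirit of the Ben-Haim/Tom-Tov streaming histogram), not Hive's actual algorithm:

```python
def histogram_numeric(values, b):
    """Keep at most b (x, y) bins; when over budget, merge the two
    closest bins into their weighted centroid."""
    bins = []  # sorted list of [x, y] pairs
    for v in values:
        bins.append([float(v), 1.0])
        bins.sort(key=lambda p: p[0])
        while len(bins) > b:
            # find the pair of adjacent bins with the smallest gap
            i = min(range(len(bins) - 1),
                    key=lambda j: bins[j + 1][0] - bins[j][0])
            x1, y1 = bins[i]
            x2, y2 = bins[i + 1]
            # merge them into a weighted centroid
            bins[i:i + 2] = [[(x1 * y1 + x2 * y2) / (y1 + y2), y1 + y2]]
    return [{'x': x, 'y': y} for x, y in bins]

hist = histogram_numeric([1, 1.2, 5, 5.1, 9, 9.3, 2, 6], 3)
print(hist)  # 3 bins; the y values sum to the number of inputs (8)
```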
Schema Browsing
An alternative to running 'show tables' or 'show extended tables' from the CLI is to use the web-based schema
browser. The Hive metadata is presented in a hierarchical manner, allowing you to start at the database level and
click through to information about tables, including the SerDe, column names, and column types.
No local installation
Any user with a web browser can work with Hive, with the usual benefits of a web interface. In particular, a user
wishing to interact with Hadoop or Hive directly requires access to many ports, whereas a remote or VPN user
would only require access to the Hive Web Interface, which by default listens on 0.0.0.0 tcp/9999.
Configuration
Hive Web Interface made its first appearance in the 0.2 branch. If you have 0.2 or the SVN trunk you already
have it.
You should not need to edit the defaults for the Hive web interface. HWI uses:
<property>
<name>hive.hwi.listen.host</name>
<value>0.0.0.0</value>
<description>This is the host address the Hive Web Interface will listen
on</description>
</property>
<property>
<name>hive.hwi.listen.port</name>
<value>9999</value>
<description>This is the port the Hive Web Interface will listen on</description>
</property>
<property>
<name>hive.hwi.war.file</name>
<value>${HIVE_HOME}/lib/hive_hwi.war</value>
<description>This is the WAR file with the jsp content for Hive Web
Interface</description>
</property>
You probably want to set up HiveDerbyServerMode to allow multiple sessions at the same time.
Start up
When Hive is invoked with no arguments, the CLI starts. Hive has an extension architecture used to start
other Hive daemons, and HWI is started as one of these services.
Jetty requires Apache Ant to start HWI. You should define ANT_LIB as an environment variable or add it to the
hive invocation:
export ANT_LIB=/opt/ant/lib
bin/hive --service hwi
Java has no direct way of daemonizing; in a production environment you should create a wrapper script.
If you want help on the service invocation or the list of parameters, you can run:
bin/hive --service hwi --help
Authentication
Hadoop currently uses environment properties to determine the user name and group vector. Thus Hive and the Hive
Web Interface cannot enforce more stringent security than Hadoop can. When you first connect to the Hive Web
Interface, you are prompted for a user name and groups. This feature was added to support installations using
different schedulers.
If you want to tighten up security, you are going to need to patch the Hive Session Manager source, or you may be
able to tweak the JSP to accomplish this.
Accessing
In order to access the Hive Web Interface, go to <Hive Server Address>:9999/hwi on your web browser.
Result file
The result file is local to the web server. A query that produces massive output should set the result file to
/dev/null.
Debug Mode
The debug mode is used when the user wants the result file to contain not only the result of the Hive
query but other messages as well.
Set Processor
In the CLI, a command like 'SET x=5' is not processed by the Query Processor; it is processed by
the Set Processor. In HWI, use the form 'x=5', not 'set x=5'.
Walk through
Authorize
(Screenshots: 1_hwi_authorize.png, 2_hwi_authorize.png)
Schema Browser
(Screenshots: 3_schema_table.png, 4_schema_browser.png)
Diagnostics
(Screenshot: 5_diagnostic.png)
Running a query
(Screenshots: 6_newsession.png, 7_session_runquery.png, 8_session_query_1.png, 9_file_view.png)
HiveClient
This page describes the different clients supported by Hive. The command line client currently only supports an
embedded server. The JDBC and thrift-java clients support both embedded and standalone servers. Clients in
other languages only support standalone servers. For details about the standalone server see Hive Server.
Command line
Operates in embedded mode only, i.e., it needs to have access to the hive libraries. For more details see Getting
Started.
JDBC
For embedded mode, uri is just "jdbc:hive://". For standalone server, uri is "jdbc:hive://host:port/dbname" where
host and port are determined by where the hive server is run. For example, "jdbc:hive://localhost:10000/default".
Currently, the only dbname supported is "default".
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

public class HiveJdbcClient {
  private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    try {
      Class.forName(driverName);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    String tableName = "testHiveDriverTable";
    stmt.executeQuery("drop table " + tableName);
    stmt.executeQuery("create table " + tableName + " (key int, value string)");
    // show tables
    String sql = "show tables '" + tableName + "'";
    ResultSet res = stmt.executeQuery(sql);
    if (res.next()) {
      System.out.println(res.getString(1));
    }
    // describe table
    sql = "describe " + tableName;
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }
    // load data into table
    // NOTE: /tmp/a.txt is a ctrl-A separated file with two fields per line
    String filepath = "/tmp/a.txt";
    sql = "load data local inpath '" + filepath + "' into table " + tableName;
    res = stmt.executeQuery(sql);
    // select * query
    sql = "select * from " + tableName;
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
    }
    // regular hive query
    sql = "select count(1) from " + tableName;
    res = stmt.executeQuery(sql);
    while (res.next()) {
      System.out.println(res.getString(1));
    }
  }
}
$ javac HiveJdbcClient.java
# To run the program in standalone mode, we need the following jars in the classpath
# from hive/build/dist/lib
# hive_exec.jar
# hive_jdbc.jar
# hive_metastore.jar
# hive_service.jar
# libfb303.jar
# log4j-1.2.15.jar
# from hadoop/build
# hadoop-*-core.jar
# To run the program in embedded mode, we need the following additional jars in the classpath
# from hive/build/dist/lib
# antlr-runtime-3.0.1.jar
# derby.jar
# jdo2-api-2.1.jar
# jpox-core-1.2.2.jar
# jpox-rdbms-1.2.2.jar
#
# as well as hive/build/dist/conf
# Alternatively, you can run the following bash script, which will seed the data file
# and build your classpath before invoking the client.
#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive

echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt

HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-*-core.jar)
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf

for i in ${HIVE_HOME}/lib/*.jar ; do
    CLASSPATH=$CLASSPATH:$i
done

java -cp $CLASSPATH HiveJdbcClient
Python
Operates only on a standalone server. Set (and export) PYTHONPATH to build/dist/lib/py.
The python modules imported in the code below are generated by building hive.
Please note that the generated python module names have changed in hive trunk.
#!/usr/bin/env python

import sys

from hive import ThriftHive
from hive.ttypes import HiveServerException
from thrift import Thrift
from thrift.transport import TSocket
from thrift.transport import TTransport
from thrift.protocol import TBinaryProtocol

try:
    transport = TSocket.TSocket('localhost', 10000)
    transport = TTransport.TBufferedTransport(transport)
    protocol = TBinaryProtocol.TBinaryProtocol(transport)

    client = ThriftHive.Client(protocol)
    transport.open()

    client.execute("CREATE TABLE r(a STRING, b INT, c DOUBLE)")
    client.execute("SELECT * FROM r")
    while (1):
        row = client.fetchOne()
        if (row == None):
            break
        print row
    client.execute("SELECT * FROM r")
    print client.fetchAll()

    transport.close()

except Thrift.TException, tx:
    print '%s' % (tx.message)
PHP
Operates only on a standalone server.
<?php
// set THRIFT_ROOT to php directory of the hive distribution
$GLOBALS['THRIFT_ROOT'] = '/lib/php/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';

$transport = new TSocket('localhost', 10000);
$client = new ThriftHiveClient(new TBinaryProtocol($transport));
$transport->open();
$client->execute('SELECT * FROM src');
var_dump($client->fetchAll());
$transport->close();
ODBC
Operates only on a standalone server. See Hive ODBC.