
An Overview Of Hadoop Hive

Hadoop Hive

Hive is a component of the Hadoop ecosystem that is built on top of HDFS and acts as a data warehouse system for Hadoop.

Hive is used for data summarization, ad-hoc querying, and query language processing.

Hive was first developed at Facebook (2007) and is now maintained under the ASF, i.e. the Apache Software Foundation.

Apache Hive supports analysis of large datasets stored in Hadoop-compatible file systems such as the Amazon S3 file system.

Hive provides an SQL-like language called HiveQL while also maintaining full support for MapReduce.

Hive does not mandate that data be read or written in a special "Hive format"; there is no such thing.

Hive works equally well on Thrift, control-delimited, and your own specialized data formats.

Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates.

It is best used for batch jobs over large sets of append-only data.
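As a purely illustrative sketch, a typical batch-style Hive job defines a table over files already sitting in HDFS and then aggregates over the whole dataset; the table name page_views, its columns, and the HDFS path below are assumptions made up for this example:

-- a minimal sketch: an external table over append-only log files in HDFS
CREATE EXTERNAL TABLE page_views (dt STRING, userid BIGINT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- a batch aggregation that scans the full table (compiled into MapReduce jobs)
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url;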

Hive Architecture:-

Driver:- manages the life cycle of a HiveQL query as it moves through Hive, and also manages the session handle and session statistics.

Compiler:- compiles HiveQL into a directed acyclic graph of map/reduce tasks.

Execution engine:- executes the tasks in proper dependency order and also interacts with Hadoop.

Hive Server:- provides a Thrift interface and JDBC/ODBC for integrating other applications.

Client components:- the CLI, the web interface, and the JDBC/ODBC interface.

Extensibility interfaces:- include SerDe, User Defined Functions (UDFs), and User Defined Aggregate Functions (UDAFs).
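The UDF hook, for instance, surfaces directly in HiveQL through ADD JAR and CREATE TEMPORARY FUNCTION. The sketch below assumes a jar /tmp/my_udfs.jar containing a UDF class com.example.Lower; both names are hypothetical and used only for illustration:

-- register a user defined function packaged in a jar (names are assumptions)
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.Lower';

-- use the UDF alongside a built-in aggregate (UDAF) in an ordinary query
SELECT my_lower(url), COUNT(1)
FROM page_views
GROUP BY my_lower(url);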

Metastore:-

The metastore is the Hive internal database, which stores all table definitions, column-level information, and partition IDs.

By default, Hive uses the Derby database as its metastore.

We can also configure MySQL as the metastore database, or run the metastore as a standalone Thrift service.

The metastore is divided into two pieces: the metastore service and the backing store for the data.

By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk. This is called the embedded metastore configuration.

Using an embedded metastore is a simple way to get started with Hive; however, only one embedded Derby database can access the database files on disk at any one time, which means you can have only one Hive session open at a time that shares the same metastore.

The solution to supporting multiple sessions is to use a standalone database. This configuration is referred to as a local metastore, since the metastore service still runs in the same process as the Hive service but connects to a database running in a separate process, either on the same machine or on a remote machine.

MySQL is a popular choice for the standalone metastore.

In this case, the JDBC driver JAR file for MySQL must be on Hive's classpath, which is simply achieved by placing it in Hive's lib directory.
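For example, one common way to get the driver onto the classpath is simply to copy it into Hive's lib directory; the exact JAR file name and version below are assumptions:

# put the MySQL JDBC driver on Hive's classpath
$ cp mysql-connector-java-5.1.x-bin.jar $HIVE_HOME/lib/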

Installing Hive:-

Installation of Hive is straightforward, and Java 1.6 is a prerequisite.

If you are installing on Windows, you will need Cygwin too.

You also need the same version of Hadoop that your cluster runs installed locally, either in standalone or pseudo-distributed mode, while getting started with Hive.

Steps:-

1. Download a Hive release (the tarball file) from https://hive.apache.org/.
2. Unpack the tarball in a suitable place in your Hadoop installation environment, i.e. $ tar -xzvf hive-0.8.1.tar.gz
3. Set the environment variable HIVE_HOME to point to the installation directory:

$ cd hive-0.8.1
$ export HIVE_HOME=$(pwd)

4. Add $HIVE_HOME/bin to your PATH:

$ export PATH=$HIVE_HOME/bin:$PATH
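At this point you can sanity-check the installation by launching the Hive shell and issuing a trivial command; on a fresh install SHOW TABLES should simply return an empty list:

$ hive
hive> SHOW TABLES;
hive> quit;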

Database Setup:-

Install MySQL server; versions 5.1.46 and 5.1.48 are the versions this setup has been developed and tested with.

Once you have MySQL up and running, use the MySQL command-line tool to add the Hive user and the Hive metastore database.

Now, pick a password for your Hive user and replace the placeholder password in the following commands with it.

Log into MySQL:

cmd:> mysql
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.7.0.mysql.sql;
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT, INSERT, UPDATE, DELETE ON metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER, CREATE ON metastore.* FROM 'hiveuser'@'%';

To start the MySQL service:

cmd:> cd /etc/init.d
> ./mysql start

To stop it: > ./mysql stop

To start the Hive server service:

cmd:> /usr/bin/hive --service hiveserver
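HiveServer exposes its Thrift interface on port 10000 by default, so a quick way to confirm the service is up (assuming the default port has not been changed) is to check for a listener on that port:

$ netstat -nl | grep 10000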

Configuration of Hive:-

Like Hadoop, Hive is configured using an XML configuration file; the file is called hive-site.xml.

hive-site.xml is located in Hive's conf directory.

The same directory contains hive-default.xml, which documents the properties that Hive exposes and their default values.

Configure hive-site.xml as below:


<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
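If you are instead pointing Hive at the MySQL metastore created in the database setup above, the connection properties look roughly like the sketch below; the hostname metastorehost, the user name, and the password are placeholders to replace with your own values:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastorehost/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>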

Comparison with Traditional Databases:-

Hive looks very much like a traditional database, with SQL-style access to the data.

However, because Hive is based on Hadoop and MapReduce operations, there are several key differences.

In a traditional database, a table's schema is enforced at data load time.

If the data being loaded doesn't conform to the schema, it is rejected.

Hive, on the other hand, doesn't verify the data when it is loaded, but rather when a query is issued; this is often called schema on read.

Updates, transactions, and indexes are mainstays of traditional databases.

Yet, until recently, these features have not been part of Hive's feature set. This is because Hive was built to operate over HDFS data using MapReduce, where full-table scans are the norm and a table update is achieved by transforming the data into a new table.
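The schema-on-read behaviour is easy to see in practice: LOAD DATA just moves files into the table's directory without inspecting them, and malformed fields only surface as NULLs when a query runs. The table name raw_events and the file path are made up for this sketch:

-- defining the table does not validate any data
CREATE TABLE raw_events (dt STRING, userid BIGINT, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- the load is essentially a file move; nothing is checked here
LOAD DATA INPATH '/tmp/events.tsv' INTO TABLE raw_events;

-- rows whose second field is not a number come back with userid = NULL
SELECT dt, userid, url FROM raw_events LIMIT 10;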

Hive CLI Options

$HIVE_HOME/bin/hive is a shell utility which can be used to run Hive queries in either interactive or batch mode.

Hive Command Line Options

To get help for Hive options, run "hive -H" or "hive --help".

Command-line options as of Hive 0.9.0:

1. hive -d or --define: variable substitution to apply to Hive commands, e.g. -d A=B or --define A=B
2. hive -e: SQL from the command line
3. hive -f: SQL from a file
4. hive -h: connect to a Hive server on a remote host
5. --hiveconf: use a value for the given property
6. --hivevar: variable substitution to apply to Hive commands, e.g. --hivevar A=B
7. hive -i: initialization SQL file
8. hive -p: connect to a Hive server on a port number
9. hive -S or --silent: silent mode in the interactive shell
10. hive -v or --verbose: verbose mode (echo executed SQL to the console)

As of Hive 0.10.0, there is one additional command-line option: --database, which specifies the database to use.

Examples:-

Example of running a query from the command line:

$HIVE_HOME/bin/hive -e 'select a.col from tab1 a'

Example of setting Hive configuration variables:

$HIVE_HOME/bin/hive -e 'select a.col from tab1 a' --hiveconf hive.exec.scratchdir=/home/my/hive --hiveconf mapred.reduce.tasks=32

Example of dumping data out from a query into a file using silent mode:

$HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt

Example of running a script non-interactively:

$HIVE_HOME/bin/hive -f /home/my/hive-script.sql

Example of running an initialization script before entering interactive mode:

$HIVE_HOME/bin/hive -i /home/my/hive-init.sql

The hiverc File:-

The CLI, when invoked without the -i option, will attempt to load $HIVE_HOME/bin/.hiverc and $HOME/.hiverc as initialization files.
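A .hiverc file simply contains the same statements you would type at the hive> prompt, one per line. A small hedged example (the particular settings chosen here are personal preference, not requirements):

-- show the current database in the prompt
set hive.cli.print.current.db=true;
-- cap the default number of reducers for ad-hoc work
set mapred.reduce.tasks=16;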

Hive Batch mode commands:-

When $HIVE_HOME/bin/hive is run with the -e or -f option, it executes SQL commands in batch mode.

hive -e '<query string>': executes the query string.

hive -f <filepath>: executes one or more SQL queries from a file.

Hive Interactive Shell Mode:-

When $HIVE_HOME/bin/hive is run without either the -e or -f option, it enters interactive shell mode:

# hive
hive>

We have to use ';' to terminate commands.

Comments in scripts can be specified using the '--' prefix, as in the sketch below.
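For example, a small script passed with -f might look like the following; it reuses the hypothetical page_views table from the earlier sketch:

-- hive-script.sql: daily hit counts
-- lines beginning with '--' are ignored by the CLI
SELECT dt, COUNT(*) AS hits
FROM page_views
GROUP BY dt;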

Commands in interactive shell:-

 

1. quit or exit: use quit or exit to leave the interactive shell.

2. reset: resets the configuration to the default values.

3. set <key>=<value>: sets the value of a particular configuration variable (key). Note: if you misspell the variable name, the CLI will not show an error.

4. set: prints a list of configuration variables that are overridden by the user or Hive.

5. set -v: prints all Hadoop and Hive configuration variables.

6. add FILE[S] <filepath>*, add JAR[S] <filepath>*, add ARCHIVE[S] <filepath>*: adds one or more files, jars or archives to the list of resources in the distributed cache.

7. list FILE[S], list JAR[S], list ARCHIVE[S]: lists the resources that are already added to the distributed cache.

8. list FILE[S] <filepath>*, list JAR[S] <filepath>*, list ARCHIVE[S] <filepath>*: checks whether the given resources are already added to the distributed cache or not.

9. delete FILE[S] <filepath>*, delete JAR[S] <filepath>*, delete ARCHIVE[S] <filepath>*: removes the resource(s) from the distributed cache.

10. ! <command>: executes a shell command from the Hive shell.

11. dfs <command>: executes a dfs command from the Hive shell.

12. <query string>: executes a Hive query and prints results to the standard output.

13. source FILE <filepath>: executes a script file inside the CLI.

 

For Example:-

hive> set mapred.reduce.tasks=32;
hive> set;
hive> select a.* from tab1;
hive> !ls;
