Introduction To Hadoop Hive Query Language

Hadoop Articles

Hadoop Quiz

Test and Explore your knowledge

Hive Query Language

Hive QL is the HIVE QUERY LANGUAGE
Hive offers no support for row-level inserts, updates, and deletes.
Hive does not support transactions.
Hive adds extensions to provide better performance in the context of Hadoop and to integrate with custom extensions and even external programs.
DDL and DML are the parts of HIVE QL
Data Definition Language (DDL) is used for creating, altering and dropping databases, tables, views, functions and indexes.
Data manipulation language is used to put data into Hive tables and to extract data to the file system and also how to explore and manipulate data with queries, grouping, filtering, joining etc.

Accelerate your career with Hadoop Training and become experts in Apache Hadoop.

Databases in Hive:

The Databases in the Hive is essentially just a catalog or namespace of tables.
They are very useful for larger clusters with multiple teams and users, as a way of avoiding table name
Hive provides commands such as
- CREATE DATABASE db name -- to create a database in Hive
- USE db name -- To use the database in Hive.
- DROP db name -- To delete the database in Hive.
- SHOW DATABASE -- to see the list of the DataBase

If no database is specified, tables belong to the default Data Base.

Tables in Hive:

Hive table is logically made up of the data being stored and the associated metadata describing the layout of the data in the table.

The data typically resides in HDFS, although it may reside on any Hadoop file system including the local file system.
Hive stores the metadata in a relational database and not in HDFS.
The command for creating a table in Hive is

have>CREATE TABLE EMP (empid int, ename string, esal double)
ROW FORMAT DELIMITED FIELDS TERMINATED By ‘t’ LINES TERMINATED by ‘n’ STORED AS TEXT FILE;

To display the description of the table we use have>desc emp;

To have, we are having two types of tables

Managed tables
External tables

1. Managed tables

Managed tables are the one which will be managed in the Hive warehouse i.e. whenever we create a managed table definition, it will be stored under the default location of the Hive warehouse i.e./user/Hive/warehouse.

When we drop a managed table, Hive deletes the data in the table
Managed tables are less convenient for sharing with other tools.

Checkout Hadoop Tutorial

Syntax for creating Hive managed table:-

Hive>create table manage- tab (empid, ename string, esal int) row format delimited fields terminated by ‘t’ lines terminated by ‘m’ stored as a text file;

As discussed above, the table will be created under/user/Hive/warehouse/managed-tab by giving the command as

#hadoop fs –ls/user/Hive/warehouse.

How to load the data into managed tables

We can load the data in two ways

Local Mode
HDFS Mode

In local mode, the syntax is

hive>load data local in path’/home/new Batch/input1.txt’
Into table managed-tab;

For HDFS mode, the syntax is

hive>load data in path’/user/ramesh/Hive/input2.txt’
Into table managed – tab;

Once the successful loading of the table and once the file is loaded, the file will be deleted in HDFS path and we can see in use/Hive/ware house

2) External Tables:

Along with the managed tables, Hive also uses external tables.

Whenever the key word ‘external’ comes in the table definition part. A hive will not bother about the table definition, i.e. the external table will not be managed by the Hive warehouse system.

Along with the external keyword, we can also mention the ‘location’ in the table definition, where exactly the table definition will get stored.

When you drop an external table, Hive leaves the data untouched and only delete the meta data.

Syntax:-

Hive>create external table external- tab(empid int, ename string, esal double) 
row format delimited fields
Terminated by ‘f’ lines terminated by ‘n’ stored as text file location 
‘userRameshHive-external’;

A location will be automatically created.

Loading data into External Tables:-

Loading data from HDFS to

Hive>load data in path’/Ramesh/input data.txt’ into table external-tab;

Flow of Data in Hive process at the sample location
If we delete the managed table, both the schema and the data file will be deleted.
But, if we delete external tables, only the schema will be deleted and data file will be there in the specified location.

Frequently asked Hadoop Interview Questions

Difference between managed tables & External Tables:

One of the main differences between managed and external tables in Hive is that when an external table is dropped, the data associated with it does not get deleted from only the meta data (no. of cols, types of cols, terminators etc.) gets dropped form the Hive meta store

When a managed table gets dropped, both the metadata and data get dropped.
I have so far always preferred making table external because if the schema of my Hive table changes, I can just drop the external table and recreate another external table over the same HDFS data with the new schema

Hive>Create external table log in for tab(log id int, log Error string,Log error count int)
row format delimited fields 
terminated by’f’ stored as text file location ‘user/external location’;
Hive>select*from log in for tab;

We get the result from the file which we specified in the location path

For external tables, no need to load the data explicitly.
However, most of the changes to the schema can now be made through ALTER TABLE or similar command
So, the recommendation to use external tables over managed tables might be more of a legacy concern than a contemporary one.

Altering Table:

Most table properties can be altered with the ALTER TABLE statement, which change metadata about the table but not the table itself
ALTER TABLE modifies table meta data on.
Then statements can be used to fix mistakes in schema, move partition locations and do other operations.

Renaming a Table:

This statement is used to rename the table Log_messages to log msgs

Cmd: ALTER TABLE log _ messages RENAME To logmsgs;

Changing columns

You can rename a column, change its position, type or comment.

Syntax:

ALTER TABLE log-messages CHANCE COLUMN hms hours-minutes-

Seconds INT COMMENT ’The hours, minutes and seconds are part of the times tamp’ AFTER Severity;

You have to specify the old name, a new name and the type even if the name or type is not changed.
If you are not moving the column, the AFTER other – column close is not necessary.
In the example shown, we move the column after the column
If you want to move the column to the first position, use FIRST instead of AFTER other – column.

Adding Columns

You can add new columns to the end of the existing columns, before any partition.

 Example: ALTER TABLE Log-message ADD COLUMNS(app-name String COMMENT” 
Application Name” ,session-id long);

Deleting or replacing columns:

The replace statement can only be used with tables that use one of the native ser De modules are Dynamic Ser De or Metadata Type column set ser De.

Ser De determines how records are parsed into columns i.e deserialization and how records columns are stored (serialization)

Ex:-ALTER TABLE log- messages REPLACE COLUMNS(
Hours-mins-sees INT
Severity STRING
Message String);

This statement effectively renames the original hms column and removes the server and process – id columns from the original schema definition.

As for all the ALTER Statements, only the table metadata is changed.

Related Page: Hadoop Archive files in HDFS

Partitioning and Bucketing:

Hive organizes tables into partitions, a way of dividing a table into course – grained parts based on the value of a partition column, such as date.

Using partition can make it faster to do queries on slices of the data.
Tables or partitions may further be sub divided into buckets, to give extra structure to the data that may be used for more efficient queries.
For example, bucketing by user ID means we can quickly evaluate a user based query by running if on a randomized sample of the total set of users.

Partitions:

A table may be partitioned in multiple dimensions.
For example, in addition to partitioning logs by date, we might also subpartition each date partition by country to permit efficient queries by location.
Partitioned are defined at table creation time using the PATITIONED by the clause, which takes a list of column definitions.
If we want to search a large amount of data, then we can divide the large data into partitions.

Ex:-

hive>create table party table(loaded int, log error string)
PARTITIONED BY (Logdt string, country string) row format delimited field 
terminated by ‘t’ lines terminated by ‘n’ stored as text file

Partition is one of the concepts in Hive where exactly certain things are grouped by the means of column combination.

Buckets:

There are two reasons why you might want to organize your tables (or partitions) into buckets.

The first is to enable more efficient queries.
The second reason to bucket a table is to make sampling more efficient.

Example:-

Tell Hive that a table should be bucketed. And we use the CLUSTERED By clause to specify the columns to a bucket and the number of buckets

hive>CREATE TABLE bucketed users(id INT, name STRINA)
CLUSTERED BY (id)INTO 4 BUCKETS;

Here we are using the user ID to determine the bucket the Hive does which is done by hashing the value and reducing module and the number of buckets, so any particular bucket will effectively have a random set of users in it.

The data within a bucket may additionally be stored by one or more columns.
This allows even more efficient map-side joins since the join of each bucket becomes an efficient merge sort.

Syntax for delving that a table has sorted buckets is:

have>CREATE TABLE bucketed users(id INT, name STRING)
CLUSTERED By(id)SORTED By(id ASC)INTO 4 BUCKETS;

Explore Hadoop Sample Resumes! Download & Edit, Get Noticed by Top Employers!Download Now!

List of Big Data Courses:

Hadoop Administration	MapReduce
Big Data On AWS	Informatica Big Data Integration
Bigdata Greenplum DBA	Informatica Big Data Edition
Hadoop Hive	Impala
Hadoop Testing	Apache Mahout

On-Job Support Service

Online Work Support for your on-job roles.

@Learner@SME

Our work-support plans provide precise options as per your project tasks. Whether you are a newbie or an experienced professional seeking assistance in completing project tasks, we are here with the following plans to meet your custom needs:

Pay Per Hour
Pay Per Week
Monthly

Learn MoreContact us

Course Schedule

Name	Dates
Hadoop Training	Oct 25 to Nov 09	View Details
Hadoop Training	Oct 28 to Nov 12	View Details
Hadoop Training	Nov 01 to Nov 16	View Details
Hadoop Training	Nov 04 to Nov 19	View Details

Last updated: 29 Apr 2023

About Author

Ravindra Savaram

Ravindra Savaram is a Technical Lead at Mindmajix.com. His passion lies in writing articles on the most popular IT platforms including Machine learning, DevOps, Data Science, Artificial Intelligence, RPA, Deep Learning, and so on. You can stay up to date on all these technologies by following him on LinkedIn and Twitter.

read less

Recommended Courses