Apache Hive - Internal and External Tables

Introduction:

We live in what is often called the Data Era, and for good reason: data is being generated in every corner of the world. It includes the content we see on social media platforms, the applications that organizations run, and the log files those applications generate day in and day out. Storing and analyzing data at this scale in a time- and cost-effective way calls for a dedicated framework.

This gave rise to the term Big Data, which refers to collections of very large datasets characterized by high volume, high velocity and a wide variety of data types, all growing day by day. Analyzing such amounts of data is not practical with traditional RDBMS systems, hence the need for frameworks like Apache Hadoop. Hadoop is an open-source framework that aims to store and process Big Data in a distributed environment.

Hadoop MapReduce:

It is a parallel programming model for processing large amounts of structured, semi-structured and unstructured data.

HDFS (Hadoop Distributed File System):

It stores the datasets and provides a fault-tolerant file system designed to run on commodity hardware.

Let us look in more detail at Apache Hive, a component of the Hadoop ecosystem built on top of these two, in the sections below.

What is Apache Hive?

As data grew in size, there was also a scarcity of Java developers who could write complex MapReduce jobs for Hadoop. Hive was created on top of Hadoop to address this. It provides a SQL-like language, termed HiveQL, through which users can extract data from a Hadoop system. Hive transforms simple SQL queries into Hadoop MapReduce jobs and runs them against a Hadoop cluster.

Apache Hive is well suited for data warehousing applications in which the data is structured, static and formatted. Because of certain design constraints, Hive was not built for row-level updates and inserts (often cited as the biggest disadvantage of using Hive). As most Hive queries are turned into MapReduce jobs, they have higher latency due to job start-up overhead.
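
As a minimal sketch of what such a query looks like, the HiveQL below counts requests per status code. The table web_logs and its columns are hypothetical and are not part of the original article. EXPLAIN is a standard HiveQL statement that prints the plan Hive generates for a query instead of running it.

```sql
-- Hypothetical table for illustration: web_logs(ip STRING, url STRING, status INT)

-- An ordinary SQL-style aggregation. When Hive runs on MapReduce,
-- a GROUP BY like this is typically compiled into a map phase and a reduce phase.
SELECT status, COUNT(*) AS requests
FROM web_logs
GROUP BY status;

-- Print the execution plan (stages and operators) without running the query.
EXPLAIN
SELECT status, COUNT(*) AS requests
FROM web_logs
GROUP BY status;
```

The exact plan output depends on the Hive version and the configured execution engine.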

Based on these details, Hive is not:

  • A relational database
  • Designed for OLTP (Online Transaction Processing)
  • A language for real-time queries and row-level updates

Features of Hive:

With the basic understanding of what Apache Hive is, let us now take a look at all the features that are provided with this component of the Hadoop ecosystem:

  • Hive stores schema details in a database (the metastore) and the processed data in HDFS
  • Hive is designed for OLAP (Online Analytical Processing)
  • Hive provides a SQL-like language for querying data, named HiveQL or HQL (not to be confused with HQL from Hibernate, which stands for Hibernate Query Language).
  • Hive is a fast, scalable and extensible component within the Hadoop ecosystem.

What are Hive Internal and External Tables?

Internal or Managed Tables:

Tables created in Hive are very similar to tables created in any RDBMS. Each table is associated with a directory in HDFS, under the warehouse directory configured in ${HIVE_HOME}/conf/hive-site.xml (the hive.metastore.warehouse.dir property).

By default, this path is /user/hive/warehouse in HDFS. For example, for a table named match, Hive creates the directory /user/hive/warehouse/match in HDFS. All the data for the table is stored in that directory, and hence such tables are called INTERNAL or MANAGED tables.

When the data resides in an internal table, Hive takes full responsibility for managing the life cycle of both the data and the table itself. Consequently, the data is removed as soon as the internal table is dropped.
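
The life cycle of a managed table can be sketched in HiveQL as follows. The table name match comes from the example above, while the column names and the comma-delimited text format are assumptions made purely for illustration.

```sql
-- Show the warehouse directory currently configured for this session.
SET hive.metastore.warehouse.dir;

-- Managed (internal) table: no EXTERNAL keyword and no LOCATION clause,
-- so Hive places the data under the warehouse directory.
-- Column names below are hypothetical.
CREATE TABLE match (
  match_id  INT,
  home_team STRING,
  away_team STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Inspect the table: the output includes its Table Type and its Location in HDFS.
DESCRIBE FORMATTED match;

-- Dropping a managed table removes the metadata and the data files.
DROP TABLE match;
```

The details reported by DESCRIBE FORMATTED, and the default warehouse path itself, can differ between Hive versions and distributions.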

External Tables:

If the data already exists in the HDFS cluster, an external Hive table can be created to describe it. These tables are called external tables because their data resides in the path specified by the LOCATION clause instead of the default warehouse directory (as described in the section above).

When an external table is dropped, its metadata is deleted but the data is kept as is. In other words, Hive leaves the data residing in the path specified by the LOCATION clause untouched. If you want to delete that data as well, remove it from HDFS yourself, replacing 'tablename' with the HDFS path of the table's data:

hadoop fs -rm -r 'tablename'
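
For comparison, here is a minimal sketch of an external table in HiveQL. The table name match_ext, the columns, and the HDFS directory /data/match_logs are all hypothetical. The example assumes comma-delimited text files already exist in that directory.

```sql
-- External table: the EXTERNAL keyword plus a LOCATION clause pointing at
-- data that already exists in HDFS (hypothetical path and columns).
CREATE EXTERNAL TABLE match_ext (
  match_id  INT,
  home_team STRING,
  away_team STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/match_logs';

-- Inspect the table: Table Type is reported as EXTERNAL_TABLE.
DESCRIBE FORMATTED match_ext;

-- Removes only the metastore entry; the files under /data/match_logs remain in HDFS.
DROP TABLE match_ext;
```

Because the files survive the drop, the same CREATE EXTERNAL TABLE statement can be run again later to make the data queryable once more.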

Conclusion:

In this article, we introduced Apache Hadoop and then one of the powerful components of the Hadoop ecosystem, Apache Hive. We also looked at how internal and external tables are used in Hive.