Apache Hive - Internal and External Tables

Hive is a Hadoop component that stores structured data in the form of tables within a database. To store and manage that data, users can create internal (managed) and external tables. In this article, we'll cover Apache Hive's internal and external tables in depth.


We call the present time the Data Era for a reason: data is generated from every corner of the world, whether it is what we post on social media platforms, the applications we run in organizations, or the log files those applications produce day in and day out. This is where a framework like Apache Hadoop comes in, with a large set of offerings that make storing and processing this data both time- and cost-effective.

Introduction to Internal and External Tables in Apache Hive

Thus came the term Big Data, which is simply a label for collections of huge datasets characterized by high volume, high velocity, and a wide variety of data, all growing day by day. Analyzing such volumes of data is not feasible with traditional RDBMS systems, hence the need for frameworks like Apache Hadoop. Hadoop is an open-source framework designed to store and process Big Data in a distributed environment. It has two core building blocks:

Hadoop MapReduce:

A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data.

HDFS (Hadoop Distributed File System):

A fault-tolerant distributed file system that stores datasets on commodity hardware and makes them available for processing.

Let us now look in more detail at Apache Hive, one such component of the Hadoop ecosystem, in the sections below.

What is Apache Hive?

As data grew in size, there was also a scarcity of Java developers who could write complex MapReduce jobs for Hadoop. Hence the advent of Hive, which is built on top of Hadoop itself. Hive provides a SQL-like language called HiveQL as the interface for users to extract data from a Hadoop system. Hive translates these simple SQL-like queries into Hadoop MapReduce jobs and runs them against a Hadoop cluster.
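For example, a short aggregation in HiveQL reads like ordinary SQL; the table and column names below are invented purely for illustration, and Hive compiles such a query into one or more MapReduce jobs behind the scenes:

-- page_views is a hypothetical table, used only to illustrate HiveQL syntax
SELECT country, COUNT(*) AS total_views
FROM page_views
GROUP BY country;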

Apache Hive is well suited to data warehousing applications, where the data is structured, static, and formatted. Because of certain design constraints, Hive does not provide row-level updates and inserts (often cited as its biggest disadvantage). And since most Hive queries are turned into MapReduce jobs, they have higher latency due to job start-up overhead.

Based on these details, Hive is not:

  • A relational database
  • Designed for OLTP (Online Transaction Processing)
  • A language for real-time queries and row-level updates


Features of Hive

With a basic understanding of what Apache Hive is, let us now take a look at the features this component of the Hadoop ecosystem provides:

  • Hive stores the schema details (metadata) in a database and keeps the processed data in HDFS
  • Hive is designed for OLAP (Online Analytical Processing)
  • Hive provides a SQL-like language for querying data, named HiveQL or HQL (not to be confused with the HQL of Hibernate, which stands for Hibernate Query Language)
  • Hive is a fast, scalable, and extensible component within the Hadoop ecosystem



What are Hive Internal and External Tables?

Internal or Managed Tables:

Tables created in Hive's own context are very similar to tables created in any RDBMS. Each table that gets created is associated with a directory in the Hadoop HDFS cluster, configured through ${HIVE_HOME}/conf/hive-site.xml (the hive.metastore.warehouse.dir property).

By default, this path is /user/hive/warehouse in HDFS. For example, Hive would create /user/hive/warehouse/match for a table named match, and all the data for that table is stored under the same directory. Such tables are therefore called INTERNAL or MANAGED tables.
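To confirm where a particular table's data actually lives, you can ask Hive to report the location itself; the match table below is just the example name used above:

-- the "Location:" line in the output shows the table's HDFS directory,
-- e.g. .../user/hive/warehouse/match
DESCRIBE FORMATTED match;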

Because the data resides inside the warehouse directory, Hive takes full responsibility for the life-cycle of both the data and the table itself. Consequently, the data is removed the moment an internal table is dropped.
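As a minimal sketch of that life-cycle (the column layout is invented for illustration), a managed table is created without any LOCATION clause, and dropping it removes both the metadata and the files under the warehouse directory:

-- managed (internal) table: Hive owns both the metadata and the data
CREATE TABLE match (
  match_id INT,
  team     STRING,
  score    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- dropping the table also deletes /user/hive/warehouse/match and its files
DROP TABLE match;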


External Tables:

If data already exists in the Hadoop HDFS cluster, an external Hive table can be created to describe it. These tables are called external tables because their data resides in the path specified by the LOCATION clause instead of the default warehouse directory (described in the paragraph above).
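A hedged sketch of such a definition is shown below; the directory and column names are hypothetical, and the LOCATION clause is what points Hive at the data that already exists in HDFS:

-- external table: Hive records only the metadata, the files stay where they are
CREATE EXTERNAL TABLE match_ext (
  match_id INT,
  team     STRING,
  score    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/matches';   -- pre-existing HDFS directory (hypothetical path)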

When an external table is dropped, only the table's metadata is deleted; the data itself is kept as is. In other words, Hive leaves the data residing in the path specified by the LOCATION clause untouched. If you also want to delete that data, run the following command against the underlying HDFS directory:

hadoop fs -rm -r <HDFS path specified in LOCATION>
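Putting the two steps together for the hypothetical external table sketched above: the first command is run inside Hive and removes only the metadata, while the second is run from the shell and removes the underlying files as well.

DROP TABLE match_ext;

hadoop fs -rm -r /data/matches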


Conclusion:

In this article, we introduced Apache Hadoop and then one of the powerful components of the Hadoop ecosystem, Apache Hive. We also looked at how internal and external tables are used within Hive.
