Apache Hive - Internal and External Tables

Hive is a Hadoop component that stores structured data in the form of tables within a database. To store and manage that data, users can create internal (managed) and external tables. In this article, we'll cover Apache Hive's internal and external tables in depth.


We call the present time the Data Era for a reason: data is generated from every corner of the world, whether it is what we post on social media platforms, the applications we run in organizations, or the log files those applications produce day in and day out. This is where a framework like Apache Hadoop comes in, with a large set of offerings that make storing and processing this data both time- and cost-effective.

Introduction to Internal and External Tables in Apache Hive

Thus came the term Big Data, which is simply a label for collections of huge datasets characterized by high volume, high velocity, and a wide variety of data, all growing day by day. Analyzing such volumes of data is not feasible with traditional RDBMS systems, hence the need for frameworks like Apache Hadoop. Hadoop is an open-source framework designed to store and process Big Data in a distributed environment. It has two core building blocks:

Hadoop MapReduce:

A parallel programming model for processing large amounts of structured, semi-structured, and unstructured data.

HDFS (Hadoop Distributed File System):

A fault-tolerant distributed file system that stores datasets on commodity hardware and makes them available for processing.

Let us now look in more detail at Apache Hive, one such component of the Hadoop ecosystem, in the sections below.

What is Apache Hive?

As data grew in size, there was also a scarcity of Java developers who could write complex MapReduce jobs for Hadoop. Hence the advent of Hive, which is built on top of Hadoop itself. Hive provides a SQL-like language called HiveQL as the interface for users to extract data from a Hadoop system. Hive translates these simple SQL-like queries into Hadoop MapReduce jobs and runs them against a Hadoop cluster.
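For example, a short aggregation in HiveQL reads like ordinary SQL; the table and column names below are invented purely for illustration, and Hive compiles such a query into one or more MapReduce jobs behind the scenes:

-- page_views is a hypothetical table, used only to illustrate HiveQL syntax
SELECT country, COUNT(*) AS total_views
FROM page_views
GROUP BY country;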

Apache Hive is well suited to data warehousing applications, where the data is structured, static, and formatted. Because of certain design constraints, Hive does not provide row-level updates and inserts (often cited as its biggest disadvantage). And since most Hive queries are turned into MapReduce jobs, they have higher latency due to job start-up overhead.

Based on these details, Hive is not:

  • A relational database
  • Designed for OLTP (Online Transaction Processing)
  • A language for real-time queries and row-level updates


Features of Hive

With a basic understanding of what Apache Hive is, let us now take a look at the features this component of the Hadoop ecosystem provides:

  • Hive stores the schema details (metadata) in a database and keeps the processed data in HDFS
  • Hive is designed for OLAP (Online Analytical Processing)
  • Hive provides a SQL-like language for querying data, named HiveQL or HQL (not to be confused with the HQL of Hibernate, which stands for Hibernate Query Language)
  • Hive is a fast, scalable, and extensible component within the Hadoop ecosystem



What are Hive Internal and External Tables?

Internal or Managed Tables:

Tables created in Hive's own context are very similar to tables created in any RDBMS. Each table that gets created is associated with a directory in the Hadoop HDFS cluster, configured through ${HIVE_HOME}/conf/hive-site.xml (the hive.metastore.warehouse.dir property).

By default, this path is /user/hive/warehouse in HDFS. For example, Hive would create /user/hive/warehouse/match for a table named match, and all the data for that table is stored under the same directory. Such tables are therefore called INTERNAL or MANAGED tables.
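To confirm where a particular table's data actually lives, you can ask Hive to report the location itself; the match table below is just the example name used above:

-- the "Location:" line in the output shows the table's HDFS directory,
-- e.g. .../user/hive/warehouse/match
DESCRIBE FORMATTED match;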

Because the data resides inside the warehouse directory, Hive takes full responsibility for the life-cycle of both the data and the table itself. Consequently, the data is removed the moment an internal table is dropped.
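As a minimal sketch of that life-cycle (the column layout is invented for illustration), a managed table is created without any LOCATION clause, and dropping it removes both the metadata and the files under the warehouse directory:

-- managed (internal) table: Hive owns both the metadata and the data
CREATE TABLE match (
  match_id INT,
  team     STRING,
  score    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- dropping the table also deletes /user/hive/warehouse/match and its files
DROP TABLE match;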


External Tables:

If data already exists in the Hadoop HDFS cluster, an external Hive table can be created to describe it. These tables are called external tables because their data resides in the path specified by the LOCATION clause instead of the default warehouse directory (described in the paragraph above).
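A hedged sketch of such a definition is shown below; the directory and column names are hypothetical, and the LOCATION clause is what points Hive at the data that already exists in HDFS:

-- external table: Hive records only the metadata, the files stay where they are
CREATE EXTERNAL TABLE match_ext (
  match_id INT,
  team     STRING,
  score    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/matches';   -- pre-existing HDFS directory (hypothetical path)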

When an external table is dropped, only the table's metadata is deleted; the data itself is kept as is. In other words, Hive leaves the data residing in the path specified by the LOCATION clause untouched. If you also want to delete that data, run the following command against the underlying HDFS directory:

hadoop fs -rm -r <HDFS path specified in LOCATION>
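Putting the two steps together for the hypothetical external table sketched above: the first command is run inside Hive and removes only the metadata, while the second is run from the shell and removes the underlying files as well.

DROP TABLE match_ext;

hadoop fs -rm -r /data/matches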


Conclusion:

In this article, we introduced Apache Hadoop and then one of the powerful components of the Hadoop ecosystem, Apache Hive. We also looked at how internal and external tables are used within Hive.
