Apache Hive and Apache HBase are two different Hadoop-based Big Data technologies that serve different purposes in almost every practical use case. Take a social media scenario such as Facebook: when you log in, your landing page shows many things at once, such as your friends list, news feed, ad suggestions and friend suggestions. With over 2 billion monthly users accessing Facebook, how does Facebook load all of this content in a presentable manner? The answer is Apache Hadoop working in conjunction with other technologies, two of which we discuss in detail in this article: Apache Hive and Apache HBase. Big data systems are complex enough that each technology usually has to be used alongside others.
Hive should be used for analytical querying of data collected over a period of time, for instance to calculate trends or analyze website logs. Hive should not be used for real-time querying, since it can take a while before any results are returned. HBase is well suited to real-time querying of Big Data. Facebook uses it for messaging and real-time analytics, and may even be using it to count Facebook likes.
Apache Hive is a SQL-like engine that runs atop Apache Hadoop. It is designed for SQL-savvy users and lets them run MapReduce jobs through SQL-like queries. Hive lets developers impose a logical, relational schema on various file formats and physical storage mechanisms, both inside and outside HDFS. Queries are written against these schemas and executed as MapReduce jobs. Hive offers only a limited set of write capabilities and limited interactivity with the data. It is meant for batch transformations and large analytical queries.
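To make the idea of imposing a schema on raw files and running a query as a MapReduce job more concrete, here is a minimal pure-Python sketch. It does not use Hive or Hadoop at all; the log contents, the `SCHEMA` list and the function names are hypothetical, and the map and reduce steps only imitate what a query like `SELECT path, COUNT(*) ... GROUP BY path` might be compiled into.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw log file contents, as they might sit in HDFS with no schema attached.
RAW_LOG = """2024-01-01,/home,200
2024-01-01,/login,500
2024-01-02,/home,200
2024-01-02,/home,404
"""

# Schema applied at read time: (column name, type converter). Illustrative only.
SCHEMA = [("day", str), ("path", str), ("status", int)]


def read_with_schema(raw_text, schema):
    """Parse each delimited line into a typed record, like a table definition laid over a file."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


def map_phase(records):
    """Emit (key, 1) pairs: the 'map' side of a GROUP BY path / COUNT(*) query."""
    for rec in records:
        yield rec["path"], 1


def reduce_phase(pairs):
    """Sum the values per key: the 'reduce' side of the GROUP BY."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


hits_per_path = reduce_phase(map_phase(read_with_schema(RAW_LOG, SCHEMA)))
print(hits_per_path)  # {'/home': 3, '/login': 1}
```

Note that every record is read before any result appears, which is one way to see why this style of processing suits batch analytics rather than real-time lookups.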
Traditional RDBMS professionals tend to take to Apache Hive quickly, as they can simply map HDFS files to Hive tables and query the data. Even HBase tables can be mapped, so Hive can operate on that data too. Apache Hive should be used for data warehousing requirements and when programmers do not want to write complex MapReduce code. However, not all problems can be solved using Apache Hive. For big data applications that require complex, fine-grained processing, Hadoop MapReduce is the best choice.
Apache Hadoop has its own limitations, and one of the biggest is the lack of random access to data. HBase adds this capability when used in conjunction with Hadoop. HBase scales horizontally by adding region servers on off-the-shelf hardware, and it is known as a highly available, consistent, low-latency NoSQL database. It offers a flexible data model, is cost effective, and requires no manual sharding. HBase also works well with sparse data.
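The following toy model illustrates what random access by row key and a sparse data model mean in practice. It is a sketch in plain Python, not the HBase API: the `SparseTable` class, the row keys and the `family:qualifier` column names are all hypothetical.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse, column-oriented key-value table (not the HBase API)."""

    def __init__(self):
        # row key -> {"family:qualifier": value}; columns that are absent take no space.
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write a single cell, addressed directly by row key and column."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def stored_cells(self):
        """Count only the cells that were actually written."""
        return sum(len(cols) for cols in self._rows.values())


users = SparseTable()
users.put("user#1001", "profile:name", "Asha")
users.put("user#1001", "profile:city", "Pune")
users.put("user#1002", "profile:name", "Ben")
users.put("user#1002", "stats:likes", 42)

print(users.get("user#1002", "stats:likes"))  # 42
print(users.get("user#1001", "stats:likes"))  # None: the cell was never stored
print(users.stored_cells())                   # 4
```

Two rows with three distinct columns would occupy six cells in a dense relational table; here only the four cells that were written are stored.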
A few questions to ask yourself before using HBase for any of your Hadoop use cases:
* Do you have sufficient hardware?
* Does your application require additional features that an RDBMS does not provide?
* Do you have enough data?
Apache Hadoop by itself is not a complete Big Data framework for real-time analytics, and this is where HBase comes in: it adds the ability to query data in real time. A need for random reads and writes is another reason to choose HBase as a Big Data solution alongside Apache Hadoop. Other NoSQL databases can also provide this kind of access to the data. HBase provides a rich set of APIs that can be used to push data to it and pull data from it.
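The push and pull operations mentioned above typically come down to writing a row, reading a row by key, and scanning a range of keys. The sketch below imitates those three operations in plain Python over keys kept in sorted order. It is an assumption-laden illustration, not the HBase client API: `SortedKeyStore`, its method names and the `msg#...` row keys are hypothetical.

```python
import bisect


class SortedKeyStore:
    """Toy store with keys kept in sorted order, so point reads and range scans are cheap."""

    def __init__(self):
        self._keys = []  # row keys in sorted order
        self._data = {}  # row key -> value

    def put(self, key, value):
        """Random write: insert or overwrite a single row."""
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        """Random read: fetch a single row by key, or None if absent."""
        return self._data.get(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._data[key]


store = SortedKeyStore()
store.put("msg#alice#003", "see you")
store.put("msg#alice#001", "hi")
store.put("msg#bob#001", "hello")
store.put("msg#alice#002", "lunch?")

print(store.get("msg#bob#001"))
# All of alice's messages, in key order, without touching bob's rows.
print(list(store.scan("msg#alice#", "msg#alice#~")))
```

Designing row keys so that related rows sort next to each other, as with the `msg#<user>#<sequence>` pattern assumed here, is what lets a range scan return them together.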
HBase fits well where it can be integrated with Hadoop MapReduce jobs for bulk operations such as analytics and indexing. One of the best ways to use HBase is to keep all static data in Hadoop and use HBase as the data store for data that changes in real time after processing. Consider using HBase in your organization or use case when you need the following:
* When huge amounts of data are involved
* When ACID properties are desirable but not mandatory
* When the data model schema is sparse
* When your application needs to scale gracefully
Having covered each technology in the sections above, we can now look at the differences between them. This should deepen your understanding of both products and help you decide which one to use in which situation. Let us take a closer look at the differences between Hive and HBase.
|Apache Hive is a query engine||HBase is a data store particularly suited to unstructured data|
|Apache Hive is not really a database but a MapReduce-based SQL engine that runs atop Hadoop||HBase is a NoSQL database that is commonly used for real-time data streaming|
|Apache Hive is used for batch processing (OLAP-style workloads)||HBase is widely used for transactional processing, where query responses need to be highly interactive (OLTP-style workloads)|
|Operations in Hive don’t run in real time||Operations in HBase run in real time directly on the database instead of being transformed into MapReduce jobs|
|Apache Hive is to be used for analytical queries||HBase is to be used for real time queries|
|Apache Hive has the limitation of higher latency||HBase has the limitation of no built-in analytical capabilities|
HBase and Hive can be used together on the same Hadoop cluster to achieve more than either product can alone. The two technologies are better seen as working hand in hand than as competing with each other. Let us look at the use cases where they go hand in hand:
* Hive works well as an ETL tool for batch inserts into HBase, and can then run queries that join data in HBase tables with data already present in HDFS.
* HiveQL queries can be written against HBase tables, making use of Hive’s grammar and parser, query planner and query execution engine.
* Apache Hive has a dedicated library for interacting with HBase, which acts as a mediator layer between Hive and HBase.
* One issue to consider when integrating Hive with HBase is the impedance mismatch between HBase’s sparse, untyped schema and Hive’s dense, typed schema.
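The impedance mismatch in the last point can be sketched in a few lines of Python. The example projects sparse rows of raw byte cells onto a fixed list of typed columns, which is roughly the job a mediator layer has to do. It is not Hive's actual integration code: `COLUMN_MAPPING`, `to_dense_row` and the sample rows are hypothetical, and the choice to turn unconvertible cells into `None` is an assumption made for the illustration.

```python
# Hypothetical mapping: dense column name -> ("family:qualifier" in the sparse store, type converter).
COLUMN_MAPPING = {
    "name": ("profile:name", str),
    "city": ("profile:city", str),
    "likes": ("stats:likes", int),
}


def to_dense_row(row_key, cells, mapping):
    """Project a sparse dict of raw byte cells onto a fixed, typed column list.

    Missing cells become None (NULL in SQL terms). Cells that cannot be
    converted to the declared type also become None in this sketch.
    """
    dense = {"key": row_key}
    for column, (qualifier, cast) in mapping.items():
        raw = cells.get(qualifier)
        if raw is None:
            dense[column] = None
            continue
        try:
            dense[column] = cast(raw.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            dense[column] = None
    return dense


# Sparse, untyped input: every row has a different set of cells, all stored as bytes.
sparse_rows = {
    "user#1001": {"profile:name": b"Asha", "profile:city": b"Pune"},
    "user#1002": {"profile:name": b"Ben", "stats:likes": b"42", "stats:badge": b"gold"},
    "user#1003": {"stats:likes": b"n/a"},
}

dense_rows = [to_dense_row(key, cells, COLUMN_MAPPING) for key, cells in sorted(sparse_rows.items())]
for row in dense_rows:
    print(row)
```

Every output row has the same four columns regardless of what was stored. Cells outside the mapping, such as `stats:badge`, are invisible to the dense view, and absent or malformed cells surface as nulls.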
In this article, we have looked at Apache Hive and HBase individually and in detail. To clarify what each technology offers, we have laid out the differences between them. We have also covered the advantages of using both technologies together to achieve more than either can alone.
Hive and HBase are two different Hadoop-based technologies: Hive is a SQL-like engine that runs MapReduce jobs, whereas HBase is a NoSQL key/value database on Hadoop. Hive can be used for analytical queries, while HBase can be used for real-time querying. Data can even be read and written from Hive to HBase and back again.