Hive fundamentals

shorya sharma
5 min read · Aug 13, 2021


Hive Introduction

Hive was initially developed by Facebook in 2007 to handle massive data growth. Their RDBMS data warehouse was taking too long to process daily jobs, so Facebook decided to move its data into the scalable, open-source Hadoop environment. But writing MapReduce programs against Hadoop was not easy for many users, so the vision was to bring familiar database concepts to the unstructured world of Hadoop while still maintaining Hadoop's extensibility.

Hive was open sourced in 2008.

So what is Hive?

It is a data warehouse system built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and analysis of large datasets stored in Hadoop. Hive provides a SQL-like interface (HQL) for data stored in Hadoop; HQL queries are implicitly translated into one or more Hadoop MapReduce jobs for execution. It also provides a mechanism to project structure onto Hadoop datasets.
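For example, an HQL query reads just like SQL; behind the scenes Hive compiles it into MapReduce jobs. A minimal sketch (the table and column names here are hypothetical):

    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page;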

So what is Hive not?

It is not a full database, it is not a real-time processing system, and it is not SQL-92 compliant.

Hive Architecture

There are different ways you can interface with Hive. You can use a web browser to access Hive via the Hive Web Interface. You can also access Hive from an application over JDBC, ODBC, or the Thrift API, each made possible by Hive's Thrift server, referred to as HiveServer. HiveServer2 was released in Hive 0.11 as a replacement for HiveServer1, though you can still choose which HiveServer to run, or even run both concurrently. HiveServer2 brings many enhancements, including the ability to handle concurrent clients.
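As a rough sketch, a JDBC client connects to HiveServer2 using a URL of the following form (host and database are placeholders; 10000 is HiveServer2's default port):

    jdbc:hive2://<host>:10000/<database>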

Hive also comes with powerful command-line interfaces (often referred to as the "CLI"). The introduction of HiveServer2 brought with it a new Hive CLI called Beeline, which can run in embedded mode or thin client mode. In thin client mode, Beeline connects to Hive via JDBC and HiveServer2. The original CLI is also included with Hive and can run in embedded mode or as a client to HiveServer1.
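For instance, a minimal Beeline session in thin client mode might look like this (the hostname and username are placeholders):

    $ beeline -u jdbc:hive2://localhost:10000 -n hiveuser
    0: jdbc:hive2://localhost:10000> SHOW DATABASES;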

Hive comes with a catalogue known as the Metastore. The Metastore stores the system catalogue and metadata about tables, columns, partitions, and so on; it is what makes mapping file structure to a tabular form possible in Hive. A newer component of Hive is called HCatalog.
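You rarely query the Metastore directly; instead, HQL metadata commands read from it. A few examples (the table name is hypothetical):

    SHOW TABLES;
    DESCRIBE FORMATTED page_views;  -- columns, storage location, table type, etc.
    SHOW PARTITIONS page_views;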

HCatalog is built on top of the Hive Metastore and incorporates Hive's DDL. HCatalog provides read and write interfaces for Pig and MapReduce and uses Hive's command-line interface for issuing data definition and metadata exploration commands. Essentially, HCatalog makes it easier for users of Pig, MapReduce, and Hive to read and write data on the grid. The Hive Driver, Compiler, Optimizer, and Executor work together to turn a query into a set of Hadoop jobs.

The Driver manages the lifecycle of a HiveQL statement as it moves through Hive, maintaining a session handle and any session statistics. The Query Compiler compiles HiveQL queries into a DAG of MapReduce tasks. The Execution Engine executes the tasks produced by the compiler in proper dependency order, interacting with the underlying Hadoop instance (the NameNode, JobTracker, and so on).
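You can inspect the compiler's output yourself: prefixing a statement with EXPLAIN prints the query plan, including the stages the Execution Engine will run (the query is illustrative):

    EXPLAIN
    SELECT page, COUNT(*) FROM page_views GROUP BY page;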

Hive Directory Structure

1. Lib directory

  • $HIVE_HOME/lib
  • Location of Hive JAR files
  • Contains the actual Java code that implements Hive's functionality

2. Bin directory

  • $HIVE_HOME/bin
  • Location of Hive scripts/services

3. Conf directory

  • $HIVE_HOME/conf
  • Location of configuration files

Hive Command Line Interface (CLI)

  • Most common way to interact with Hive
  • From the shell you can (see the sketch after this list):

> Perform Queries, DML, and DDL

> View and manipulate table metadata

> Retrieve query explain plans

  • The Beeline and original Hive CLIs are both located in $HIVE_HOME/bin
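As a sketch, here are a few common ways to invoke the original CLI (the script filename is hypothetical; -e runs an inline statement and -f runs a script file):

    $ hive                        # start the interactive shell
    $ hive -e 'SHOW TABLES;'      # run an inline HQL statement
    $ hive -f create_tables.hql   # run HQL statements from a script file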

Hive Metastore

There’s three configurations you can choose for your metastore.

The first is embedded, which runs the metastore code in the same process as your Hive program, with the database that backs the metastore in that process as well. The embedded metastore is likely to be used only in a test environment.

The second option is to run the metastore as local, which keeps the metastore code running in process but moves the database into a separate process that the metastore code communicates with.

The last option is to set up a remote metastore, which moves the metastore itself out of the process as well. A remote metastore is useful if you wish to share the metastore with other users, and it is the configuration you are most likely to use in production, as it provides some additional security benefits on top of what's possible with a local metastore.
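These modes are selected through properties in hive-site.xml. A minimal sketch, with placeholder hosts and database names: a local metastore points javax.jdo.option.ConnectionURL at an out-of-process database, while a remote metastore sets hive.metastore.uris so clients connect to a standalone metastore service (9083 is the conventional port):

    <!-- local: metastore code in process, database out of process -->
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:mysql://dbhost/metastore_db</value>
    </property>

    <!-- remote: clients talk to a separate metastore service -->
    <property>
      <name>hive.metastore.uris</name>
      <value>thrift://metastore-host:9083</value>
    </property>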

Hive Data Units

  • Organisation of Hive data

Database -> Table -> Partition -> Buckets

First, data is organized into databases, which are namespaces that protect tables and other data units from naming conflicts.

Next, data is organized into tables, which are homogeneous units of data that share the same schema.

Data can then be organized into partitions, though this is not a requirement. A Partition in Hive is a virtual column that defines how data is stored on the file system based on its values. A table can have zero or more partitions.
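For example, a table partitioned by date keeps each date's rows in its own subdirectory, and the partition column can be used in queries like any other column (all names here are hypothetical):

    CREATE TABLE page_views (page STRING, user_id BIGINT)
    PARTITIONED BY (view_date STRING);

    -- each value of view_date maps to a directory such as .../page_views/view_date=2021-08-01
    SELECT COUNT(*) FROM page_views WHERE view_date = '2021-08-01';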

Finally, in each partition, data can be organized into Buckets based on the hash value of a column in the table.
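Bucketing is also declared when the table is created: Hive hashes the chosen column and distributes rows across a fixed number of files per partition. A sketch, reusing the hypothetical table above:

    CREATE TABLE page_views_bucketed (page STRING, user_id BIGINT)
    PARTITIONED BY (view_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS;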

Again, note that tables need not be partitioned or bucketed in Hive; however, these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution and reduced latency.

Types of tables in Hive

(Image: Hive internal vs. external tables. Source: https://data-flair.training/blogs/wp-content/uploads/sites/2/2017/09/Hive-internal-table-vs-external-table-1200x675.jpg)

Tables in Hive can be either managed (internal) or external. Tables are managed by Hive by default: Hive controls both the metadata for the table and the lifecycle of the actual data in it. Managed table data is stored in subdirectories within the configured warehouse directory, and dropping a managed table deletes the actual table data in addition to the metadata Hive stored for that table.

Let's contrast this behaviour with Hive's external tables. An external table's data file(s) are stored in a location outside of Hive, and Hive does not assume it owns the data in the table: dropping an external table deletes just the table metadata and leaves the actual data untouched. External tables are useful if you are sharing your data with other tools outside of Hive.
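As a sketch, an external table simply points Hive at files that already live at some path (the location and names here are hypothetical), and dropping it removes only the metadata:

    CREATE EXTERNAL TABLE page_views_ext (page STRING, user_id BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/shared/page_views';

    DROP TABLE page_views_ext;  -- removes metadata only; the files under /data/shared/page_views remain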
