
Hive

  • In this session we are going to learn about Hive. We will see:
    • What exactly is Hive?
    • How is Hive similar to, and different from, MapReduce?
    • Hive introduction
    • Hive versus RDBMS
    • Hive architecture
    • Hive basic vocabulary
    • Hive query language
    • Hive joins
    • Hive operations, etc.

Contents

  • Map Reduce vs Hive
  • Hive Introduction
  • Hive vs RDBMS
  • Hive Architecture
  • Hive Basic Vocabulary
  • Hive Query Language
  • Hive Joins

Map Reduce vs Hive

Hadoop

  • Using traditional data management systems, it is difficult to process Big Data.
  • Big Data, by definition, is data that is very difficult to handle using the conventional tools we use in day-to-day life, such as SAS, R, Excel, or SQL.
  • Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
  • Hadoop is an open-source framework to store and process Big Data in a distributed environment.
  • It contains two core modules: MapReduce and the Hadoop Distributed File System (HDFS).

MapReduce:

  • MapReduce is a form of distributed computing: we first divide the overall objective into smaller tasks, then write a MapReduce program in which the map carries out some computation and the reducer takes the map’s output as input to produce the required final output. As we discussed in earlier sessions, HDFS provides the distributed file storage. With MapReduce we saw programs such as word count, line count, and finding an average. While writing those programs, we observed that HDFS and MapReduce together speed up data processing and computation.
  • It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

HDFS:

  • The Hadoop Distributed File System is part of the Hadoop framework, used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.

MapReduce needs Java

  • While HDFS and MapReduce programming solve the problem of handling big data, it is not very easy to write MapReduce code.
  • One has to understand the logic of the overall algorithm and then convert it to the MapReduce format.
  • MapReduce programs are written in Java.
  • We need to be expert in Java to write MapReduce code for intermediate and complex problems.
  • Not everyone has a Java background.

Map Reduce Made Easy

  • The traditional approach uses Java MapReduce programs for structured, semi-structured, and unstructured data.
  • This way of executing MapReduce is not easy.
  • Hive was created to make the MapReduce process easy, through the Hive query language.
  • The Hive query language is similar to SQL.
  • Let’s say we want to find the average of some data. Writing map and reduce functions in Java for this is difficult; writing an SQL query is much easier. Even for someone who does not know SQL, it is very easy to learn.
  • So writing queries in Hive is much easier than writing Java code. Hive is like SQL on top of Hadoop.
  • If we write a query in Hive, it will understand the query, create the map and reduce functions, run them against the data in the Hadoop Distributed File System, perform the analysis, and fetch the results back (the EXPLAIN sketch below lets us peek at that plan).
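
If you are curious about what Hive generates behind the scenes, the EXPLAIN keyword prints the plan the compiler produces for a query, including its map/reduce stages, without actually running it. A minimal sketch, assuming some existing table named your_table (a hypothetical name):

    -- Show the execution plan Hive compiles for this query,
    -- including the MapReduce stages it would launch.
    EXPLAIN SELECT count(*) FROM your_table;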

Line Count: Code

  • Let’s take an example of a line count program.
  • In Hive we just have to write “SELECT count(*) FROM table” instead of writing a huge MapReduce program.
  • For this particular line count program, we would need many lines of Java code, whereas in Hive a single-line query is enough, as shown below.
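
For reference, the entire Hive version of the line count is the single statement below; table_name stands in for any existing table:

    -- Count the number of rows (lines) in the table.
    SELECT count(*) FROM table_name;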

Hive Introduction

Hive

  • Basically, Hive is a tool that makes MapReduce easy. It converts our SQL-like queries into MapReduce programs and executes them on top of Hive data warehouse tables.
  • We simply write the query, and in the background Hive converts it to MapReduce.
  • Hive runs on top of Hadoop.
  • Hive is SQL on top of Hadoop.
  • Learning Hive is like learning SQL.
  • Originally developed by Facebook.
  • Further developed and maintained by the Apache Software Foundation.
  • Mainly used for basic statistics, reporting, and data analytics.

Features of Hive

  • Enables easy data summarization: given some data, we can write a query and immediately get the summary.
  • Ad-hoc analysis of large volumes of data: the data is stored on HDFS, Hadoop takes care of distributing it across the nodes, and the queries we write in Hive are converted into MapReduce programs that run over that data.
  • SQL-like queries: Hive has the Hive Query Language (HiveQL, or HQL), which is very similar to SQL (see the sketch below).
  • Scaling up is easy: irrespective of the data volume, we can write queries in Hive and get the required results.
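
As a taste of how SQL-like HQL is, a summarization query might look like this; the sales table, its columns, and the aggregation are purely hypothetical illustrations:

    -- Total amount per region over a hypothetical sales table.
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region;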

Hive vs RDBMS

Hive is different from RDBMS

  • One may think of Hive as the big data version of a relational database management system (RDBMS). There are some key differences.
  • Hive is more of a storage-and-processing data warehouse than a database.
  • An RDBMS is schema on load, i.e., before the data is loaded the schema has to be in place and the data has to match it; if the data comes in a format other than what the schema specifies, the table will not be loaded. Hive is schema on read: it creates the table, keeps the data inside it, and verifies the data only when we actually use it.
  • Record-level updates are possible in an RDBMS, i.e., we can easily perform insertion, deletion, indexing, etc. In Hive, record-level updates are not possible. This is because Hive runs on top of Hadoop and its datasets live on HDFS; the datasets are too big and are divided into smaller chunks spread across the cluster, so locating and updating an individual record is not practical. Hence Hive does not support record-level updates.
  • The main advantage of Hive is that it can handle huge datasets, i.e., terabytes, petabytes, and beyond, and scaling up is easy. In an RDBMS it is very difficult to handle data that large.
  • In an RDBMS we can read and write tables any number of times and work seamlessly; with smaller data and around 20 to 30 tables, we can do record-level updates as often as we like. In Hive we can read the tables and run analyses any number of times, but we write the data only once. This is because Hive does not have a storage engine of its own; it resides on top of Hadoop and relies on HDFS.
  • An RDBMS supports both OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing). Hive does not support OLTP: once the data is there, we cannot update or modify it. Hive supports OLAP, which means we can write SQL-style queries, get results, and view summaries, but updating the tables or the data is not possible.
  • An RDBMS is best suited to small datasets, dynamic analysis, and fast responses. Hive can process big data, but the analysis is static: it stores the data and does the computation step by step. Hive does not give a swift response; it takes its time over the analytics and then returns the desired results.
  • Hive has matured a lot over time; it solves many issues and can be a good solution for most analytical and reporting requirements.

Hive Architecture

Metastore

  • The component that stores the system catalog and metadata about tables, columns, partitions, etc.
  • Stored on a traditional RDBMS.

Driver component

  • Manages the lifecycle of a HiveQL statement as it moves through Hive.
  • The driver also maintains a session handle and any session statistics.
  • Its main parts are:
    • Query Compiler
    • Optimizer
    • Execution Engine

Command Line Interface

  • The command line interface is where we write our code and type in the commands.

Hive Basic Vocabulary

Databases, Tables and more

  • Databases: contain tables, views, partitions, etc.
  • Tables: data records with the same schema, for example a quarterly sales data table.

Data Types

  • Integers: TINYINT, SMALLINT, INT, BIGINT
  • Boolean: BOOLEAN
  • Floating point numbers: FLOAT, DOUBLE
  • String: STRING
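
Putting the vocabulary together, a table definition exercising these types might look like the sketch below; the database and table names are hypothetical:

    -- Create a database and a table that uses the basic Hive types.
    CREATE DATABASE IF NOT EXISTS demo_db;
    USE demo_db;
    CREATE TABLE products(
        id       BIGINT,
        in_stock BOOLEAN,
        price    DOUBLE,
        name     STRING
    );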

Hive Query Language: LAB

Before starting Hive, we have to start the Hadoop services, as Hive runs on top of Hadoop.

    start-all.sh
    jps

Start Hive

    hive

This will load all the libraries required by Hive and then start it.

Once Hive has started, the prompt changes to “hive>”.

If we want to view the tables that are already there, we just have to type the following command:

    show tables;

This will show the complete list of tables present.

We can also create tables using the Create table command.

    CREATE TABLE stack_overflow_tags(id BIGINT, title STRING, body STRING, tag1 STRING, tag2 STRING, tag3 STRING, tag4 STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\';

The name of the table is “stack_overflow_tags”. Its 1st column is id, 2nd column is title, 3rd column is body, 4th column is tag1, 5th column is tag2, 6th column is tag3, and 7th column is tag4.

This file consists of discussions of certain topics. The title column holds the question or related topic, the body field holds the description for that title, and the tag columns hold all the tags related to that topic or discussion.
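
To double-check the schema Hive recorded for the new table, the standard describe command can be used:

    -- Print the column names and types of the table.
    DESCRIBE stack_overflow_tags;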

The table we just created is blank. We need to put some data into it. We already have the data on HDFS; we just need to push it into the table.

Make sure that the dataset is on HDFS

    dfs -ls /;

A blank table has been created with the schema. Now we fill the table with data.

    LOAD DATA INPATH '/stack_overflow_hdfs' INTO TABLE stack_overflow_tags;

This command loads the data from the HDFS file /stack_overflow_hdfs into the table stack_overflow_tags. Note that LOAD DATA INPATH moves the file into Hive’s warehouse directory rather than copying it.
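
As an aside, if the source file were still on the local file system instead of HDFS, the LOCAL keyword would copy it in directly; the local path below is hypothetical:

    -- Load from the local file system instead of HDFS.
    LOAD DATA LOCAL INPATH '/home/hduser/datasets/stack_overflow' INTO TABLE stack_overflow_tags;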

We can see the table using following commands:

    select * from stack_overflow_tags;

We can also view the table in any browser using the link below. This lets us see the data in the Hive data warehouse.

http://localhost:50070/explorer.html#/user/hive/warehouse/stack_overflow_tags

Select first few rows

    select * from stack_overflow_tags LIMIT 3;

This will print only the first 3 rows of the table.

Similarly, we can ask for any number of rows by changing the LIMIT value.
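
For instance, to fetch the first 10 rows instead:

    select * from stack_overflow_tags LIMIT 10;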

We can save the output to a new file which can be used further for some analysis.

    INSERT OVERWRITE LOCAL DIRECTORY '/home/hduser/Output/hive_out1' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select * from stack_overflow_tags LIMIT 3;

This will create an output file in the specified folder where the output will be saved.

We can have a look at the exported file. For this we have to open a new terminal, since Hive is already running in the current one.

    cat /home/hduser/Output/hive_out1/000000_0

This will open the output file.

Count(*) in Hive

A simple count query:

    select count(*) from stack_overflow_tags;

Count query gives us the number of records or the number of rows in the table.

Group by in Hive

This command can be executed from a regular (non-Hive) terminal:

    hive -e "select tag1, count(*) as tag1_count from stack_overflow_tags group by tag1 order by tag1_count DESC" > /home/hduser/Output/hive_out_tag1.txt

The output is too large for the screen, so we give an output file path and save the output to a file.

This query finds the most discussed topics. We count the occurrences of each tag; the tag with the highest count is the most discussed topic. The results are printed in descending order of count.

We can have a look at the exported file in a new terminal.

    cat /home/hduser/Output/hive_out_tag1.txt

This will give us the required output in a file.

Word count in Hive

This command can be executed from a regular (non-Hive) terminal:

    hive -e "SELECT word, count(1) AS count FROM (SELECT explode(split(tag1, 's')) AS word FROM stack_overflow_tags) w GROUP BY word ORDER BY count DESC" > /home/hduser/Output/hive_out_wordcount.txt

This command takes the tag1 column, splits each value into separate words on spaces, and then finds the count of each word.

Again we can have a look at the exported file in a new (non-hive) terminal

    cat /home/hduser/Output/hive_out_wordcount.txt

Executing a script from a file

For some analyses, we may need to run many queries in Hive. It would be complicated and time consuming to run each query separately.

In SQL we can save all the queries in a .sql file and then call the file for execution.

Similarly, in Hive we can create a .hql file in which we put all the queries together and run them in one go.

Create a Hive query script file and fill it with queries:

    mkdir /home/hduser/codes/   

    gedit /home/hduser/codes/my_query.hql

This will open a blank .hql file.

Now we fill that file with some queries, for example:

    select * from stack_overflow_tags LIMIT 3;

We write this query inside the .hql file, save it and close it.

Use the source command to execute it from within Hive:

    hive
    source /home/hduser/codes/my_query.hql;

The above command will execute all the queries in the .hql file. We can add more queries to the script file.
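
Alternatively, the same script can be run without entering the Hive shell at all, using Hive’s -f flag from a regular terminal:

    hive -f /home/hduser/codes/my_query.hql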

Joins

Hive Joins

  • Even joins can be implemented using Hive.
  • But there are a few limitations: Hive supports only equality joins, i.e., conditions of the form a.key = b.key.
  • It doesn’t support joins such as a.key < b.key as of now.
  • Joins with non-equality conditions are very difficult to express as a map/reduce job; a supported equality join is sketched below.
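
A minimal sketch of the kind of equality join Hive does support, over two hypothetical tables a and b:

    -- Only equality conditions are allowed in the ON clause.
    SELECT a.*, b.* FROM a JOIN b ON (a.key = b.key);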

There are two tables: Online_Retail_Customer and Online_Retail_Invoice.

For this we must have the datasets in our local file system. Then we push both datasets onto HDFS:

    hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Customer.txt /Online_Retail_Customer

    
    hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Invoice.txt /Online_Retail_Invoice

    hadoop fs -ls /

Both datasets are now in HDFS.

Now we create tables to perform our join operation.

Create Table Schema for Online_Retail_Customer on Hive

    hive
    
    CREATE TABLE Tbl_Online_Retail_Customer(uniq_idc STRING, InvoiceDate STRING, UnitPrice INT, CustomerID INT, Country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\';

Then we fill the table with the data that we pushed onto HDFS:

    LOAD DATA INPATH '/Online_Retail_Customer' INTO TABLE Tbl_Online_Retail_Customer;

We can see the table using the link http://localhost:50070/explorer.html#/user/hive/warehouse

Then we create our 2nd table i.e., Create Table Schema for Online_Retail_Invoice on Hive

    CREATE TABLE Tbl_Online_Retail_Invoice(uniq_idi STRING, InvoiceNo STRING, StockCode STRING, Description STRING, Quantity INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\';

Fill the table with data

    LOAD DATA INPATH '/Online_Retail_Invoice' INTO TABLE Tbl_Online_Retail_Invoice;

Check the link to see the created tables.

http://localhost:50070/explorer.html#/user/hive/warehouse

Left join

In a left join, all the rows from the left table are considered, and all the matching data for those rows is found in the right-hand table. This gives the resultant left join table.

First we drop the table if it already exists.

    Drop table Tbl_left_join;


    CREATE TABLE Tbl_left_join as SELECT * FROM Tbl_Online_Retail_Customer t1 LEFT JOIN Tbl_Online_Retail_Invoice t2 ON t1.uniq_idc = t2.uniq_idi;

Here we are creating the table “Tbl_left_join” for our left join, using the two tables Tbl_Online_Retail_Customer and Tbl_Online_Retail_Invoice. The join is done on the unique id that is common to both tables.

We can check how many rows there are in the resultant table and in the source tables.

    select count(*) from Tbl_left_join;

This gives us the line count of the resultant left join table.

    select count(*) from Tbl_Online_Retail_Customer;

This will give the line count for the “Tbl_Online_Retail_Customer” table.

    select count(*) from Tbl_Online_Retail_Invoice;

This will give the line count for “Tbl_Online_Retail_Invoice” table.

Right join

A right join is similar to a left join, but here everything from the right-hand table is considered, and rows from the left-hand table are taken based on matches with the right-hand table.

Drop the table if it already exists.

    Drop table Tbl_right_join;

    CREATE TABLE Tbl_right_join as SELECT * FROM Tbl_Online_Retail_Customer t1 RIGHT JOIN Tbl_Online_Retail_Invoice t2 ON t1.uniq_idc = t2.uniq_idi;

This will be the resultant table for right join.

How many rows in the resultant table?

    select count(*) from Tbl_right_join;

This will give the line count of the resultant table.

    select count(*) from Tbl_Online_Retail_Customer;

This will give the line count for the “Tbl_Online_Retail_Customer” table.

    select count(*) from Tbl_Online_Retail_Invoice;

This will give the line count for “Tbl_Online_Retail_Invoice” table.

Inner Join

In an inner join, only the rows that match in both tables end up in the resultant table.

Drop the table if it already exists.

    Drop table Tbl_inner_join;

    CREATE TABLE Tbl_inner_join as SELECT * FROM Tbl_Online_Retail_Customer t1 JOIN Tbl_Online_Retail_Invoice t2 ON t1.uniq_idc = t2.uniq_idi;

This is our resultant inner join table.

How many rows in the resultant table?

    select count(*) from Tbl_inner_join;

This is to see the line count of the resultant table.

    select count(*) from Tbl_Online_Retail_Customer;
    select count(*) from Tbl_Online_Retail_Invoice;

Full join

In a full (outer) join, all the rows from both tables are combined into one resultant table, matched up wherever the keys agree.

    Drop table Tbl_Full_outer_join;

    CREATE TABLE Tbl_Full_outer_join as SELECT * FROM Tbl_Online_Retail_Customer t1 FULL JOIN Tbl_Online_Retail_Invoice t2 ON t1.uniq_idc = t2.uniq_idi;

How many rows in the resultant table?

    select count(*) from Tbl_Full_outer_join;
    select count(*) from Tbl_Online_Retail_Customer;
    select count(*) from Tbl_Online_Retail_Invoice;

Further Reading

  • Hive has matured a lot over time; it solves many issues and can be a good solution for most analytical and reporting requirements.
  • That is why Hive was developed as an independent tool within the Hadoop ecosystem.
  • Hive also has further topics worth exploring (a small partitioning sketch follows this list):
    • Alter Table in Hive
    • Table Partitions
    • Hive Views
    • Hive Indexes
    • Hive Thrift Server
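
As a teaser for the partitioning topic, a partitioned table declaration looks like the hedged sketch below; the table and columns are hypothetical:

    -- Rows are stored in one directory per sale_year, which lets Hive
    -- skip irrelevant partitions when a query filters on that column.
    CREATE TABLE sales_by_year(item STRING, amount DOUBLE)
    PARTITIONED BY (sale_year INT);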

Conclusion

  • We discussed the basics of Hive.
  • We learned some useful queries.
  • Hive is one of the most widely used tools for reporting inside the big data ecosystem.
  • This session is just an introduction to Hive.
  • Hive has many more features that were not discussed in this session.

 

 
