Pig
In this session we will learn the basics of Pig: what Pig is, the Pig architecture, Pig Latin scripts, basic Pig operations, loading data into Pig, group by, filtering, sorting, functions in Pig, joins in Pig, and storing and exporting data out of Pig. In short, this tutorial covers a detailed study of Pig and its place in the Hadoop ecosystem. It will teach you how to use the Pig tool within the Hadoop ecosystem; after learning the basics of Pig, you can move on to more advanced operations.
Contents
- What is Pig
- Pig Architecture
- Pig Latin
- Pig Latin basic Operations
- Loading the data
- Group by
- Filtering
- Sorting
- Functions
- Joins
- Storing data
What is Apache Pig
As we already know, MapReduce has some issues. First, one needs to be an expert in the Java programming language to write MapReduce code efficiently. Second, every problem has to be converted into the MapReduce framework; it is not like a normal program where one can simply write code in the traditional way. Every program needs to be expressed as a Map phase, whose output is then passed to a Reduce phase. This makes MapReduce programs tough to write. Since data scientists and analysts are not expected to know Java like professional Java developers, we need tools within the Hadoop ecosystem that run MapReduce jobs without actually writing Java code. Hive does this by converting SQL-like queries into MapReduce code; similarly, there is another tool called Pig. Pig is a high-level scripting language: code written in Pig Latin script is internally converted into MapReduce tasks. For a big data analyst, Pig is very useful, as one can write Pig Latin scripts and let Pig translate them into MapReduce jobs. This is why it is called "MapReduce made easy".
Map Reduce Made easy
- The traditional approach uses Java MapReduce programs for structured, semi-structured, and unstructured data.
- Executing MapReduce this way is not easy.
- Pig is a high-level scripting language that avoids the complexities of MapReduce Java coding.
- Programmers write scripts using the Pig Latin language.
- All these scripts are internally converted to Map and Reduce tasks.
What is Pig
In short, Pig is a simple scripting tool and a powerful alternative to MapReduce. Apache Pig is an abstraction over MapReduce. Pig works very well for certain classes of problems, such as web log analysis and text mining. Pig can also handle datasets that are unstructured or semi-structured, unlike Hive, which fails if the dataset is not in a proper structured format.
- In simple terms, Pig is a simple scripting tool and powerful alternative to MapReduce
- Apache Pig is an abstraction over MapReduce.
- Hive is good but has a lot of limitations: user-defined functions, built-in functions, and ad-hoc analysis are not flexible in Hive.
Pig Applications
The applications of Pig are as follows:
- Web log processing.
- Data processing for web search platforms.
- Ad hoc queries across large data sets.
Pig Latin
In this session we will discuss Pig in detail, along with Pig Latin and how to write Pig Latin scripts. Both Hive and Pig have their own advantages and disadvantages: for some types of problems Hive is better, and for other classes of problems Pig is better, so there is no competition between the two. A data scientist or analyst should decide which tool is better for the goal at hand, as both Hive and Pig are used to achieve the desired results, only with different approaches. To interact with Pig we need to learn a new language called Pig Latin, which is very simple and has a limited number of commands and operators; the syntax is simple too, so not much time needs to be spent learning it.
So now we will learn about the Pig Latin script, which is necessary to interact with Pig. To write data analysis programs, Pig provides a high-level language known as Pig Latin. As mentioned before, Pig Latin has very few keywords and operators and is very simple to learn. There are many built-in operations for joins, filtering, and ordering; we just need to call the right operator for the right task. Pig Latin also provides nested data types such as tuples, bags, and maps, which are missing from MapReduce. What exactly are nested data types? For example, a bag consists of tuples, and a map consists of key-value pairs, so each is a grouping of the others; the use of nested data types will become clearer once we start writing Pig Latin scripts. Pig also allows us to write user-defined functions: we can write our own functions for reading, writing, processing, or reporting, and then use them in Pig, where they will internally be converted into MapReduce code. This is a really powerful feature of Pig and also serves our business purposes.
- To write data analysis programs, Pig provides a high-level language known as Pig Latin
- Pig Latin syntax is very simple and intuitive.
- Built-in operators like joins, filters, ordering
- Provides nested data types like tuples, bags, and maps that are missing from MapReduce.
- User-defined functions: Pig Latin makes it easy to develop your own functions for reading, writing, and processing data.
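As a sketch of how a user-defined function can be plugged in, the snippet below registers a hypothetical Jython file my_udfs.py (assumed to define a function to_upper annotated with @outputSchema('word:chararray')) and applies it in a script; the file name, function, and HDFS path are illustrative assumptions, not part of this tutorial's labs.

```
-- Assumes my_udfs.py defines to_upper(word), a Jython UDF
-- annotated with @outputSchema('word:chararray')
REGISTER 'my_udfs.py' USING jython AS myfuncs;
words = LOAD '/words_hdfs' USING PigStorage('\t') AS (word:chararray);
upper_words = FOREACH words GENERATE myfuncs.to_upper(word);
DUMP upper_words;
```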
Pig Architecture
Let us now look at the Pig architecture. From the diagram we can see that Pig sits on top of Hadoop. Within Pig there is a parser, an optimiser, a compiler, and an execution engine. Pig offers us the Grunt shell, which is simply a command-line interface where Pig Latin scripts can be written. A Pig Latin script is first passed to the parser, which checks it for syntax errors; the code is then optimised and compiled, and finally the execution engine sends it to Hadoop by converting it into MapReduce jobs. These jobs run against the data stored in HDFS, where the data analysis and computation are executed, and the results are fetched back and returned as output in Pig.
Pig Data Types
Now we will look into the Pig data types. What are the basic data types in Pig Latin? The simple data types are int, long, float, double, chararray, bytearray, boolean, datetime, and biginteger.

Apart from these, Pig has some unique data types. The first is called the Atom, which is any single value in Pig Latin irrespective of its data type: whether it is an integer, a float, or anything else, it is known as an atom. It is similar to a dynamically typed variable in other programming languages.

The next is the tuple. A tuple is like a row in a generic table or data table; it consists of an ordered set of fields. An example of a tuple is (Mobile, 200), where the first field is the item name and the second field is the item cost; in short, it means the mobile's price is 200. A tuple is written within parentheses (), so anything in this format is a tuple, and we can think of it as a single row within a table.

Then there is one more data type called the bag. What is a bag? A bag is an unordered set of tuples, represented by "{}". It is similar to a table, but it is not necessary that every tuple in the bag contain the same number of fields, or that the fields in the same position (column) have the same type. For example, the first row can have 20 columns, the second row 25 columns, and the third row just 4 columns. A bag is simply a collection of tuples. A bag is written as {(Mobile, 200), (PC, 600)}, meaning the mobile's price is 200 and the PC's price is 600. Why do we need to learn about bags and tuples? Because Pig Latin scripts use these data types for data operations, and they make analysis easier to handle on any type of dataset.

The next data type is the map. A map is a key-value pair data type: there is a key and there is a value. The key needs to be of type chararray, meaning it should be a character string, and it has to be unique. The map data type is represented by []. An example of a map is ['Age'#30], where 'Age' is the key and 30 is the value. A map with two keys is ['Item'#'Mobile', 'quantity'#200], where 'Item' is the key and 'Mobile' its value, and 'quantity' is the key and 200 its value. In short, a map is just a set of key-value pairs.
Simple Data Types:
- int, long, float, double, chararray, bytearray, boolean, datetime and biginteger

Atom:
- Any single value in Pig Latin, irrespective of its data type, is known as an Atom.
- Like a variable in other languages

Tuple:
- A record that is formed by an ordered set of fields.
- Similar to a row in a table
- Represented by ()
- Eg: (Mobile, 200)
Bag:
- A bag is an unordered set of tuples.
- Represented by “{}”.
- It is similar to a table but it is not necessary that every tuple contain the same number of fields or that the fields in the same position (column) have the same type.
- Eg: {(Mobile, 200), (PC, 600)}

Map:
- Nothing but Key Value pairs
- The key needs to be of type chararray and should be unique.
- The value can be of any type.
- Represented by “[]”
- Eg:
- Map with one key: [‘Age’#30]
- Map with two keys : [‘Item’#‘Mobile’, ‘quantity’#200]
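To see how these types appear together in practice, here is a small sketch of a LOAD statement whose schema declares a tuple, a bag, and a map field. The HDFS path /complex_data_hdfs and the field names are hypothetical, purely for illustration.

```
-- Hypothetical data file with a tuple, a bag, and a map column
items = LOAD '/complex_data_hdfs' USING PigStorage('\t') AS (
    item:tuple(name:chararray, price:int),   -- e.g. (Mobile,200)
    orders:bag{t:tuple(id:chararray)},       -- e.g. {(A1),(A2)}
    attrs:map[]                              -- e.g. [colour#black]
);
DESCRIBE items;
```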
Relation
Now we will look at the relation, which is similar to a bag, or in some cases similar to a table. We can say that a relation is a bag, a bag contains tuples, tuples contain fields, and a field is a single piece of data. From the diagram we can see that the relation is the outermost structure of the Pig Latin data model. A relation can have multiple bags. As soon as we import a dataset, the data is converted into tuples, bags, and relations. So Pig handles data slightly differently.
- Relation is a Bag(Similar to table in some cases)
- Bag contains tuples
- Tuples contain fields
- A field is a simple data
- Relation is the outermost structure of the Pig Latin data model
Pig Latin Basic Operations
Pig Latin Script
Before writing Pig Latin scripts, some important notes should be taken into consideration. First, Pig is case sensitive for certain things: keywords in Pig Latin are not case sensitive, but function names and relation names are case sensitive. If we define a relation with a mix of upper and lower case, that name is case sensitive, but keywords such as LOAD and STORE are not. Second, there are two commenting styles: we can use SQL-style single-line comments or Java-style multiline comments.
- Pig Latin is the language used to analyse data in Hadoop using Apache Pig.
- Case Sensitivity
- Keywords in Pig Latin are not case-sensitive but Function names and relation names are case sensitive
- Comments
- Two types of comments
- SQL-style single-line comments (--)
- Java-style multiline comments (/* */).
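Both commenting styles and the case-sensitivity rule can be sketched in one short snippet; the relation name sales and the path /sales_hdfs are made up for illustration.

```
-- This is a SQL-style single-line comment
/* This is a
   Java-style multiline comment */
sales = LOAD '/sales_hdfs' USING PigStorage('\t');
-- 'load' would work just as well, since keywords are not case sensitive,
-- but 'SALES' would be a different relation than 'sales'.
```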
Loading the Data into Pig
Let's start working with Pig. As we discussed earlier, Pig is built upon Hadoop (Pig sits above Hadoop), so we need to start Hadoop before we start Pig. To start Hadoop we run the command "start-all.sh", which starts the Hadoop services. Starting Hadoop first is necessary because Pig Latin scripts are internally converted into MapReduce code, and MapReduce runs on top of Hadoop. Once we can see that Hadoop is up and running, we can start Pig: `pig` is the command to start it, and it opens the **grunt** shell.
- Starting pig
pig
Let's work on the dataset Online_Retail_Invoice.txt. First we have to push the data into HDFS, and then from HDFS we have to move the data inside Pig, because Pig is part of the Hadoop ecosystem and works with HDFS.
- Now open a new terminal and type the below command to access the hduser.

su - hduser

- Push the data onto HDFS using the copyFromLocal statement.

hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Invoice.txt /Retail_invoice_hdfs

- Now come back to the pig terminal.
- Push the HDFS data onto pig by using LOAD statement.
- We need to push the data into a relation. Remember relation is a bag.
- We need to use PigStorage('\t') here. '\t' is for tab-delimited data.
Retail_invoice_pig = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');
Let's understand the code in detail. Retail_invoice_pig is the relation name; it is like the dataset name, data file name, or table name. LOAD is a keyword used to bring HDFS data inside Pig, and /Retail_invoice_hdfs is the location of the HDFS file we want to load. USING PigStorage() is also a keyword; here it means the data is tab delimited, and the tuples will be created based on the delimiter mentioned in this storage function. There are several load functions available, such as BinStorage, JsonLoader, PigStorage, and TextLoader; in this particular example we are using PigStorage with a tab delimiter. Once we run the above command it may show a warning that the command is deprecated, which is okay.
LOAD Statement
Accessing the Data on Pig- Dump
Now the dataset is inside Pig and we can use a dump statement, DUMP Retail_invoice_pig;, which is a kind of print statement. Inside Pig the relation name is Retail_invoice_pig; we don't call it a data set or data table, it is called a relation. So let's run the dump command. Once we run it, it starts calling some Java libraries, and finally we can see it printing the data that was inside the retail invoice file. Being a huge dataset, it takes time to print all the rows. By now we can understand that DUMP is not a good option for printing a dataset that is too large, because the dump command consumes a considerable amount of time printing the whole dataset. So the DUMP command is not recommended when the dataset is large.
- Dump operator simply prints the relation/output/result in the console
- Dump is NOT a good idea if your target dataset size is large
Let's have a look at the data on Pig
DUMP Retail_invoice_pig;
Validating the data on Pig- Describe
Since the DUMP command is not recommended when the dataset is large, we can use Describe instead, which is another keyword inside Pig. Let's try to run it; the command is Describe Retail_invoice_pig; As soon as we run the describe command we get an error message saying "Schema for retail_invoice_pig is unknown". We therefore have to load the data along with a schema for the Describe command to work.
- Displays the schema of a relation (table)
- While loading the data into Pig you need to define a schema. If you don't define a schema for a relation, Describe reports that the schema is unknown.
Describe Retail_invoice_pig;
Loading the Data with Schema
- Let's load this table using a schema.
Retail_invoice_pig1 = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t') as
(uniq_idi:chararray, InvoiceNo:chararray, StockCode:chararray, Description:chararray, Quantity:int);
Let's try to understand the command. Retail_invoice_pig1 is the new relation; it is like a new table inside Pig. The LOAD statement tells the location the data should be loaded from. USING is a keyword that takes care of the delimiter used in the dataset. The next part of the command describes the schema: the first field is uniq_idi of type chararray, the second is InvoiceNo of type chararray, the third is StockCode of type chararray, the fourth is Description of type chararray, and the last is Quantity of type int. So we are loading the data again, into Retail_invoice_pig1, but this time along with a schema.
By running the Describe Retail_invoice_pig1; command, we get a short description of the relation, which consists of the column names and the structure of each column.
Describe Retail_invoice_pig1;
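With the schema defined, Describe prints the relation's schema on the console, in a form roughly like the following (the exact spacing may differ between Pig versions):

```
Retail_invoice_pig1: {uniq_idi: chararray,InvoiceNo: chararray,StockCode: chararray,Description: chararray,Quantity: int}
```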
By running DUMP Retail_invoice_pig1; it starts to print the data; again, dump is not a recommended command if the dataset consists of a large number of rows.
DUMP Retail_invoice_pig1;
Validating the data on Pig- LIMIT DUMP
- Dump is a bad idea if the data size is large
- We can use LIMIT statement to dump fewer tuples(records)
head_Retail_invoice_pig1 = LIMIT Retail_invoice_pig1 10;
DUMP head_Retail_invoice_pig1;
Group by in Pig
In this section we will look at grouping in Pig. Grouping is very important because most Pig functions take a bag as the input parameter.
Grouping before using functions
- Most of the inbuilt functions in pig take the bag as input.
- We need to group the data and relevant bags before applying functions.
LAB – Group by in Pig
Grouping helps us get summary statistics. Grouping is similar to the SQL GROUP BY. Let us see how to group inside Pig. Now we will work with an example of online retail customer data. First we need to move the online retail customer data into HDFS, so go back to the hduser terminal.
- Get the Online Customer data on to HDFS
- Get the above dataset into PIG
- Dump first 10 tuples
- Group by country
- Show the group by schema
- Dump the first two bags
Get the Online Customer data on to HDFS
The below command moves the online retail customer data from the local file system to the Hadoop file system, where it is renamed Retail_Customer_hdfs. Just note that for hadoop commands you should switch to the hduser shell, and for pig commands switch to the pig shell.
hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Customer.txt /Retail_Customer_hdfs
Get the above dataset into PIG
Now we can go back to the pig shell and import the data from HDFS. The command for importing the data from HDFS into Pig is:
Retail_Customer_pig = LOAD '/Retail_Customer_hdfs' USING PigStorage('\t') as (uniq_idc:chararray, InvoiceDate:chararray, UnitPrice:int, CustomerID:chararray, Country:chararray);
Retail_Customer_pig is the relation name inside Pig. The LOAD command defines the source location of the dataset; for this example it loads /Retail_Customer_hdfs into Retail_Customer_pig, and the rest of the command describes the table schema.
Dump first 10 tuples
We can use the dump command to print the data, but once again, the dataset being very large, dump is not the efficient way; there is a better way to look at the data, which is by using the LIMIT command.
The command for LIMIT is :
head_Retail_Customer_pig = LIMIT Retail_Customer_pig 10;
Dump head_Retail_Customer_pig;
Pig being a data flow language, the dataset created by the LIMIT command needs to be stored somewhere; for that reason we have created a new relation called head_Retail_Customer_pig, which can then be printed using the dump command. Let's take a detailed look at the LIMIT command. Retail_Customer_pig is the name of the data table, or relation, inside Pig, and LIMIT with the number 10 means it will choose the first 10 observations from that relation. Those 10 observations are stored inside the new relation head_Retail_Customer_pig; the new relation is first created and then the data is stored in it, which is why Pig is called a "data flow language". Now we can use the dump operator on the head_Retail_Customer_pig relation, and it will display the first few rows.
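The remaining lab steps listed above (group by country, show the group-by schema, and dump the first two bags) can be sketched as follows, reusing the same relation names; treat this as an outline of the commands rather than verified lab output.

```
-- Group the customer relation by the Country field
group_country = GROUP Retail_Customer_pig BY Country;

-- Show the schema: each tuple is (group, bag of matching customer tuples)
DESCRIBE group_country;

-- Dump only the first two bags, since DUMP on the full result is costly
two_groups = LIMIT group_country 2;
DUMP two_groups;
```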


