Map Reduce
In this section we are going to learn the basics of MapReduce. MapReduce is one of the biggest components of Hadoop: it is the programming model we use to actually solve big data problems. We will understand what MapReduce is, what the MapReduce algorithm looks like, how to write MapReduce programs, and we will also work through some examples of writing MapReduce programs.
Contents
- What is MapReduce
- Map Reduce Algorithm
- Map Reduce Program
- Word Count map reduce example
MapReduce Programming
Map Reduce Model
We already discussed distributed computing in the last session, in the introduction to big data and distributed computing. If we have a really large data set, we divide the whole problem into smaller pieces, solve those smaller problems, and then reach the final objective by combining their intermediate outputs. Concretely, we divide the whole data set into smaller pieces, store them on local systems, and on each of these pieces we run the Map code, that is, the map function is applied. The output of map is then given as input to the reducer, and the reducer gives us the final output. So to achieve the final objective we use a two-stage Map and Reduce method of computation. Solving the overall problem or objective with map and reduce is called the MapReduce programming model.
- Distributed Computing for distributed data
- Achieve the final objective by two stage Map and Reduce method of computation
Map Reduce
MapReduce is inspired by functional programming; MapReduce was not the starting point, there was something called functional programming, from which MapReduce was derived. Map takes as input the actual data, or the smaller pieces of the actual data, and generates intermediate key-value pairs. The reducer then takes all the intermediate key-value pairs, merges all the intermediate values from the map, and generates the final output. So the objective is divided into a map step and a reduce step. By using the MapReduce model, it is very easy for us to scale data processing over multiple computers. We will see more details of MapReduce: how exactly MapReduce works, how to write MapReduce code, and so on.
- Programming style is inspired by functional programming languages (e.g., Lisp)
- Map()
- Process a KV pair (key-value pair) to generate intermediate key-value pairs
- Reduce()
- Merge all intermediate values associated with the same key to generate final result
- Using MapReduce model it is easy to scale data processing over multiple computing nodes.
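As a quick illustration of this functional style (a minimal, self-contained plain-Java sketch of our own, not Hadoop code), map transforms every element and reduce merges the mapped values into one result:
import java.util.Arrays;
import java.util.List;
public class FunctionalStyleExample {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("big", "data", "with", "hadoop");
        int totalLength = words.stream()
                .map(String::length)      // map: each word -> its length
                .reduce(0, Integer::sum); // reduce: merge all lengths into one sum
        System.out.println("Total characters: " + totalLength); // prints 17
    }
}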
How MapReduce Works
To understand the whole process of MapReduce we will take an example; it is easier to understand MapReduce through the steps of a concrete example than as an abstract programming model. Let us take the word count example. The reason we discuss the word count program here is that it is very easy to understand the MapReduce methodology with this example of counting the words that appear in a huge data set. If the data set file is huge and we want to count the number of words appearing in it, how do we do that? The data set can be anything. Say the data set is tweets: suppose there is a new product launch, maybe an iPhone or some other product, and around the launch there are many tweets and social media discussions about that product. The final data will be very large, and we want to count the words and the frequency of each word. From those words we might even find out which features are being discussed, and so on. Another example: if we take any current topic like an election or sports, people will discuss it a lot, and if we gather all that data we can find the most frequent word, the most discussed topic, and so on. How to achieve such objectives is what we are going to discuss.
- We will understand MapReduce with the help of an example
- Word count example
- Imagine a huge data set that contains user comments and discussions about a laptop
- We want to count the frequency of words to get an idea on the popular features
Map Reduce Word Count Example
Map is a function and reduce is a function. The map function takes the input, a key and a value, and gives the intermediate output. The reduce function takes the output of the map function as its input and gives the final output. So what exactly does map take as input, and what exactly does map give as output? If we want to count the total number of words in a given data file, or the frequency of each word, how do we do that? To divide the whole program into a MapReduce program, we take a particular document; what map does is go line by line, and for every word it finds, it assigns that word the digit "1", which denotes one occurrence of that word. The map function simply takes every word in the input document and emits the word together with "1", which is a key-value pair. It gives key-value pairs as output; this is what map does. The reducer takes the output of map as its input, i.e., the key-value pairs generated by the map function. The reducer then takes each word with its list of ones and sums them up to find the number of occurrences.
The Map Function input and output
First let us discuss the map function. Map takes a key-value pair as input and gives key-value pairs as output. Let us dig deeper into the map function. Map's input is a key-value pair. What is the key? The text document name, whatever the name of the document is; it can be anything, we can name it anything, it is not that significant. The value is the contents of the text document. So the map function takes the document name as the key and whatever is inside the document as the value, and then runs its logic. What output does map give? It gives each word, with the value 1 assigned to it. So the output of the map function is a key, which is the actual word, and the value associated with that key, which is 1. The map function takes a key-value pair as input and gives key-value pairs as output, but these two kinds of key-value pairs are different. In the input key-value pair, the key is the document name and the value is the contents of the document. In the output, the key is a word and the value is "1", since 1 is assigned to every word; the output therefore contains many key-value pairs.
- Input
- Map function takes a key and value pair as input
- Key = Text document name
- Value = the contents of text document
- Output
- Map function gives key value pairs as output
- Key = word
- Value = 1
Map Reduce Word Count Example
This is the function we are defining. The key and the value are the parameters: the key is the document name and the value is the text of the document. For each word "w" in the value, i.e., for each word in the text of the document, we emit(w, 1), that is, we emit the tuple (w, 1). So the map function takes a key-value pair as input, where the key is the text document name and the value is its contents, and the output is each word paired with 1.
How Map Function Works
Let us say this is the input: "Laptop screen issue", "Laptop screen problem", "Screen size issue". Suppose this is the text inside the text document; then what does the map function do? The map function takes a key and a value as input, where the key is laptop.txt, the text file name, and the value is the review text, i.e., whatever is in these three lines. For each word "w" in it, it will emit (w, 1).
- If we apply this map function to the data given above, the output will be as shown below.
If we take the first line and pass it through the map function, we get as output each word followed by 1: (Laptop, 1), (Screen, 1), (Issue, 1). This is what map does; the same is done for the second and the third lines.
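Putting all three lines together, the map function emits the following intermediate (word, 1) pairs, one per word:
(Laptop, 1), (Screen, 1), (Issue, 1)
(Laptop, 1), (Screen, 1), (Problem, 1)
(Screen, 1), (Size, 1), (Issue, 1)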
This is the output of the map function for the given data.
Always remember, the map function output is not the final output. The map function gives us an intermediate output; it will be taken as the input of the reduce function. This is the intermediate output after the execution of the map function.
The Reduce Function input and output
The reduce function takes each word and its frequency, or rather each word and its list of ones, as input; for output it simply sums up all these ones. How does the reducer work? For the reduce function, the input parameters are a key and a set of values: the key is a word and the values are an iterator over the counts. Basically, the reducer takes each word, starts a count, and does result = result + 1 for every 1 in the list, like count = count + 1. Finally, it emits each word with the sum of the number of 1's, that is, the total count: the word frequency, the word count.
- Input
- Reduce function takes a key and set of value pair as input
- Key=word
- Value = list of all counts for that word
- Output
- Reduce function gives key value pairs as output
- Key=word
- Value=sum of counts
How Reduce Function Works
Now the reduce function: what exactly does reduce take? It takes the output of the map function as its input, so the reduce function takes a key and values as input. The key is each word and the values are the list of counts, i.e., how many times that word has appeared. If "laptop" appeared twice, reduce will take laptop with its list of ones, [1, 1], as input. The map output was simply each word and a 1 for each occurrence.
We shuffle the key-value pairs because some of the keys are the same; for example, laptop appears multiple times. So we shuffle and group them to prepare the input for reduce. Some keys end up with more than one value.
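After shuffling, the grouped key-value pairs for this example look like:
(Laptop, [1, 1])
(Screen, [1, 1, 1])
(Issue, [1, 1])
(Problem, [1])
(Size, [1])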
Here screen appears thrice, so these are the shuffled key-value pairs. The reduce function then just takes each key and its values as input, sums all the 1's of each word, and gives us the final output.
The reducer function takes these shuffled key-value pairs as input and produces the final output. If we pass this data through the reduce function, we find that laptop has appeared twice, and so on. The reduce function simply gives each word with the sum of its counts, the number of 1's, i.e., the number of times that word appears, as the output.
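For this example, the final output of the reduce function is:
Laptop 2
Screen 3
Issue 2
Problem 1
Size 1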
Shuffle & Reduce Stage
What is the reduce function doing? Reduce is not just reduce; it is actually shuffle and reduce, so the reduce stage is a combination of a shuffle stage and a reduce stage. What is shuffling? Some of the keys appear more than once, so we shuffle and group them so that we have each unique key together with its corresponding values. The reducer's job is then to take the output from the map and give the final output. Once the data is processed, the final output is written to HDFS.
- Reduce stage is the combination of the Shuffle stage and the Reduce stage.
- The Reducer’s job is to process the data that comes from the mapper.
- After processing, it produces a new set of output, which will be stored in the HDFS
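To make the map, shuffle and reduce stages concrete, here is a minimal plain-Java simulation of the word count flow on the same three lines. This is only an illustrative sketch, not Hadoop code; the class and variable names are our own, and words are lowercased so that "Screen" and "screen" are counted together:
import java.util.*;
public class WordCountSimulation {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Laptop screen issue",
                "Laptop screen problem",
                "Screen size issue");
        // Map stage: emit (word, 1) for every word in every line
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        // Shuffle stage: group all values that belong to the same key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce stage: sum the list of 1's for each key
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}
Running it prints issue 2, laptop 2, problem 1, screen 3, size 1, which matches the reducer output described above.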
Scaling-up is easy in MapReduce
Why is MapReduce programming such an important factor in distributed computing? Scaling up is really easy in MapReduce. Once you write the map and reduce programs, no matter how big the data is, it is very easy to scale up.
How Map Function Works
Earlier we had a very small file with just three lines. Now imagine a huge input file.
We can divide the data and store it on a cluster of machines. As we saw, the moment you move a file onto HDFS it is divided into smaller blocks of 128 MB or 64 MB each and stored. Now we can apply MapReduce on each of these pieces simultaneously. Running a MapReduce program on one huge file takes a lot of time, but if we run MapReduce on the smaller pieces simultaneously, we save a lot of time.
Divide the data, put it on a cluster of machines, and then apply the same map function on each piece of the data. The map function takes the text as input and gives as output each word with the digit '1' associated with it. Applying the same map function in parallel, simultaneously on each of these pieces, saves a lot of run time. In the normal case, while a sequential program is processing the first piece of data, the last piece of data sits idle, whereas with MapReduce we can run map on all the data pieces simultaneously.
The map output is simply each word paired with '1', which acts as a key-value pair, and from those we get the shuffled pairs.
Then we can apply the same reduce function, without any changes, across all the data pieces. The reduce function takes the key-value pairs, i.e., each word and its list of ones, and finally gives as output the count, the sum of all those 1's, i.e., how many times laptop, screen, etc. appear.
We can apply the same map and reduce functions on a bigger data set, on a distributed data set; this is the distributed style of computing. The same map and reduce functions can be applied to the bigger data set and we get the result. That is one of the main advantages of MapReduce: scaling up is very easy. That is why Hadoop uses the MapReduce style of coding. We will see an example: we will take one data set, write the map and reduce programs, and submit them to Hadoop. First we take the data, distribute it, and then run the map and reduce functions on each of the pieces. Hadoop helps us do this very easily, and then we will see the output. This completes the basic introduction to MapReduce.
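As a small sketch of why scaling up is easy (again plain Java, not Hadoop; the splits here are simulated with an in-memory list), the same word-level map logic can be applied to every split in parallel and the per-split results merged:
import java.util.*;
import java.util.stream.*;
public class ParallelWordCountSketch {
    public static void main(String[] args) {
        // pretend each element is one split (block) of a much larger file
        List<String> splits = Arrays.asList(
                "Laptop screen issue Laptop screen problem",
                "Screen size issue");
        Map<String, Long> counts = splits.parallelStream()                          // process splits in parallel
                .flatMap(split -> Arrays.stream(split.toLowerCase().split("\\s+"))) // map: emit each word
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));     // shuffle + reduce: count per word
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
In Hadoop the splitting, scheduling and shuffling happen across machines automatically; this sketch only illustrates that the same map and reduce logic needs no change when the data is divided into pieces.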
LAB Map Reduce Program
LAB: Map Reduce for word count
Now we will work on a dataset called "user reviews". Move the dataset User_Reviews from the local file system to HDFS, then write a word count program to count the frequency of each word, and finally take the output in text format.
- Dataset User_Reviews/User_Reviews.txt
- Move the data to hdfs
- Write a word count program to count the frequency of each word
- Take the final output inside a text file
Solution
Check whether the Hadoop services are started or not; if not, type the following command as the hduser:
start-all.sh
Now check the Hadoop services by typing:
jps
- check your files on hdfs
hadoop fs -ls /
If the file is not present on HDFS, bring it onto HDFS using the following command.
- Bring the data onto hadoop HDFS
hadoop fs -copyFromLocal /home/hduser/datasets/User_Reviews/User_Reviews.txt /user_review_hdfs
Once again, check the availability of the file on HDFS after moving it from the local directory.
- Check the data file on HDFS
hadoop fs -ls /
Now we can see that the file is present on HDFS and we can proceed. Next we will write a MapReduce program for it, so we need to write the map and reduce code. Remember, Hadoop is built on Java, so we need to write MapReduce programs in Java; if you want to write MapReduce programs, you need to learn Java, because we will write Java-based MapReduce programs.
- check your current working directory
cd
We will be using a word count program which is already written. To do this, first we change the directory and go to the Hadoop bin folder.
- Goto hadoop bin
cd /usr/local/hadoop/bin/
It is important to make your PWD (present working directory) $hadoop/bin. Once you are here, you have to create WordCount.java. Open an editor with the file name WordCount.java:
sudo gedit WordCount.java
When you use sudo, you need to enter the system password that is used for logging in. This command opens a file in the gedit editor, and within it you need to write the word count program. A word count program has a mapper and a reducer: the map function emits a simple key-value pair and the reducer calculates the sum; that is how MapReduce works.
In this code the two functions, map and reduce, are the important parts; the rest of the code imports the libraries needed throughout the program. Copy the Java code below, paste it into your file and save the file.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To compile this program, use the below command
hadoop com.sun.tools.javac.Main WordCount.java
Create the jar file, named wc.jar:
jar cf wc.jar WordCount*.class
Run the word count program; the output will be written to the output directory given in the command. To run it, we give a command consisting of the word count jar file, the WordCount class, the input location on HDFS (user_review_hdfs, on which the word count program will run), and the output location where we will get the final result.
hadoop jar wc.jar WordCount /user_review_hdfs /usr/review_count_out
Have a look at the output
hadoop fs -cat /usr/review_count_out/part-r-00000
Part of the output
We can take the output into a text file so that we can open it later. For that we create a new directory:
mkdir /home/hduser/Output/
Now we copy the output into a text file named review_word_count.txt, saved in the Output folder as /home/hduser/Output/review_word_count.txt:
hadoop fs -cat /usr/review_count_out/part-r-00000 > /home/hduser/Output/review_word_count.txt
Since we wrote a simple word count, without taking care of spaces and punctuation, the output looks a little rough; we can modify the map and reduce programs to make it cleaner. This is just to give an idea of how MapReduce works. This was a small file; in our next exercise we will take a huge file and try to count the number of lines in it. It is a little difficult to open a large dataset in Excel, so we could load it into SQL to understand the data, and then, in the next exercise, we will find the number of lines in it.
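As a hedged sketch of such a modification (this mapper is our own illustration, not part of the lab code), the tokens could be lowercased and stripped of punctuation before being emitted. The class below would go inside WordCount.java, reusing its imports, and main would then use job.setMapperClass(NormalizingTokenizerMapper.class):
public static class NormalizingTokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      // normalize: lowercase and drop anything that is not a letter or digit
      String token = itr.nextToken().toLowerCase().replaceAll("[^a-z0-9]", "");
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, one);
      }
    }
  }
}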
LAB: Map Reduce for Line Count
Let us do an exercise with a larger data set. Consider the Stack Overflow tags data, which is almost 7 GB in size; with a block size of 128 MB, about 56 blocks will be required. We will write a MapReduce program to count the number of lines in that file. The data we will work with now is Stack Overflow data: it simply contains the questions, answers and tags.
- Dataset: Stack_Overflow_Tags/final_stack_data.zip
- Unzip the file and see the size of the data.
- The dataset contains some Stack Overflow questions. The goal is to find out the total number of questions in the file (number of rows).
- Move the data to hdfs.
- Write a MapReduce program to count the number of lines.
- Take the final output inside a text file.
Solution
Has Hadoop started?
jps
Start Hadoop if not started already
start-all.sh
Check whether Hadoop has started now by using the jps command.
jps
check your files on hdfs
hadoop fs -ls /
Dataset is /Stack_Overflow_Tags/final_stack_data.zip. Unzip the data first. Unzipping takes some time
sudo unzip /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.zip -d /home/hduser/datasets/Stack_Overflow_Tags/
Bring the data onto Hadoop HDFS. The file size is huge, so this step takes some time.
hadoop fs -copyFromLocal /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_overflow_hdfs
Check the data file on HDFS
hadoop fs -ls /
check your current working directory
cd
Goto hadoop bin
cd /usr/local/hadoop/bin/
It is important to make your PWD (present working directory) $hadoop/bin. Open an editor with the file name LineCount.java:
sudo gedit LineCount.java
Copy the Java code below, paste it into your file and save the file. The logic here is slightly different: the map function emits a count of 1 for each line, and the reduce function takes all these counts and sums them up. Here also we use the MapReduce concept to count the number of lines in the data set.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LineCount{
public static class LineCntMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
Text keyEmit = new Text("Total Lines");
private final static IntWritable one = new IntWritable(1);
public void map(LongWritable key, Text value, Context context){
try {
context.write(keyEmit, one);
}
catch (IOException e) {
e.printStackTrace();
System.exit(0);
}
catch (InterruptedException e) {
e.printStackTrace();
System.exit(0);
}
}
}
public static class LineCntReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context){
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
try {
context.write(key, new IntWritable(sum));
}
catch (IOException e) {
e.printStackTrace();
System.exit(0);
}
catch (InterruptedException e) {
e.printStackTrace();
System.exit(0);
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "line count2");
job.setJarByClass(LineCount.class);
job.setMapperClass(LineCntMapper.class);
job.setCombinerClass(LineCntReducer.class);
job.setReducerClass(LineCntReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
To compile this program, use the below command
hadoop com.sun.tools.javac.Main LineCount.java
Create the jar file, named lc.jar:
jar cf lc.jar LineCount*.class
Run the line count program; the output will be written to the output directory given in the command. MapReduce starts on each of the blocks in parallel: there are 56 splits, so the map is submitted to all 56 splits and a map task starts on each of them. It will take some time to run as it is a 7 GB file. The output of the map tasks goes as input to reduce, and the reducer works on it and gives the output: how many lines there are, i.e., how many Stack Overflow questions were asked in that particular month, which is the final objective we are trying to find. While it runs you can see how many percent of map and how many percent of reduce are completed. The output is generated and stored in /usr/stack_overflow_out; you can check the output from the browser as well.
hadoop jar lc.jar LineCount /stack_overflow_hdfs /usr/stack_overflow_out
Check the output here http://localhost:50070/explorer.html#/
After the execution, let us see the output.
hadoop fs -cat /usr/stack_overflow_out/part-r-00000
So there are around six million total lines, which is around 60 lakhs; around six million questions were asked in that particular month on the Stack Overflow website. We can take the output to a text file, /home/hduser/Output/stack_overflow_out.txt.
hadoop fs -cat /usr/stack_overflow_out/part-r-00000 > /home/hduser/Output/stack_overflow_out.txt
LAB: Map Reduce Function-Average
In this exercise, we will write a MapReduce program that involves some mathematics, such as averages, instead of a word or line count; here we will find an average. Consider stock price data: a stock price varies every second and there can be many fluctuations, so we have taken the data of a particular stock, its price over a period of time, and we simply want to find the average stock price. The usual way of finding the average is to load the data into one system and compute the average; if the data set is too large, we can write a map and reduce program instead. So we will find the average stock price using the MapReduce style of coding. Let us start the exercise.
- Dataset: Stock_Price_Data/stock_price.txt
- The dataset contains stock price data collected for every minute.
- Find the average stock price.
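Before looking at the Hadoop code, here is a minimal plain-Java sketch of the idea (the class name and the sample prices are made up for illustration): every price is emitted under one common key, so a single reducer can accumulate the running total and the count and divide the two.
import java.util.Arrays;
import java.util.List;
public class AverageSketch {
    public static void main(String[] args) {
        // made-up sample prices; in the lab every record of stock_price.txt contributes one value
        List<Double> prices = Arrays.asList(2000.5, 2001.25, 1999.75, 2002.0);
        double total = 0;
        int count = 0;
        for (double price : prices) {
            total += price; // like the reducer: total += var.get()
            count++;        // like the reducer: count++
        }
        System.out.println("Total: " + total + " :: Average: " + (total / count));
    }
}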
Solution
To check whether Hadoop is running, use the command jps:
jps
Start Hadoop if not started already
start-all.sh
Whether Hadoop has started can now be checked by:
jps
check your files on hdfs
hadoop fs -ls /
For the data set stock_price.txt, we are trying to find the average stock price, so first we will move the stock price data onto HDFS. The command for copying from local to HDFS is:
hadoop fs -copyFromLocal /home/hduser/datasets/Stock_Price_Data/stock_price.txt /stock_price
Once it is moved on to hdfs, we can check using hadoop fs -ls to find out whether the file is moved or not
hadoop fs -ls /
check your current working directory
cd
Goto hadoop bin
cd /usr/local/hadoop/bin/
For finding the average we change the directory to the Hadoop bin folder; if you are already inside bin, you do not need to change the directory again. Now we can create the Java file AvgMapred.java using gedit and open it. It is important to make your PWD (present working directory) $hadoop/bin.
Open an editor with a file name AvgMapred.java
sudo gedit AvgMapred.java
Copy the below java code, paste in your file and save your file
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class AvgMapred {
/*
* data schema(tab separated) : 2000.16325 4664654.78955 46513123.134165
*/
public static class MapperClass extends
Mapper<LongWritable, Text, Text, FloatWritable> {
public void map(LongWritable key, Text empRecord, Context con)
throws IOException, InterruptedException {
String[] word = empRecord.toString().split("\t"); // split the tab-separated record and take the first field as the price
String flg = " ";
try {
Float salary = Float.parseFloat(word[0]);
con.write(new Text(flg), new FloatWritable(salary));
} catch (Exception e) {
e.printStackTrace();
}
}
}
public static class ReducerClass extends
Reducer<Text, FloatWritable, Text, Text> {
public void reduce(Text key, Iterable<FloatWritable> valueList,
Context con) throws IOException, InterruptedException {
try {
Double total = (double) 0;
int count = 0;
for (FloatWritable var : valueList) {
total += var.get();
//System.out.println("reducer " + var.get());
count++;
}
Double avg = (double) total / count;
String out = "Total: " + total + " :: " + "Average: " + avg;
con.write(key, new Text(out));
} catch (Exception e) {
e.printStackTrace();
}
}
}
public static void main(String[] args) {
Configuration conf = new Configuration();
try {
Job job = Job.getInstance(conf, "FindAverageAndTotalSalary");
job.setJarByClass(AvgMapred.class);
job.setMapperClass(MapperClass.class);
job.setReducerClass(ReducerClass.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FloatWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
} catch (IOException e) {
e.printStackTrace();
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
The map function emits each stock price under a common key and gives that as input to the reduce function. The reducer sums all the values from the map function and divides by the count, which gives the average. To compile this program, use the command below:
hadoop com.sun.tools.javac.Main AvgMapred.java
Create the jar file, named avg.jar:
jar cf avg.jar AvgMapred*.class
Run the average program; the output will be written to the output directory given in the command.
hadoop jar avg.jar AvgMapred /stock_price /usr/stock_price_out1
Part of the output from the above command in the terminal looks like below:
Have a look at the output
hadoop fs -cat /usr/stock_price_out1/part-r-00000
We can take the output to a text file, /home/hduser/Output/stock_price_out.txt:
hadoop fs -cat /usr/stock_price_out1/part-r-00000 > /home/hduser/Output/stock_price_out.txt
MapReduce Alternatives
MapReduce programming is not easy
- MapReduce programs are written in Java
- Not everyone has a Java background
- We need to be an expert in Java to write MapReduce code for intermediate and complex problems
- Hadoop alone cannot make our life easy
- While HDFS and MapReduce programming solve our issues with big data handling, it is not very easy to write MapReduce code
- One has to understand the logic of the overall algorithm and then convert it to the MapReduce format.
Tools in Hadoop Ecosystem
- If we don't know Java, how do we write MapReduce programs?
- Is there a tool to interact with Hadoop HDFS environment and handle data operations, without writing complicated codes?
- Yes, Hadoop ecosystem has many such utility tools.
- For example, Hive gives us an SQL-like programming interface and converts our queries to MapReduce
Hadoop Ecosystem Tools
- Hive
- Pig
- Sqoop
- Flume
- Hbase
- Zookeeper
- Oozie
- Mahout
and many more!!!
Hadoop Ecosystem Tools – Hive
- Hive is for data analysts with strong SQL skills, providing an SQL-like interface and a relational data model
- Hive uses a language called HiveQL, which is very similar to SQL
- Hive translates queries into a series of MapReduce jobs
Hadoop Ecosystem Tools – Pig
- Pig is a high-level platform for processing big data on Hadoop clusters.
- Pig consists of a data flow language called Pig Latin, which supports writing queries on large datasets, and an execution environment for running programs from a console
- Pig Latin programs consist of a series of dataset transformations that are converted, under the covers, into a series of MapReduce jobs
Conclusion
- We discussed only MapReduce philosophy and high level algorithms
- There are many more details in MapReduce programming
- We saw a few examples of MapReduce code and how Hadoop handles big data
- Java developers can focus on MapReduce coding, Data Analysts can focus on high level scripting tools in Hadoop eco-system
- In later sessions we will learn some important components of the Hadoop ecosystem and more MapReduce programming.