
Handout – MapReduce

Before we start the lesson, please download the datasets.

 

In this section we are going to learn the basics of MapReduce. MapReduce is one of the biggest components of Hadoop: it is the programming part, the part that actually solves the big data problem once the data is stored. We will understand what MapReduce is, what the MapReduce algorithm looks like, how to write MapReduce programs, and we will also see some examples of writing MapReduce programs.

Contents

  • What is MapReduce
  • Map Reduce Algorithm
  • Map Reduce Program
  • Word Count map reduce example

MapReduce Programming

Map Reduce Model

We already discussed distributed computing in the last session, in the introduction to big data and distributed computing. If we have a really large data set, we divide the whole problem into smaller pieces, or smaller problems, and then reach the final objective by combining the intermediate outputs of those smaller problems. Concretely, we divide the whole data set into smaller pieces, store them on local systems, and on each of these pieces we run the Map code, that is, the map function is applied. The output of the map is then given as input to the reducer, and the reducer gives us the final output. So, to achieve the final objective, we do a two-stage Map and Reduce method of computation. Solving the overall problem or objective with map and reduce is called the MapReduce programming model.

  • Distributed Computing for distributed data
  • Achieve the final objective by two stage Map and Reduce method of computation

Map Reduce

MapReduce is inspired by functional programming (Lisp, for example). MapReduce was not the starting point; there was something called functional programming, from which MapReduce was derived. Map takes as input the actual data, or smaller pieces of the actual data, and generates intermediate key-value pairs. The reducer then takes all the intermediate key-value pairs, merges all the intermediate values coming from the map, and generates the final output. So the objective is divided into a map part and a reduce part. Using the MapReduce model, it is very easy for us to scale data processing over multiple computers. We will see some more details of MapReduce: how exactly MapReduce works, how to write the MapReduce code, and so on.

  • Programming style is inspired from functional programming languages (e.g., Lisp)
  • Map()
  • Processes a key-value (KV) pair to generate intermediate key-value pairs
  • Reduce()
  • Merge all intermediate values associated with the same key to generate final result
  • Using MapReduce model it is easy to scale data processing over multiple computing nodes.

How MapReduce Works

To understand the whole process of MapReduce we will take an example; instead of just looking at it as a programming model, we will walk through all the steps. Let's take a word count example. The reason for discussing the word count program here is that it is very easy to understand the MapReduce methodology with this particular example: counting the number of words present in a huge data set. If the data set file is huge and we want to count the number of words appearing in it, how do we do that? The data set can be anything. Let us say the data set is tweets: there is a new product launch, maybe an iPhone or some other product, and at the time of launching that product there are many tweets and social media discussions happening around it. If you take the final data it will be very large, and we want to count the number of words and the frequency of each word. From those words we might even find out which features are being discussed, and so on. Another example: if we take any current topic, like an election or sports, people will be discussing it a lot, and if we gather all that data we can find the most occurring word, the most discussed topic, and so on. How to achieve such objectives is what we are going to discuss.

  • We will understand MapReduce with the help of an example
  • Word count example
  • Imagine a huge data set that contains user comments and discussions about a laptop
  • We want to count the frequency of words to get an idea on the popular features

Map Reduce Word Count Example

Map is a function and reduce is a function. The map function takes the input, a key and a value, and gives the intermediate output. The reduce function takes the output of the map function as its input and gives the final output. So what exactly does map take as input, and what exactly does it give as output? If you want to count the total number of words in a given data file, or the frequency of each word, how do you do that? To convert the whole program into a MapReduce program, we take a particular document. What map does is go line by line, and for every word it finds it assigns that word the digit "1", which denotes one occurrence of that word. The map function simply takes every word in the input document and gives out the word together with "1", which is a key-value pair; it gives key-value pairs as output. That is what map does. The reducer takes the output of map as its input, i.e., the key-value pairs generated by the map function. The reducer will finally sum up: it takes each word with its list of 1s and adds up all the 1s to find the number of occurrences.

The Map Function input and output

So first let's discuss the map function. Map takes a key-value pair as input and gives key-value pairs as output. Let's dig deeper into the map function. Map's input is a key-value pair. What is the key? The text document name; whatever the name of the document is, it can be anything, and it is not that significant. The value is the contents of the text document. So the map function takes the document name as the key and whatever is inside the document as the value, and then runs its logic. What output does map give? It gives each word, with the value 1 assigned to it. So the output of the map function is a key, which is the actual word, and the value associated with that key, which is 1. The map function takes a key-value pair as input and gives key-value pairs as output, but these key-value pairs are different: in the input pair the key is the document name and the value is the contents of the document, while in the output the key is a word and the value is "1", since 1 is assigned to every word. The output will therefore contain many key-value pairs.

  • Input
  • Map function takes a key and value pair as input
  • Key = Text document name
  • Value = the contents of text document
  • Output
  • Map function gives key value pairs as output
  • Key = word
  • Value = 1

Map Reduce Word Count Example

This is the map function we are defining. The key and the value are its parameters: the key is the document name and the value is the text of the document. For each word "w" in the value, i.e., for each word in the text of the document, we emit(w, 1), which produces the tuple (w, 1). So the map function takes a key-value pair as input, where the key is the text document name and the value is the contents, and the output is each word paired with 1.
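To make this concrete, below is a minimal, self-contained sketch of the map step in plain Java. It is not Hadoop code; the class and method names (WordCountMapSketch, map, the example file name laptop.txt) are illustrative assumptions, used only to show how (word, 1) pairs are emitted.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map.Entry;

    // Toy illustration of the map step of word count (plain Java, not Hadoop).
    public class WordCountMapSketch {

        // map(key = document name, value = document text):
        // emit an intermediate (word, 1) pair for every word in the text.
        static List<Entry<String, Integer>> map(String docName, String docText) {
            List<Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : docText.split("\\s+")) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
            return pairs;
        }

        public static void main(String[] args) {
            // key = text document name, value = contents of the document
            String key = "laptop.txt";
            String value = "Laptop screen issue Laptop screen problem screen size issue";

            // Prints pairs such as (Laptop, 1), (screen, 1), (issue, 1), ...
            for (Entry<String, Integer> p : map(key, value)) {
                System.out.println("(" + p.getKey() + ", " + p.getValue() + ")");
            }
        }
    }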

How Map Function Works

Let's say this is the input: "Laptop screen issue", "Laptop screen problem", "screen size issue". Say these three lines are the text inside the text document; what does the map function do? The map function takes a key and a value as input, where the key is laptop.txt, the text file name, and the value is review_text, i.e., whatever is in these three lines of input. For each word "w" in that text, it emits (w, 1).

  • If we apply this map function to the data given above, the output will be as described below.

If we take the first line, it goes through the map function and the output is each word followed by 1: (Laptop, 1), (screen, 1), (issue, 1). This is what map does; the same is done for the second and the third lines.

This is the output of the map function for the given data.

Always remember that the map function's output is not the final output. The map function gives us an intermediate output, which is taken as input by the reduce function; this intermediate output is what we have after the execution of the map function.

The Reduce Function input and output

The reduce function takes each word and its frequency as input, or rather each word and its list of 1s, and for the output it simply sums up all these 1s. How does the reducer work? For the reduce function, the input parameters are a key and a set of values: the key is a word and the values are an iterator over its counts. Basically the reducer takes each word, starts a running total, and does result = result + 1, like count = count + 1, for every value. It finally emits each word together with the sum of the 1s, i.e., the total count, which is the word frequency or word count.

  • Input
  • Reduce function takes a key and set of value pair as input
  • Key=word
  • Value=all counts list of that word
  • Output
  • Reduce function gives key value pairs as output
  • Key=word
  • Value=sum of counts

How Reduce Function Works

Now the reduce function: what exactly does reduce take? It takes the output of the map function as its input, so the reduce function takes a key and values as input. The key is each word and the values are all the counts for that word, i.e., how many times each word appears. If screen appeared thrice, reduce takes (screen, [1, 1, 1]), the list of all the 1s emitted for that word, as input. The map output was simply each word with a 1.

The key-value pairs are shuffled because some of the keys are the same, like laptop, which appears multiple times; we shuffle them and prepare them for reduce. Some keys have more than one value.

Here screen appeared thrice, so after shuffling it becomes one key with a list of values. The reduce function then takes the key and its values as input, sums all the 1s, i.e., takes all the 1s of each word and adds them up, and gives us the final output.

The reduce function takes these shuffled key-value pairs as input and produces the final output. If we pass the shuffled data into the reduce function, we come to know that laptop appeared twice, and so on. The reduce function simply gives each word and the sum of its counts, the sum of the frequencies, the number of 1s, i.e., the number of times that word appears, as the output.
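Continuing the same toy example, below is a minimal plain-Java sketch of the shuffle step (grouping the intermediate pairs by key) and the reduce step (summing the 1s for each word). Again, this is not Hadoop code; the class and method names are illustrative assumptions.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;

    // Toy illustration of the shuffle and reduce steps of word count (plain Java, not Hadoop).
    public class WordCountReduceSketch {

        // Shuffle: group the intermediate (word, 1) pairs so that every key appears
        // once, together with the list of all values emitted for it.
        static Map<String, List<Integer>> shuffle(List<Entry<String, Integer>> pairs) {
            Map<String, List<Integer>> grouped = new LinkedHashMap<>();
            for (Entry<String, Integer> p : pairs) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
            return grouped;
        }

        // reduce(key = word, values = list of counts): return the sum of the counts.
        static int reduce(String word, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) {
                sum += c;
            }
            return sum;
        }

        public static void main(String[] args) {
            // Intermediate output of the map step for the three example lines.
            List<Entry<String, Integer>> mapOutput = new ArrayList<>();
            for (String w : "Laptop screen issue Laptop screen problem screen size issue".split(" ")) {
                mapOutput.add(new SimpleEntry<>(w, 1));
            }

            // Shuffle, then reduce each key: e.g. screen -> 3, Laptop -> 2, issue -> 2.
            for (Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
                System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
            }
        }
    }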

Shuffle & Reduce Stage

What is the reduce function doing? Reduce is not just reduce; it is actually shuffle and reduce, so the reduce stage is a combination of a shuffle stage and a reduce stage. What is shuffling? Some of the keys appear more than once, so we shuffle and rearrange them so that we have unique keys with their corresponding lists of values. The reducer's job is then to take the output from the map and give the final output. Once the data is processed, the final output is sent to HDFS.

  • Reduce stage is the combination of the Shuffle stage and the Reduce stage.
  • The Reducer’s job is to process the data that comes from the mapper.
  • After processing, it produces a new set of output, which will be stored in the HDFS

Scaling-up is easy in MapReduce

Why is MapReduce programming such an important factor in distributed computing? Scaling up is really easy in MapReduce: once you write the map and reduce programs, no matter how big the data is, it is very easy to scale up.

How Map Function Works

Earlier we had a very small file with just three lines. Now imagine a huge input file.

We can divide the data and store it on a cluster of machines. As we saw, the moment you move a file onto HDFS it is divided into smaller pieces of 128 MB or 64 MB each and stored. Now we can apply MapReduce on each of these pieces simultaneously. Running a MapReduce program on one huge data set file would take a lot of time, but if you run MapReduce on the smaller pieces simultaneously, you save a lot of time.

Divide the data, put it on a cluster of machines, and then apply the same map function on each piece of the data. The map function takes the text as input and gives as output each word together with the digit '1' associated with it. Applying the same map function in parallel, i.e., simultaneously, on each of these pieces saves a lot of run time. In the normal case, while an ordinary program is running on the first piece of data, the last piece of data would sit idle, whereas with MapReduce we can run map on all the data pieces simultaneously.

The map output is simply each word and '1', which acts as a key-value pair, and then we get the shuffled pairs.

Then you can apply the same reduce function, without any changes, across all the data pieces. The reduce function takes the key-value pairs, i.e., each word and its list of 1s, and finally gives as output the count, the sum of all those 1s, i.e., how many times laptop, screen, etc. appear.
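As a rough illustration of this idea, the plain-Java sketch below uses parallel streams to simulate applying the same word-count logic to several splits at the same time and merging the results. The split contents are made-up examples, and this is only a toy simulation, not how Hadoop actually schedules map and reduce tasks.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    // Toy simulation: the same per-split "map" logic runs in parallel and the counts are merged.
    public class ParallelSplitsSketch {

        public static void main(String[] args) {
            // Imagine the big file has been divided into blocks/splits stored on different nodes.
            List<String> splits = Arrays.asList(
                    "Laptop screen issue",
                    "Laptop screen problem",
                    "screen size issue");

            // Map: each split is tokenized into words, in parallel across splits.
            // Group + count: words are grouped by key and their occurrences summed,
            // which mirrors the shuffle and reduce stages.
            Map<String, Long> wordCounts = splits.parallelStream()
                    .flatMap(split -> Arrays.stream(split.split("\\s+")))
                    .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

            wordCounts.forEach((word, count) -> System.out.println(word + "\t" + count));
        }
    }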

So we can apply the same map and reduce functions on a bigger, distributed data set; this is the distributed style of computing. The same MapReduce functions can be applied to the bigger data set and we get the result. That is one of the main advantages of MapReduce: scaling up is very easy, and that is why Hadoop uses the MapReduce style of coding. We will now see an example: we will take one data set, write the map and reduce programs, and submit them to Hadoop. First we take the data and distribute it, and then on each of the pieces we run the map and reduce functions. Hadoop helps us do this very easily, and then we will look at the output. This is the basic introduction to MapReduce.

LAB Map Reduce Program

LAB: Map Reduce for word count

Now we will work on the dataset called "user reviews". Move the dataset User_Reviews from the local file system to HDFS, then write a word count program to count the frequency of each word, and finally take the output as a text file.

  • Dataset User_Reviews/User_Reviews.txt
  • Move the data to hdfs
  • Write a word count program to count the frequency of each word
  • Take the final output inside a text file

Solution

Check whether the Hadoop services are started; if not, type the following command as hduser:

        start-all.sh 

Now check the Hadoop services by typing:

        jps
  • check your files on hdfs
        hadoop fs -ls /

If the file is not present in HDFS, we have to bring it into HDFS using the following command.

  • Bring the data onto hadoop HDFS
        hadoop fs -copyFromLocal /home/hduser/datasets/User_Reviews/User_Reviews.txt /user_review_hdfs

Once again, check the availability of the file in HDFS after moving it from the local directory.

  • Check the data file on HDFS

        hadoop fs -ls /

Now we can see that the file is present in HDFS and we can proceed. Next we will write a MapReduce program for it: we need to write the map code and the reduce code. Remember that Hadoop is built on Java, so we need to write MapReduce programs in Java; if you want to write MapReduce programs, you need to learn Java.

  • check your current working directory
        cd

We will be using a word count program which is already written. To do this, first change the directory and go to the Hadoop bin folder.

  • Go to hadoop bin

        cd /usr/local/hadoop/bin/

It is important that your PWD (present working directory) is the Hadoop bin directory (/usr/local/hadoop/bin). Once you are here, you have to create WordCount.java. Open an editor with the file name WordCount.java:

    sudo gedit WordCount.java

When you use sudo, you need to enter the system password that you use for logging in. This command opens a file in the gedit editor, and within that file you need to write the word count program. A word count program has a mapper and a reducer: the map function emits simple key-value pairs and the reducer calculates the sum; that is how MapReduce works.

In this code, the two functions, map and reduce, are the important parts; the rest of the code imports the libraries that are needed throughout the program. Copy the Java code below, paste it into your file and save the file.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: for every line of the input file, emit (word, 1) for each word in the line.
        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable>{

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                 word.set(itr.nextToken());
                 context.write(word, one);
                }
            }
        }
        // Reducer (also used as the combiner): sum all the 1s for a word and emit (word, total).
        public static class IntSumReducer
                extends Reducer<Text,IntWritable,Text,IntWritable> {
                private IntWritable result = new IntWritable();

                public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: configure the job, set the mapper, combiner, reducer and the input/output paths.
        public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

To compile this program, use the below command

    hadoop com.sun.tools.javac.Main WordCount.java

Create the jar file, which is named wc.jar:

    jar cf wc.jar WordCount*.class

Run the word count program. To run it, we give a command that consists of the word count jar file, the WordCount class, the input path, i.e., /user_review_hdfs, which is the data the word count program will run on, and the output path, where we will get the final output.

    hadoop jar wc.jar WordCount /user_review_hdfs /usr/review_count_out

Have a look at the output

    hadoop fs -cat /usr/review_count_out/part-r-00000

Part of the output

We can save the output into a text file so that we can open it later. For that, we first create a new directory:

    mkdir /home/hduser/Output/

Now we copy the output into a text file named review_word_count.txt, saved in the Output folder: /home/hduser/Output/review_word_count.txt

        hadoop fs -cat /usr/review_count_out/part-r-00000 > /home/hduser/Output/review_word_count.txt

Since we wrote a simple word count, without taking care of punctuation and spaces, the output looks a bit rough; we can modify the map and reduce programs to make it look better (see the sketch below). This exercise is just to give an idea of how MapReduce works. This was a small file; in our next exercise we will take a huge file and try to count the number of lines. It is a little difficult to open a large dataset in Excel, so we could try loading it into SQL to understand the data, and then we will find the number of lines in it in our next exercise.
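For example, one possible modification is sketched below: a drop-in replacement for the map() method inside the TokenizerMapper class from the WordCount program above, which lowercases each token and strips non-alphanumeric characters before emitting it, so that "Screen," and "screen" are counted as the same word. This is an illustrative sketch only, not part of the lab code.

    // Variant of TokenizerMapper.map(): clean each token before emitting (word, 1).
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // Lowercase the token and drop anything that is not a letter or digit.
            String cleaned = itr.nextToken().toLowerCase().replaceAll("[^a-z0-9]", "");
            if (!cleaned.isEmpty()) {
                word.set(cleaned);
                context.write(word, one);   // emit the cleaned word with count 1
            }
        }
    }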

LAB: Map Reduce for Line Count

Let's do an exercise with a larger data set. Consider the Stack Overflow tags data, which is almost 7 GB in size; with a block size of 128 MB, around 56 blocks will be required. We will write a MapReduce program to count the number of lines in that file. The data we will work with now is from Stack Overflow; it simply contains questions, answers and tags.

  • Dataset: Stack_Overflow_Tags/final_stack_data.zip
  • Unzip the file and see the size of the data.
  • The dataset contains some Stack Overflow questions. The goal is to find out the total number of questions in the file (number of rows).
  • Move the data to HDFS.
  • Write a line count program to count the number of lines.
  • Take the final output inside a text file.

Solution

Is the Hadoop started?

    jps

Start Hadoop if not started already

    start-all.sh

Check whether Hadoop has started by using the jps command:

    jps

check your files on hdfs

    hadoop fs -ls /

The dataset is Stack_Overflow_Tags/final_stack_data.zip. Unzip the data first; unzipping takes some time.

    sudo unzip /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.zip -d /home/hduser/datasets/Stack_Overflow_Tags/

Bring the data onto Hadoop HDFS. The file size is huge, so this step takes some time.

    hadoop fs -copyFromLocal /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_overflow_hdfs

Check the data file on HDFS

    hadoop fs -ls /

check your current working directory

    cd

Goto hadoop bin

    cd /usr/local/hadoop/bin/

It is important that your PWD (present working directory) is the Hadoop bin directory. Open an editor with the file name LineCount.java:

    sudo gedit LineCount.java

Copy the Java code below, paste it into your file and save the file. The logic here is slightly different: the map function emits a 1 for every line, and the reduce function takes all these 1s and finally sums them up. Here also we use the MapReduce concept, this time to count the number of lines in the data set.


    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LineCount{
        // Mapper: for every line of the input, emit the constant key "Total Lines" with value 1.
        public static class LineCntMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text keyEmit = new Text("Total Lines");
        private final static IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value, Context context){
            try {
                context.write(keyEmit, one);
            } 
            catch (IOException e) {
                e.printStackTrace();
                System.exit(0);
            } 
            catch (InterruptedException e) {
                e.printStackTrace();
                System.exit(0);
            }
        }
    }

        // Reducer: sum all the 1s emitted for "Total Lines" to get the total number of lines.
        public static class LineCntReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context){
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            try {
                context.write(key, new IntWritable(sum));
            } 
            catch (IOException e) {
                e.printStackTrace();
                System.exit(0);
            } 
            catch (InterruptedException e) {
                e.printStackTrace();
                System.exit(0);
            }
        }
    }
    
        public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "line count2");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineCntMapper.class);
        job.setCombinerClass(LineCntReducer.class);
        job.setReducerClass(LineCntReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

To compile this program, use the below command


    hadoop com.sun.tools.javac.Main LineCount.java

Create the jar file, which is named lc.jar:

    jar cf lc.jar LineCount*.class

Run the line count program. MapReduce starts on each of the pieces, or blocks, in parallel: there are around 56 splits, map is submitted to each split, and on each split the map starts. It will take some time to run since it is a 7 GB file. The output of map goes as input to reduce, and the reducer works on it and gives the output: how many lines there are, i.e., how many Stack Overflow questions were asked in that particular month, which is the final objective we are trying to find. While it runs you can see how MapReduce works on a 7 GB data set; in between, you can see what percentage of map and what percentage of reduce has completed. The output is generated and stored in /usr/stack_overflow_out; you can check the output from the browser as well.


    hadoop jar lc.jar LineCount /stack_overflow_hdfs /usr/stack_overflow_out
    

Check the output here http://localhost:50070/explorer.html#/

After the execution, now let us see the output

    hadoop fs -cat /usr/stack_overflow_out/part-r-00000

So there are around six million total lines, which is around 60 lakhs; around six million questions were asked on the Stack Overflow website in that particular month. We can take the output to a text file, /home/hduser/Output/stack_overflow_out.txt:

    hadoop fs -cat /usr/stack_overflow_out/part-r-00000 > /home/hduser/Output/stack_overflow_out.txt

LAB: Map Reduce Function-Average

In this lab we will write a MapReduce program that does some mathematics, like averages, instead of a word or line count; here we will try to find an average. Consider stock price data: a stock varies every second and there can be many fluctuations, so we have taken the data of a particular stock, its price over a period of time, and we simply want to find the average stock price. The usual way of finding the average is to load the data into one system and compute the average, but if the data set is too large we can write a map and reduce program, so we will find the average stock price using the MapReduce style of coding. Let's start the exercise.

  • Dataset: Stock_Price_Data/stock_price.txt
  • The dataset contains stock price data collected for every minute.
  • Find the average stock price.

Solution

To check whether Hadoop is running, use the command jps:

    jps

Start Hadoop if not started already

    start-all.sh

Whether Hadoop has started can now be checked by:

    jps

check your files on hdfs

    hadoop fs -ls /

For the data set stock_price.txt, we are trying to find the average stock price, so first we move the stock price data onto HDFS. The command for copying from local to HDFS is:

    hadoop fs -copyFromLocal /home/hduser/datasets/Stock_Price_Data/stock_price.txt /stock_price

Once it is moved onto HDFS, we can check using hadoop fs -ls whether the file has been moved:

    hadoop fs -ls /

check your current working directory

    cd

Goto hadoop bin

    cd /usr/local/hadoop/bin/

For finding the average we change the directory to the Hadoop bin folder; if you are already inside bin, you do not need to change the directory again. Now we can create the Java file AvgMapred.java using gedit and open it. It is important that your PWD (present working directory) is the Hadoop bin directory.

Open an editor with a file name AvgMapred.java

    sudo gedit AvgMapred.java

Copy the below java code, paste in your file and save your file


    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class AvgMapred {
    /*
     * data schema(tab separated) : 2000.16325 4664654.78955 46513123.134165
     */
    // Mapper: emit every stock price under one constant key so the reducer sees all values together.
    public static class MapperClass extends
            Mapper<LongWritable, Text, Text, FloatWritable> {
        public void map(LongWritable key, Text empRecord, Context con)
                throws IOException, InterruptedException {
            // Each input value is one line of the file; word[0] is that line (the price).
            String[] word = empRecord.toString().split("\n");
            String flg = " ";
            try {
                Float salary = Float.parseFloat(word[0]);
                con.write(new Text(flg), new FloatWritable(salary));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    // Reducer: sum all the prices and divide by the count to get the average.
    public static class ReducerClass extends
            Reducer<Text, FloatWritable, Text, Text> {
        public void reduce(Text key, Iterable<FloatWritable> valueList,
                Context con) throws IOException, InterruptedException {
            try {
                Double total = (double) 0;
                int count = 0;
                for (FloatWritable var : valueList) {
                    total += var.get();
                    //System.out.println("reducer " + var.get());
                    count++;
                }
                Double avg = (double) total / count;
                String out = "Total: " + total + " :: " + "Average: " + avg;
                con.write(key, new Text(out));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            Job job = Job.getInstance(conf, "FindAverageAndTotalSalary");
            job.setJarByClass(AvgMapred.class);
            job.setMapperClass(MapperClass.class);
            job.setReducerClass(ReducerClass.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(FloatWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

    }
    }
    

The map function emits each stock price and passes it as input to the reduce function. The reducer sums all the values coming from the map function and divides by the count, which gives the average. To compile this program, use the below command

    hadoop com.sun.tools.javac.Main AvgMapred.java

Create the jar file, which is named avg.jar:

    jar cf avg.jar AvgMapred*.class

Run the average program; the output will be automatically routed to the output path given in the command:

    hadoop jar avg.jar AvgMapred /stock_price /usr/stock_price_out1


Have a look at the output

    hadoop fs -cat /usr/stock_price_out1/part-r-00000

We can take the output to a text file /home/hduser/Output/stock_price_out.txt

    hadoop fs -cat /usr/stock_price_out1/part-r-00000 > /home/hduser/Output/stock_price_out.txt

MapReduce Alternatives

MapReduce programming is not easy

  • MapReduce programs are written in Java
  • Not everyone has Java background
  • We need to be an expert in Java to write MapReduce code for intermediate and complex problems
  • Hadoop alone cannot make our life easy
  • While HDFS and MapReduce programming solve our issues with big data handling, it is not very easy to write MapReduce code
  • One has to understand the logic of the overall algorithm and then convert it to the MapReduce format.

Tools in Hadoop Echo System

  • If we don't know Java, how do we write MapReduce programs?
  • Is there a tool to interact with the Hadoop HDFS environment and handle data operations, without writing complicated code?
  • Yes, the Hadoop ecosystem has many such utility tools.
  • For example, Hive gives us an SQL-like programming interface and converts our queries to MapReduce

Hadoop Ecosystem Tools

  • Hive
  • Pig
  • Sqoop
  • Flume
  • Hbase
  • Zookeeper
  • Oozie
  • Mahout

and many more!!!

Hadoop Ecosystem Tools – Hive

  • Hive is for data analysts with strong SQL skills providing an SQL-like interface and a relational data model
  • Hive uses a language called HiveQL; very similar to SQL
  • Hive translates queries into a series of MapReduce jobs

Hadoop Ecosystem Tools – Pig

  • Pig is a high-level platform for processing big data on Hadoop clusters.
  • Pig consists of a data flow language called Pig Latin, which supports writing queries on large datasets, and an execution environment that runs programs from a console
  • Pig Latin programs consist of a series of dataset transformations that are converted, under the covers, into a series of MapReduce programs

Conclusion

  • We discussed only MapReduce philosophy and high level algorithms
  • There are many more details in MapReduce programming
  • We saw a few examples of MapReduce code and how Hadoop handles big data
  • Java developers can focus on MapReduce coding, Data Analysts can focus on high level scripting tools in Hadoop eco-system
  • In later sessions we will learn some important components of the Hadoop ecosystem and more MapReduce programming.

 

 
