Getting Started with Bigdata
Contents
- What is Bigdata
- What can be done with Big data?
- Handling Bigdata
- Distributed Computing
- Map Reduce
- Hadoop
- Hadoop Sandbox
- HDFS
- Conclusion
Really big data:
- Can you think of running a query on a 20,980,000 GB (roughly 20 PB) file?
- What if we get a new dataset like this every day?
- What if we need to execute complex queries on this dataset every day?
- Does anybody really deal with this type of data set?
- Is it possible to store and analyze this data?
- Yes. Google was already dealing with more than 20 PB of data every day back in 2008, and it collects even more data every day now.
- Running queries on a 20 PB dataset with a conventional SQL database is extremely difficult.
Are there really big datasets?
- Google processes 20 PB a day (2008).
- The Wayback Machine has 3 PB + 100 TB/month (3/2009).
- Facebook has 2.5 PB of user data + 15 TB/day (4/2009).
- eBay has 6.5 PB of user data + 50 TB/day (5/2009).
- CERN’s Large Hadron Collider (LHC) generates 15 PB a year.
In fact, in a minute…
- Email users send more than 204 million messages;
- Mobile Web receives 217 new users;
- Google receives over 2 million search queries;
- YouTube users upload 48 hours of new video;
- Facebook users share 684,000 bits of content;
- Twitter users send more than 100,000 tweets;
- Consumers spend $272,000 on Web shopping;
- Apple receives around 47,000 application downloads;
- Brands receive more than 34,000 Facebook ‘likes’;
- Tumblr blog owners publish 27,000 new posts;
- Instagram users share 3,600 new photos;
- Flickr users add 3,125 new photos;
- Foursquare users perform 2,000 check-ins;
- WordPress users publish close to 350 new blog posts.
- In many places, enormous amounts of data are generated within a single day, and all of the above is generated in just one minute.
Conventional tools and their limitations
- Excel: Have you ever tried a pivot table on a 500 MB file?
- Excel is a good tool for ad-hoc analysis, but opening a 500 MB or 1 GB file starts to hang the system; on a typical machine Excel cannot handle much more than 1 GB of data.
- SAS/R: Have you ever tried a frequency table on a 2 GB file?
- SAS and R are analytical tools, but they tend to give up when the dataset grows beyond about 2 GB.
- Access: Have you ever tried running a query on a 10 GB file?
- Access can handle data and queries up to around 10 GB, but beyond that it is not really going to help.
- SQL: Have you ever tried running a query on a 50 GB file?
- A SQL database on very powerful hardware can handle up to around 50 GB of data, but beyond that it struggles.
- These are the conventional tools: SQL, Excel, Access, R and SAS.
- Conventional tools cannot handle data that arrives at this scale and speed, minute after minute and day after day.
Definition of Bigdata
Any data that is hard to handle using conventional tools and techniques is Bigdata.
- If you just want the number of rows, the number of columns, or the average of a particular column, you simply load the file and compute it.
- But the same technique does not work on a much bigger dataset.
- Thus, any dataset that is hard to handle using conventional tools and techniques is called Bigdata.
Bigdata
Any data that is difficult to:
- Capture
- Curate
- Store
- Search
- Transfer
- Analyze
- Visualize
Bigdata is
- Data that is so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
- Big Data is the data whose scale, diversity, and complexity require new architecture, techniques, tools, algorithms, and hardware to manage it and extract value and hidden knowledge from it.
Bigdata is not just about size
- Volume:
- Data volumes are becoming unmanageable.
- Variety:
- Text Data (Web), Numerical, Images, Audio, Video, Social Network, Semantic Web (RDF), Semi-Structured Data (XML), Multi-Dimensional Arrays, etc.
- Social Networking Sites data
- Velocity:
- Some data is arriving so rapidly, that it must either be processed instantly, or lost. This is a whole subfield called stream processing.
- E-commerce data
- Stock exchange data
Some examples of Bigdata
- Stock exchange transactions data
- Voice clips data
- Video clips data
- Social network data
- Smart phone generated data
- Weblog data
- E-commerce customer data
What can be done with Bigdata?
Uses of Bigdata Analysis
- Ford:
- Collects data from more than 4,000,000 vehicles using in-vehicle sensors.
- Whenever a vehicle is active, its sensors send large amounts of data to Ford's servers every day.
- Ford analyzes this sensor data to improve vehicle quality and to reduce fatal accidents.
- Amazon:
- Collects clickstream and product-search log data from millions of users; after you buy your first few products, the next product that appears in the recommendations is most likely one you are actually looking for.
- Used for improving product recommendation engine.
- AT&T:
- Collects signal data from millions of mobile customers.
- Analyzes the cell tower network usage data and shares it with urban planners and traffic engineers.
- Traffic engineers use it to plan traffic flow and routing, and the data is also passed to traffic police for traffic planning.
- Walmart:
- Runs one of the largest civilian data warehouses in the world.
- From this customer data, Walmart derives many interesting associations through market basket analysis.
- Market basket analysis crunches the data and finds hidden patterns, such as what hurricanes, strawberries and beer have in common.
Bigdata Tool
- Analysis on this Bigdata can give us awesome insights.
- But, by definition, the bigdata can’t be handled using conventional tools.
- The datasets are complex, huge and difficult to process.
- What is the solution?
Handling Bigdata – Using supercomputers
- A supercomputer is one possible solution.
- Put many CPUs (say, a hundred) into one machine and it will return results quickly.
- On a normal laptop this is hopeless: if the dataset itself is a petabyte or more and the machine has perhaps a 1 TB hard disk, even acquiring and storing the data is impossible, let alone analyzing it.
- A supercomputer solves this by packing many CPUs and a huge amount of storage into a single machine.
- The problem is cost: building or buying a supercomputer is so expensive that only very large institutions such as NASA or ISRO, or very big companies, can afford one.
- The price of a supercomputer can easily exceed the value of whatever results you expect to get out of the Bigdata.
- A large dataset should not force a massive investment in a single machine.
- So a supercomputer is a solution, but not a cost-effective one; for individuals it is practically impossible to buy a supercomputer just to perform these operations.
Handling Bigdata: Is there a better way?
- Until around 1985, there was no practical way to connect multiple computers.
- All computers were centralized, stand-alone systems.
- Multi-processor machines or supercomputers were the only options for large data problems.
- After 1985, powerful microprocessors and high-speed computer networks became available.
Handling Bigdata: Distributed systems
- Computer networks (LANs and WANs) led to distributed systems.
- A distributed system makes a collection of independent computers appear to its users as a single coherent system.
- We can therefore connect low-priced computers and use them together to process our Bigdata.
Cluster Computing
- A cluster is simply a set of machines connected together over a LAN or WAN.
- A collection of independent computers joined together using a LAN is called a computer cluster.
- Since a single machine struggles to handle Bigdata, we use cluster computing, i.e. distributed computing, instead.
Handling Bigdata- Distributed computing
Distributed computing
- Take the overall task, divide the data into smaller pieces, and place the pieces on different machines.
- Low-end machines can each handle a small dataset, so a huge dataset is divided into smaller pieces and distributed across all these machines.
- The machines are connected using a LAN or WAN, and the whole cluster of computers can be made to behave like one very large supercomputer.
- The overall problem is divided into smaller pieces in the same way, and each piece runs locally on the machine that holds its data.
- Dividing The Overall Problem into Smaller Pieces
- Suppose the data comes from Facebook and we simply want to count how many 'likes' were generated on a particular day.
- The full dataset of 'likes' may run to terabytes or petabytes, with each row representing one 'like' (or one activity). To count the activities, we divide the data into smaller pieces and place them on these low-end computers.
- The overall objective is divided the same way: computer 1 counts the rows (activities) in data chunk 1, and every other computer does the same for its own chunk, all in parallel; once each system has its local count, the counts are simply added up to get the final answer.
- Dividing the data, storing it locally, and running the task on each chunk is the map step.
- The local, low-level computation carried out on each system is called map.
- Once the map outputs are available, the results from the local machines are collated.
- Take a simple three-node cluster: three computers, with the whole dataset divided across computers 1, 2 and 3.
- Data chunks 1, 2 and 3 are assigned to the three computers, which work in parallel to count the rows in their own chunks; this is the map.
- Once all the map results are in, the reduce is simply the sum of those row counts.
- This is the distributed computing model.
- We can process Bigdata in this parallel programming/distributed computing model; a minimal code sketch follows below.
- This is also known as MapReduce programming.
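- The following is a minimal, self-contained Python sketch (not Hadoop code) that simulates the three-node example above; the data, the chunk sizes and the number of "nodes" are invented for illustration.

# A minimal local simulation of the map/reduce idea described above.
# The chunking, the "count rows" map and the summing reduce mirror the
# three-node example; the data and the number of "nodes" are made up.
from functools import reduce
from multiprocessing import Pool

def map_count_rows(chunk):
    # Map: count the rows (activities) in one data chunk.
    return len(chunk)

def reduce_sum(partial_counts):
    # Reduce: collate the local counts into the final answer.
    return reduce(lambda a, b: a + b, partial_counts, 0)

if __name__ == "__main__":
    # Pretend each element is one Facebook 'like' / activity record.
    data = ["activity_%d" % i for i in range(90_000)]

    # Split the dataset into three chunks, one per "computer".
    n_nodes = 3
    chunk_size = -(-len(data) // n_nodes)   # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    # Map runs on every chunk in parallel (here: separate processes).
    with Pool(n_nodes) as pool:
        partial_counts = pool.map(map_count_rows, chunks)

    # Reduce sums the local results: 90,000 activities in total.
    print(reduce_sum(partial_counts))

- Swapping map_count_rows for any other per-chunk computation (a filter, a partial sum) keeps exactly the same split/map/reduce structure.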
Map Reduce and Network Programming
MapReduce: Programming Model
- MapReduce:
- Map (the local/low-level computation)
- The map goes to every data chunk, wherever it is stored, and computes something locally. The output of the map is the input to the reduce.
- Reduce (the collation of the map results)
- The reduce takes the map output, computes something over it, and finally gives the result.
- Thus we will be processing the data using special map() and reduce() functions.
- Map():
- The map() function is called on every item in the input and emits a series of intermediate key/value pairs (local calculation).
- All values associated with a given key are grouped together.
- Reduce():
- The reduce() function is called on every unique key and its value list, and emits a value that is added to the output (final organization).
- Since the huge dataset cannot be handled with normal computers and conventional tools, we make use of distributed computing.
- You distribute the data first and then split the overall problem so that it can be coded in the MapReduce format; a small word-count sketch follows below.
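- To make the key/value model concrete, here is a minimal, self-contained Python word count written in the map()/reduce() style above. This is only an illustrative sketch, not Hadoop code: the grouping done here with a dictionary stands in for the shuffle that a real framework performs between the two phases, and the input lines are invented.

# Word count expressed with map() and reduce() functions.
from collections import defaultdict

def map_fn(line):
    # Map: emit one (word, 1) key/value pair for every word in the line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: collapse the value list of one key into a single count.
    return key, sum(values)

if __name__ == "__main__":
    lines = [
        "big data is data that is hard to handle",
        "hadoop handles big data",
    ]

    # Map phase: call map_fn on every input item.
    intermediate = [pair for line in lines for pair in map_fn(line)]

    # Grouping: all values associated with a given key are collected together.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: call reduce_fn on every unique key and its value list.
    counts = dict(reduce_fn(k, v) for k, v in grouped.items())
    print(counts["data"])   # 'data' appears 3 times across the two lines

- In a real framework, the grouping/shuffle and the parallel execution are handled for you; the programmer only writes the map and reduce functions.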
Map Reduce is not new
- The underlying idea is not new: to achieve something you need not do it in one go; you can divide the whole problem into pieces, compute intermediate output from the raw data, and then reduce the intermediate results into the final answer.
- The idea existed earlier; what is new is that network programming and networked computing now make MapReduce-style distributed computing practical.
- To handle Bigdata, we need to write MapReduce programs; simple sequential programs are not enough.
MapReduce programs
- The conventional program to count the number records in a data file:
count=count+1
- The MapReduce program to count the number of records in a Bigdata file (a runnable sketch follows below):
Map (runs on each data chunk): count=count+1
Reduce (collates the map outputs): cum_sum=cum_sum+count
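- For concreteness, here is one way this record count could be written as a pair of Hadoop Streaming scripts in Python: Streaming pipes each input split through the mapper and the sorted mapper output through the reducer via stdin/stdout. This is a hedged sketch; the file names mapper.py and reducer.py are only illustrative.

# mapper.py -- the map side of the record count: read the lines of one
# input split from stdin and emit a single partial count as key<TAB>value.
import sys

count = 0
for _ in sys.stdin:            # one iteration per record in this split
    count = count + 1
print("records\t%d" % count)

# reducer.py -- the reduce side: accumulate the partial counts emitted
# by all the mappers into the final total.
import sys

cum_sum = 0
for line in sys.stdin:
    _key, value = line.rstrip("\n").split("\t")
    cum_sum = cum_sum + int(value)
print("records\t%d" % cum_sum)

- In practice these scripts would be submitted to the cluster with the Hadoop Streaming jar; the framework takes care of the input splits, sorting and scheduling.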
More than just a MapReduce program
- The MapReduce program to count the number of records in a Bigdata file:
Map (runs on each data chunk): count=count+1
Reduce (collates the map outputs): cum_sum=cum_sum+count
- Who will setup the network of machines and store the data locally?
- Who will divide and send the map program to local machines and call the reduce program on top of map?
- What if one machine is very slow in the cluster?
- What if there is a hardware failure in one of the machines?
- So it is not just MapReduce; it is not that simple. Running it at scale involves much more than the MapReduce code itself.
Additional scripts for work distribution
- We need to first setup a cluster of machines, then divide the whole data set into blocks and store them in local machines.
- We also need to assign a master node that keeps all the metadata: which block of data sits on which machine.
- We need to write a script that will take care of work scheduling, distribution of tasks and job orchestration.
- We also need to assign worker slots to execute map and reduce functions.
Additional scripts for efficiency
- We need to write scripts for load balancing (what if one machine in the cluster is very slow?).
- We also need to write scripts for data backup, replication and fault tolerance (what if the intermediate data is only partially read?).
- Finally write the map reduce code that solves our problem.
Implementation of MapReduce is difficult
- Analysis on Bigdata can give us awesome insights.
- But, datasets are huge, complex and difficult to process.
- The solution is distributed computing or MapReduce.
- But the data storage, parallel processing, job orchestration and network setup all look complicated.
- What is the solution?
- Is there a ready-made tool or platform that can take care of all these tasks?
- Hadoop
Hadoop
What is Hadoop?
- Hadoop is a bunch of tools.
- Hadoop is a software platform.
- One can easily write and run applications that process bigdata.
- Hadoop is a Java-based open source framework for distributed storage and distributed computing.
Hadoop takes care of many tasks
- Hadoop takes care of many difficult tasks in a MapReduce implementation, such as:
- Data distribution
- Master node
- Slave nodes
- Cluster Management
- Job scheduling
- Data Replication
- Load balancing
Is Hadoop a Database?
- Hadoop is not Bigdata.
- Hadoop is not a database.
- Hadoop is a platform/framework.
- Which allows the user to quickly write and test distributed systems.
- Which is efficient in automatically distributing the data and work across machines.
- Hadoop is a software platform that lets one easily write and run applications that process bigdata.
What does Hadoop do?
- It can reliably store and process petabytes.
- It distributes the data and processing across clusters of commonly available computers.
- As soon as you move data onto HDFS, it divides the data into small pieces (blocks of 64 MB each) and distributes them across the machines.
- By distributing the data, it can process it in parallel on the nodes where the data is located.
- It automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.
Hadoop Main Components
- Four major components
- Hadoop Common: Java libraries/packaged codes/utilities
- Hadoop is a software platform, a kind of software package, built on a set of common Java libraries.
- What are these Java libraries?
- They are compiled libraries that the other Hadoop components rely on for common tasks such as data distribution and scheduling.
- Hadoop is written in Java, so these Java libraries, packaged codes and utilities form the first Hadoop component.
- Hadoop YARN: Framework for job scheduling and cluster resource management
- This handles job scheduling and cluster resource management.
- It keeps track of the cluster's resources: how many cores, how much hard disk space and how much RAM each computer in the cluster has.
- Hadoop YARN is therefore one of the most important components.
- Hadoop Distributed File System: Distributed data storage
- This handles distributed data storage; Bigdata is simply not manageable on a single machine's storage.
- Storing the data efficiently, without losing any of it, is critically important.
- Hadoop MapReduce: Parallel processing and distributed computing
- These are the set of libraries that will help us in distributed computing.
- In MapReduce, the map function is applied to the data chunks in HDFS. The output from the map function is given to the reduce function. The reduce function gives us the desired output.
HDFS
- HDFS stands for Hadoop Distributed File System.
- HDFS is designed to store very large files across the machines in a large network.
- Any file kept on HDFS is divided into small pieces (blocks) and distributed.
- Each file is stored as a sequence of equal-sized blocks, typically 64 MB and sometimes 128 MB; a small block-arithmetic sketch follows below.
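- As a quick back-of-the-envelope check, the block arithmetic can be done in a few lines of Python. The 128 MB block size, the replication factor of 3 and the ~7 GB file size are the figures used in the lab later in this section.

# Rough HDFS block arithmetic, assuming a 128 MB block size and a
# replication factor of 3 (figures taken from the lab below).
import math

file_size_mb = 7 * 1024          # a ~7 GB file
block_size_mb = 128
replication = 3

n_blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_gb = 7 * replication

print(n_blocks)        # 56 blocks for a full 7 GB; the lab reports 55
                       # because the actual file is slightly under 7 GB
print(raw_storage_gb)  # about 21 GB of raw disk across the cluster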
HDFS-Replication
- Blocks are replicated on different machines in the cluster for fault tolerance.
- Replica placement is not random; it is optimized to improve:
- Reliability
- Availability
- Network bandwidth utilization
Name node and Data node
- HDFS stores data as small pieces called blocks and manages these blocks using a master-slave architecture.
- Name node:
- The Master node.
- The name node stores metadata such as namespace information and block information.
- It keeps track of which blocks are on which slave nodes and where each block is replicated across the data nodes.
- Data node:
- The Slave nodes.
- The blocks of data are stored on data nodes.
- Data nodes are not smart; their main role is simply to store the data.
Resource Manager and Node Manager
Resource Manager
- There is only one Resource Manager per cluster.
- Resource Manager knows where the slaves are located (Rack Awareness).
- Keeps track of how many resources each slave has.
- It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.
Node Manager
- There are many node managers in a cluster.
- Node Manager is a slave to Resource Manager.
- Each Node Manager tracks the available data processing resources on its own slave node.
- The processing resource capacity is the amount of memory and the number of cores.
- At run-time, the Resource Scheduler will decide how to use this capacity.
- The Node Manager sends regular reports (heartbeats) to the Resource Manager; a toy heartbeat sketch follows below.
- It also tracks node health and handles log management.
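- The heartbeat mechanism is easy to picture with a toy example. The Python sketch below is not YARN's actual protocol or API; it only illustrates the idea that each node manager periodically reports its capacity and that the resource manager treats a silent node as dead. All names and numbers are invented.

# Toy illustration of heartbeat-based liveness tracking (not YARN code).
import threading
import time

class ToyResourceManager:
    def __init__(self, timeout=3.0):
        self.timeout = timeout   # seconds of silence before a node is "dead"
        self.last_seen = {}      # node_id -> time of last heartbeat
        self.resources = {}      # node_id -> reported capacity

    def heartbeat(self, node_id, memory_mb, vcores):
        # A node manager reports its capacity and proves it is alive.
        self.last_seen[node_id] = time.time()
        self.resources[node_id] = {"memory_mb": memory_mb, "vcores": vcores}

    def live_nodes(self):
        now = time.time()
        return [n for n, t in self.last_seen.items() if now - t < self.timeout]

def toy_node_manager(rm, node_id, stop):
    # Send a heartbeat with this node's capacity every second until stopped.
    while not stop.is_set():
        rm.heartbeat(node_id, memory_mb=8192, vcores=4)
        time.sleep(1.0)

if __name__ == "__main__":
    rm = ToyResourceManager()
    stop = threading.Event()
    for i in range(3):
        threading.Thread(target=toy_node_manager, args=(rm, "node%d" % i, stop),
                         daemon=True).start()
    time.sleep(2.5)
    print(sorted(rm.live_nodes()))   # all three nodes are alive
    stop.set()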
LAB: Hadoop Sandbox
LAB: Demo Hadoop Sandbox
- Install Oracle VM VirtualBox or VMware Player
- Load Hadoop VMware Image
LAB: Starting Hadoop
- Go to the home folder and list its contents
cd ~
ls
- Start Hadoop
start-all.sh
- If the above command doesn't work, then use the following commands:
start-dfs.sh
start-yarn.sh
- Check name node and data nodes
jps
LAB: HDFS Files
- The files on HDFS
hadoop fs -ls /
- Check in the browser
http://localhost:50070
LAB: Copy from Local to HDFS
- Move files to HDFS
hadoop fs -copyFromLocal /home/hduser/datasets/Stock_Price_Data/stock_price.txt /test_on_dfs
- The files on HDFS
hadoop fs -ls /
- Delete files from HDFS
hadoop fs -rmr /test_on_dfs
- The files on HDFS now
hadoop fs -ls /
LAB: Move big data file to HDFS
- Since this is pseudo-distributed (single-node) mode, we are still working on a computer with limited resources.
- So let us take a medium-sized dataset.
- The dataset we are going to work with is the Stack Overflow Tags data, which is already provided.
hadoop fs -copyFromLocal /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_data
- This command copies the local file onto HDFS.
- It takes the path of the file on the local system, followed by the name of the destination on HDFS (/stack_data) onto which the local file is copied.
- This moves a roughly 7 GB file onto HDFS.
- The copy takes some time, because the whole file has to be cut into smaller pieces, their locations recorded, and 3 replicas of each 128 MB data chunk created; the blocks are not used directly as the original file.
- We can, however, inspect each block of data from the browser interface, select the block we need and download it.
- We can check the status of the copy from the browser; every time the page is refreshed, the view is updated and we can see how many blocks have been copied onto HDFS so far.
- The browser also shows the number of blocks, the number of replicas created for each block and the total size of the data; here there are 55 blocks in total.
hadoop fs -ls /
- This command lists all the files on HDFS; we can see that there is a file called "stack_data".
- Since the file is stored as 55 blocks (chunks), we can apply MapReduce to it to compute the required results, such as counting the number of lines in the data.
- Counting the lines of the whole 7 GB file in one go would be slow; applying MapReduce so that the line count runs on the smaller data pieces in parallel gives much faster results.
Conclusion
- In this session we discussed some basic concepts of distributed computing.
- We understood what Bigdata and Hadoop are.
- We discussed the concepts of MapReduce and HDFS.
- In later sessions we will understand MapReduce programming and Hadoop ecosystem tools like Hive and Pig.