Hadoop Fundamentals: HDFS & MapReduce
What is Hadoop?
Imagine a scenario where you have 1 GB of data that you need to process. The data is stored on your desktop computer, which has no problem handling the load. Then your company starts growing very quickly, and that data grows to 10 GB, then 100 GB, and you start to reach the limits of your desktop computer.
So what do you do? You scale up by investing in a larger computer, and then you are fine for a few months. When your data grows from 1 TB to 10 TB, and then 100 TB, you quickly approach the limits of that computer again. Moreover, you have been asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID sensors, and so on. Your management wants to derive information from both the relational and the unstructured data, and wants it as soon as possible. What should you do? Hadoop may be the answer.
Hadoop is an open-source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting. Hadoop uses Google's MapReduce technology as its foundation. It is optimized to handle massive quantities of data, which could be structured, unstructured, or semi-structured, using commodity hardware. Hadoop replicates its data across different computers, so that if one goes down, the data is processed on one of the replicated computers.
Hadoop is not suitable for On-Line Transaction Processing (OLTP) workloads, where structured data is accessed randomly, as in a relational database. Nor is it suitable for On-Line Analytical Processing (OLAP) or Decision Support System workloads. Hadoop is used for Big Data; it complements OLTP and OLAP. It is not a replacement for a relational database system.
What is Big Data?
I would say Big Data is an extremely huge volume of data. But wait a minute, what is huge? There is no hard number that defines huge. There are two reasons why there is no straightforward answer to this question.
- What is considered big or huge today in terms of data size or volume may not be considered big a year from now.
- It's all relative. What you and I consider to be big may not be big for companies like Google or Facebook.
There are three factors that determine whether a problem is a Big Data problem or not. We call them the 3 Vs.
- Volume: How much data there is.
- Velocity: How fast your data is growing.
- Variety: Data coming into the system in different formats that you have to process and analyse.
Why Hadoop?
- Supports huge data volumes
- Storage Efficiency
- Good Recovery Solutions
- Horizontal Scaling
- Cost Effective
- Easy for programmers and non-programmers
Hadoop Distributed File System (HDFS)
Without a file system, information placed in a storage area would be one large body of data, with no way to tell where one piece of information stops and the next begins.
Functions of a file system:
- Control how data is stored and retrieved
- Maintain metadata about files and folders
- Manage permissions and security
- Manage storage space efficiently
Why do we need another file system, i.e. HDFS?
To support truly parallel computation, we have to divide the dataset into blocks and store them on different nodes, and to recover from data loss we also replicate each block on more than one node.
Assume you have a 10-node cluster with EXT4 as the file system on each node; we will refer to the EXT4 on each node as the local file system. The first task of your proposed file system is that when you upload a file, it must divide the dataset into fixed-size blocks. Next, your file system should have a distributed view of the files or blocks in your cluster, which is not possible with the local file system alone: the local file system on node 1 has no idea what is on node 2, and vice versa. The next important thing is replication; since node 1 has no idea what is on node 2, it cannot replicate blocks there, which means we are exposed to data loss.
So now assume we have a file system on top of EXT4, but this time it spreads across all the nodes; there you go, we call this file system the Hadoop Distributed File System (HDFS). HDFS takes care of placing the blocks on different nodes and also takes care of replicating each block to more than one node. By default, HDFS replicates a block to 3 nodes. HDFS keeps track of all the blocks and their node assignments at all times.
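As a quick worked example (assuming the default block size of 128 MB): a 700 MB file would be split into six blocks, five of 128 MB and one of 60 MB, and with a replication factor of 3 the cluster stores 18 block copies in total, spread across the DataNodes.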
HDFS is not a replacement for your local file system. Your OS still relies on the local file system; HDFS is placed on top of it.
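For example, here is a minimal sketch, assuming a Hadoop client whose configuration points at the cluster (the paths are just placeholders), of how an application writes a file into HDFS through the FileSystem API rather than directly into the local file system:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFileIntoHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; HDFS splits it into blocks and
        // replicates each block across DataNodes behind the scenes.
        fs.copyFromLocalFile(new Path("/tmp/ratings.csv"),   // local path (placeholder)
                             new Path("/data/ratings.csv")); // HDFS path (placeholder)

        fs.close();
    }
}
```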
HDFS ARCHITECTURE
The nodes where the blocks are physically stored are known as DataNodes, and they are named so because they hold the actual data for the cluster. Each DataNode knows the list of blocks it is responsible for. A given DataNode knows only about its own blocks and does not care what blocks are stored on other DataNodes.
But this is a problem for us as users, right?
Because we don't know anything about the blocks and, quite honestly, don't care about them. All we know is the file name, and we should be able to work with just the file name in our cluster. So the question is: if the DataNodes do not know which block belongs to which file, then who has that key information?
This key information is maintained by a node called the NameNode. The NameNode keeps track of all the files or datasets in HDFS. For any given file, the NameNode knows the list of blocks that make up the file, and not only the list of blocks but also their locations. The NameNode also holds the metadata of the files and folders in HDFS. Due to its significance, the NameNode is also called the master node, and the DataNodes are also called the slave nodes. The NameNode persists all of the metadata about files and folders to hard disk, except for the block locations.
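As a rough illustration of what the NameNode knows, a client can ask for a file's block locations through the FileSystem API (the path below is just a placeholder); the answer ultimately comes from the NameNode's metadata:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // The client only needs the file name; the NameNode maps it to blocks.
        FileStatus status = fs.getFileStatus(new Path("/data/ratings.csv"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports the DataNodes (hosts) holding its replicas.
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```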
MapReduce
What is MapReduce?
- Distributed programming model for processing large datasets
- Conceived at Google
- Can be implemented in any programming language
- MapReduce is not a programming language
- Hadoop implements MapReduce
Why MapReduce?
- Distributes the processing of data across your cluster
- Divides your data into partitions that are mapped (transformed) and reduced (aggregated) by the mapper and reducer functions you define
How MapReduce Works
1. The mapper converts raw source data into key/value pairs
2. MapReduce sorts and groups the mapped data
3. The reducer processes each key's values
For example, if we are working on the MovieLens dataset and want to see which user reviewed which movies, and also apply a function to count the number of movies rated by each user, it would look something like this.
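As a rough sketch in Java (using the classic Hadoop MapReduce API, and assuming the tab-separated MovieLens u.data layout of userID, movieID, rating, timestamp), the mapper could emit (userID, movieID) pairs and the reducer could count how many movies each user rated:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RatingsByUser {

    // Map: one input line -> (userID, movieID)
    public static class RatingMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // userID \t movieID \t rating \t timestamp
            String[] fields = value.toString().split("\t");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]), new Text(fields[1]));
            }
        }
    }

    // Reduce: (userID, [movieIDs]) -> (userID, number of movies rated)
    public static class RatingReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        protected void reduce(Text userId, Iterable<Text> movieIds, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : movieIds) {
                count++;
            }
            context.write(userId, new IntWritable(count));
        }
    }
}
```

Between the map and reduce phases, MapReduce's shuffle-and-sort step groups all movie IDs by user ID, so each reducer call receives one user ID together with the list of movies that user rated.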
So what's going on under the hood?
You actually kick off your job from a client node: some terminal prompt you have open somewhere is kicking off this MapReduce job from a machine in your cluster. The first thing that happens is that it goes and talks to the YARN Resource Manager.
YARN stands for Yet Another Resource Negotiator. It's another core piece of Hadoop, and it's what manages what gets run on which machines, keeping track of which machines in the cluster are available, which ones have capacity, and so on. The client talks to the Resource Manager and says, hey, I want to kick off a MapReduce job. While it's doing that, it also copies any data it needs to HDFS, so that the data is accessible by all of the different nodes that might be processing it later on.
The next thing that happens is that a MapReduce Application Master gets spun up, and it runs under what we call a Node Manager. Basically, everything running MapReduce work is managed by a Node Manager that keeps track of what its node is doing. The Application Master is responsible for keeping an eye on all of the individual map and reduce tasks, and it works with the Resource Manager to distribute that work across your cluster.
Let's imagine we have two machines in our cluster apart from the Resource Manager. Maybe one of them is running two map tasks, each in a different container or JVM, but both managed by the same Node Manager, which talks to our Application Master and keeps track of what is running where. We might have a second machine running a reduce task later on, and that has its own Node Manager as well, all working in concert to keep track of what is up and what is not. And while these tasks are mapping and reducing, they talk to our HDFS cluster to read the data they need to process and to write out the results at the end.
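To make the client-side part concrete, here is a minimal driver sketch (it reuses the hypothetical RatingsByUser mapper and reducer from above; the input and output paths are placeholders). This is the code that packages the job and submits it to the Resource Manager:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RatingsByUserDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "ratings by user");
        job.setJarByClass(RatingsByUserDriver.class);

        job.setMapperClass(RatingsByUser.RatingMapper.class);
        job.setReducerClass(RatingsByUser.RatingReducer.class);

        // Mapper output types differ from the final output types, so set both.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/u.data")); // HDFS input (placeholder)
        FileOutputFormat.setOutputPath(job, new Path("/data/out")); // HDFS output (placeholder)

        // Submits the job to the Resource Manager; YARN spins up an
        // Application Master, which schedules map and reduce tasks on Node Managers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```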
So now we have a basic understanding of the fundamental concepts of Hadoop: HDFS and MapReduce.