Hadoop is an open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
Hadoop allows unstructured data to be distributed across hundreds or thousands of machines forming shared-nothing clusters, and Map/Reduce routines to be executed on the data in that cluster.
Hadoop has its own filesystem, which replicates each piece of data to multiple nodes so that if one node holding the data goes down, there are at least two other nodes from which to retrieve it. This protects data availability against node failure, which is critical when there are many nodes in a cluster.

The base Apache Hadoop framework is composed of the following modules:
• Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users’ applications.
• Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.

Introduction to Hadoop
In a Hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in. The Hadoop Distributed File System (HDFS) will split large data files into chunks which are managed by different nodes in the cluster. In addition, each chunk is replicated across several machines, so that a single machine failure does not make any data unavailable.
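The splitting and replication described above can be sketched as follows. This is a minimal, illustrative simulation, not Hadoop code: the block size and replication factor mirror common HDFS defaults, but the round-robin placement stands in for HDFS's real rack-aware placement policy.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte ranges (offset, length) of each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (round-robin)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = place_replicas(blocks, nodes)
# The file becomes three blocks (128 MB, 128 MB, 44 MB), each stored on
# three different nodes, so any single node failure leaves two copies.
```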

Data is conceptually record-oriented in the Hadoop programming framework. Individual input files are broken into lines or into other formats specific to the application logic. Each process running on a node in the cluster then processes a subset of these records. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. The data a node operates on is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and avoiding unnecessary network transfers.
This strategy of moving the computation to the data, instead of moving the data to the computation, allows Hadoop to achieve high data locality, which in turn results in high performance.
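The locality preference can be sketched as a scheduler that, given the replica locations of a chunk, prefers an idle node that already holds a copy. The function and names here are illustrative assumptions, not Hadoop's actual scheduler API:

```python
def pick_node(replica_nodes, free_nodes):
    """Prefer a free node that holds a local replica of the chunk;
    fall back to any free node (a remote read) only if none does."""
    for node in free_nodes:
        if node in replica_nodes:
            return node, "local"   # data read straight from local disk
    return free_nodes[0], "remote"  # data must cross the network

# A chunk is replicated on node2, node4, node5; node1 and node4 are idle.
node, kind = pick_node({"node2", "node4", "node5"}, ["node1", "node4"])
# The scheduler picks node4, so the task reads its input locally.
```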
Hadoop limits the amount of communication which can be performed by the processes, as each individual record is processed by a task in isolation from the others. While this sounds like a major limitation at first, it makes the whole framework much more reliable. Hadoop will not run just any program and distribute it across a cluster. Programs must be written to conform to a particular programming model, named “MapReduce.”

In MapReduce, records are processed in isolation by tasks called Mappers. The output from the Mappers is then brought together into a second set of tasks called Reducers, where results from different mappers can be merged together.
Hadoop internally manages all of the data transfer and cluster topology issues. By restricting the communication between nodes, Hadoop makes the distributed system much more reliable.
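The Mapper/Reducer flow can be illustrated with the classic word-count example, here simulated in plain Python rather than the actual Hadoop Java API: each mapper processes its records in isolation and emits (word, 1) pairs, the framework groups the pairs by key (the "shuffle" the previous paragraph says Hadoop manages internally), and each reducer merges the counts for one word.

```python
from collections import defaultdict

def mapper(record):
    """Process one record in isolation, emitting (key, value) pairs."""
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """The framework's job: group all values by key across all mappers."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Merge all values for one key into a single final result."""
    return (key, sum(values))

records = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for record in records for pair in mapper(record)]
results = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# results["the"] == 3 and results["fox"] == 2
```

Note that no mapper ever talks to another mapper: all communication happens through the shuffle step, which is exactly the restriction that lets Hadoop re-run a failed task on another node without coordinating with the rest of the job.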

Author: Achala Sharma