Consider the case where you want to store a file bigger than the capacity of your PC or laptop. How do you do it? The solution lies in a framework called Hadoop, which lets you store very large files, and very many of them.
These days, with the advent of digitization, enormous amounts of data are created and stored. This structured and unstructured data is generated from sources such as social media, smartphones, tablets, machine data and videos. Roughly 2.5 quintillion (that is, 18 zeros) bytes of data are generated daily. If that much data were burned to Blu-ray discs and the discs stacked, the stack would reach roughly twice the height of the Eiffel Tower. About 300 million photos are uploaded and 4.75 billion pieces of content are shared daily on Facebook alone. To address this challenge, Big Data technologies, and Hadoop in particular, came into the picture.
Apache Hadoop (written in Java) is open-source Big Data software. It distributes data sets across clusters of servers (up to thousands of nodes) to enable the processing of very large amounts of data. It is designed to scale up from a single machine to thousands of machines, each offering local computation and storage. Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
The Apache Hadoop framework is composed of the following four modules:
- Hadoop Common or Hadoop Core: A collection of common libraries and utilities that support the other Hadoop modules. It is an essential module, designed so that hardware failures are handled automatically in software by the Hadoop framework. Because it provides basic processes and essential services, such as abstraction of the operating system and its file system, it is considered the base/core of the framework. It also contains the JAR files and scripts required to start Hadoop, along with source code and documentation from the Hadoop community.
- Hadoop YARN: YARN stands for Yet Another Resource Negotiator, a cluster management technology. YARN schedules applications, allocates cluster resources to them, and helps enforce consistent operations, data governance and security across Hadoop clusters. YARN extends the power of Hadoop by letting new processing technologies run on the cluster and take advantage of its linearly scalable, cost-effective storage and processing.
- Hadoop Distributed File System (HDFS): A Java-based file system that provides reliable and scalable data storage. It scales to around 200 petabytes of storage in a single cluster of 4,500 servers. With HDFS providing storage and YARN enabling multiple data access applications, Hadoop can answer questions that were impractical for earlier data platforms.
The Hadoop Distributed File System was derived from GFS (the Google File System) and provides a distributed file system that runs reliably and fault-tolerantly on large clusters of commodity computers.
HDFS is fault-tolerant and scalable, and works alongside data access applications coordinated by YARN. Because both storage and computation are distributed, the system can grow linearly with demand, keeping each incremental increase in storage economical.
HDFS uses a master-slave architecture: the master is a single node called the Name Node, which manages the file system metadata, and the slaves are one or more servers called Data Nodes, which store the actual data, as shown in the figure.
The master-slave node structure in Hadoop
How it functions: a file is split into blocks, and these blocks are stored across Data Nodes as instructed by the Name Node (the master). The Data Nodes handle read and write requests and, on the Name Node's instructions, are also responsible for block creation, deletion and replication.
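The block-splitting and replication described above can be sketched in a few lines of Python. This is a conceptual simulation, not the real HDFS API: the block size, node names and round-robin placement policy are all illustrative assumptions (real HDFS uses 128 MB blocks by default and a rack-aware placement policy).

```python
# Conceptual sketch of Name Node behaviour (NOT the real HDFS API):
# split a file into fixed-size blocks, then assign each block to
# several Data Nodes for replication.

BLOCK_SIZE = 128   # bytes, for this demo; HDFS defaults to 128 MB
REPLICATION = 3    # HDFS's default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, data_nodes, replication: int = REPLICATION):
    """Illustrative round-robin placement: each block is copied to
    `replication` distinct Data Nodes."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [data_nodes[(i + r) % len(data_nodes)]
                        for r in range(replication)]
    return placement

data = b"x" * 300                   # a 300-byte "file"
blocks = split_into_blocks(data)    # -> 3 blocks: 128 + 128 + 44 bytes
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(blocks, nodes)
print(len(blocks))                  # 3
print(placement[0])                 # ['dn1', 'dn2', 'dn3']
```

If a Data Node fails, the Name Node notices the missing copies and re-replicates those blocks on healthy nodes, which is why the data stays available despite hardware failures.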
- Hadoop MapReduce: A software framework for writing applications that process large amounts of structured, semi-structured and unstructured data stored in HDFS.
It performs two different categories of tasks:
- The Map task: takes the input data and converts each element into a set of intermediate key/value pairs (tuples).
- The Reduce task: takes the output of the Map task as its input and combines those key/value pairs into a smaller set of results. The Reduce task always runs after the Map task.
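The two tasks above can be illustrated with the classic word-count example, written here as plain Python functions. This is a conceptual sketch of the Map and Reduce steps, not Hadoop's actual Java MapReduce API; the function names are illustrative.

```python
# Word-count sketch of the Map and Reduce tasks (conceptual, not the
# real Hadoop API).

from collections import defaultdict

def map_task(line: str):
    """Map: emit one (word, 1) key/value pair per word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_task(pairs):
    """Reduce: sum the counts per key, producing a smaller result set."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big clusters", "big data"]
intermediate = [pair for line in lines for pair in map_task(line)]
result = reduce_task(intermediate)
print(result)   # {'big': 3, 'data': 2, 'clusters': 1}
```

In a real cluster, the framework runs many Map tasks in parallel (one per input block), shuffles the intermediate pairs so that all values for a key reach the same Reduce task, and then runs the Reduce tasks in parallel as well.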
MapReduce consists of a master JobTracker and one slave TaskTracker per cluster node. The JobTracker manages resource availability and consumption, schedules a job's component tasks on the slaves, monitors them, and re-executes failed tasks. Each TaskTracker executes the tasks assigned by the master and periodically reports task status back to it.
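The division of labour between JobTracker and TaskTracker can be sketched as a toy simulation. This models only the roles described above (assignment and status reporting); the class names, round-robin scheduling and status strings are illustrative assumptions, not Hadoop's actual implementation.

```python
# Toy simulation of the JobTracker/TaskTracker roles described above
# (illustrative only, not Hadoop's real code).

class TaskTracker:
    """Slave: executes assigned tasks and reports status."""
    def __init__(self, name):
        self.name = name
        self.completed = []

    def run(self, task):
        self.completed.append(task)               # execute the task
        return (self.name, task, "SUCCEEDED")     # status report to master

class JobTracker:
    """Master: schedules tasks on the slaves and collects reports."""
    def __init__(self, trackers):
        self.trackers = trackers
        self.reports = []

    def schedule(self, tasks):
        # Simple round-robin assignment across the available slaves.
        for i, task in enumerate(tasks):
            tracker = self.trackers[i % len(self.trackers)]
            self.reports.append(tracker.run(task))

trackers = [TaskTracker("tt1"), TaskTracker("tt2")]
jt = JobTracker(trackers)
jt.schedule(["map-0", "map-1", "reduce-0"])
print(jt.reports)
```

A real JobTracker also watches these status reports for failures and reschedules a failed task on another TaskTracker, which is how MapReduce jobs survive individual node crashes.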