What are HDFS, MapReduce, YARN, HBase, Hive, Pig, and MongoDB in Apache Hadoop Big Data?

What is Apache Hadoop?

Apache Hadoop is an open-source framework written in Java. Open source means it is freely available and we can even change its source code as per our requirements. It is a platform for large-scale data storage as well as processing: it efficiently processes large volumes of data on a cluster of commodity hardware (low-cost systems running Linux or Microsoft Windows). Much of Hadoop's code has been contributed by companies such as Yahoo, IBM, Facebook, and Cloudera.

[Figure: Apache Hadoop YARN block diagram]

It provides a framework for parallel processing of data, as it runs jobs on multiple machines simultaneously. A group of such machines connected in parallel is called a Cluster.

Hadoop consists of three key technologies: HDFS, MapReduce, and YARN.

Hadoop Distributed File System (HDFS)

Basically, HDFS is a file system for data storage. It is designed on the principle of storing a small number of large files rather than a huge number of small files. Data is written once on the server and subsequently read and re-used many times thereafter.

How does HDFS work?

HDFS has two kinds of nodes: a NameNode and DataNodes. The NameNode stores only the metadata of HDFS, which is the directory tree of all files in the file system, and tracks the files across the cluster. The NameNode does not store the actual data or the dataset; the data itself is stored in the DataNodes.

[Figure: HDFS architecture]

HDFS works through a master 'NameNode' and multiple 'DataNodes' on a commodity hardware cluster. The NameNode is the smart node in the cluster: it knows exactly which DataNode contains which blocks and where the DataNodes are located within the machine cluster. The NameNode also manages access to the files, including reads, writes, creates, deletes, and replication of data blocks across different DataNodes. The nodes are organized into racks in the data center. Data is broken down into separate blocks that are distributed among the various DataNodes for storage, and blocks are also replicated across nodes to reduce the likelihood of data loss on failure.
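As a concrete illustration, here is a minimal sketch of a client writing and then reading a file through the HDFS Java API. The NameNode address hdfs://namenode:9000 and the path /demo/hello.txt are hypothetical placeholders; metadata requests go to the NameNode while the bytes themselves move between the client and DataNodes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical NameNode address; use your cluster's fs.defaultFS.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/demo/hello.txt");

                // The client asks the NameNode for metadata; the bytes
                // themselves are streamed to DataNodes as replicated blocks.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.writeUTF("Hello, HDFS!");
                }

                // Reads work the same way: metadata from the NameNode,
                // data from a DataNode that holds the block.
                try (FSDataInputStream in = fs.open(file)) {
                    System.out.println(in.readUTF());
                }
            }
        }
    }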

Important features of HDFS include fault tolerance, high scalability, faster data processing, and high availability.

Fault Tolerance-

Fault tolerance is the ability of a system to keep working under unfavorable conditions. HDFS divides data into blocks and creates multiple copies of each block on different machines in the cluster. If any machine in the cluster goes down, a client can still access the data from another machine that holds a copy of the same blocks.
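The replication factor that drives this behavior is configurable. A minimal sketch, assuming an existing file at the hypothetical path /demo/hello.txt:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default; dfs.replication is normally set in
            // hdfs-site.xml, and 3 is the usual default value.
            conf.set("dfs.replication", "3");

            try (FileSystem fs = FileSystem.get(conf)) {
                // Raise the replication factor of one (hypothetical) file so
                // its blocks are copied onto more DataNodes.
                fs.setReplication(new Path("/demo/hello.txt"), (short) 5);
            }
        }
    }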

High Scalability-

Hadoop is an open-source platform and it runs on industry-standard hardware, which makes it an extremely scalable platform: new nodes can easily be added to the system as data volume and processing needs grow, without altering anything in the existing systems or programs. There are two scalability mechanisms: vertical scalability, adding more resources (CPU, memory, disk) to the existing nodes of the cluster, and horizontal scalability, adding more machines to the cluster.

Faster Data Processing-

Hadoop is extremely good at high-volume batch processing because of its ability to process data in parallel. Hadoop can perform batch processes 10 times faster than a single-threaded server or a mainframe. The Hadoop ecosystem is also robust, as it comes with a suite of tools and technologies that serve the data processing and analytical needs of developers and of small to large organizations.

High Availability-

HDFS is a highly available file system: data is replicated among the nodes in the HDFS cluster by creating replicas of the blocks on the other slaves present in the cluster. Hence, when clients want to access their data, they can read it from whichever slave holds the blocks and is nearest to them in the cluster. If a node fails, clients can easily access their data from other nodes.

What is MapReduce?

MapReduce is the framework responsible for data processing in Hadoop (cluster resource management is handled by YARN, described below). It processes huge amounts of data in parallel by dividing the input dataset into independent chunks. Map tasks process these chunks in parallel, and their output is stored as intermediate data. This output is shuffled and sorted on keys, and the sorted output becomes the input to the reducers. Reducers combine the output of the mappers and produce a reduced result set.

[Figure: MapReduce architecture]

This model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You just need to put the business logic in the way MapReduce works, and the rest will be taken care of by the framework. The work (complete job) submitted by the user to the master is divided into small works (tasks) and assigned to slaves. The steps are summarized below, followed by a small code sketch.

In the Map step,

  • The master node takes the large problem input and slices it into
    smaller sub-problems, then distributes these to worker nodes.
  • A worker node may do this again, leading to a multi-level tree
    structure.
  • Each worker processes its smaller problem and hands the result
    back to the master.

In the Reduce step,

  • The master node takes the answers to the sub-problems and
    combines them in a predefined way to get the output/answer
    to the original problem.
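To make these steps concrete, here is the classic WordCount job written against Hadoop's Java MapReduce API. It is a minimal sketch: the input and output HDFS paths are passed as command-line arguments, and the class names are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map step: split each input line into words and emit (word, 1).
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce step: after shuffle-and-sort, all counts for one word
        // arrive together; sum them to get the word's total.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output HDFS paths are supplied as arguments.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }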

Key Functionalities of Hadoop YARN (Yet Another Resource Negotiator)

YARN is basically for resource management and for job scheduling and monitoring. After the storage is mapped out, YARN figures out how to process it and decides which resources, such as CPU and RAM, are to be assigned to each task.

In YARN, the data-computation framework is formed by the ResourceManager and the NodeManager. The ResourceManager is the system that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.

The ApplicationsManager is responsible for accepting job submissions and provides the service for restarting the ApplicationMaster container on failure. The ApplicationMaster negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors their progress.
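As a small illustration of talking to the ResourceManager, here is a sketch using the YarnClient API to list the applications it currently knows about. It assumes the client-side Hadoop configuration on the classpath points at a running cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class YarnClusterSketch {
        public static void main(String[] args) throws Exception {
            // Create a client for the ResourceManager and start it.
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();

            // Ask the ResourceManager for the applications it is tracking.
            for (ApplicationReport app : yarnClient.getApplications()) {
                System.out.printf("%s %s %s%n",
                        app.getApplicationId(), app.getName(),
                        app.getYarnApplicationState());
            }

            yarnClient.stop();
        }
    }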

What is Apache HBase?

Apache HBase is an open-source, distributed, versioned, non-relational database. It is a distributed storage system for structured data. It is meant to host large tables with billions of rows and potentially millions of columns, running across a cluster of commodity hardware. HBase is a powerful database in its own right that blends real-time query capabilities with the speed of a key/value store and offline or batch processing via MapReduce. In short, HBase supports structured data storage for large tables.
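A minimal sketch of the real-time key/value side using the HBase Java client, assuming a hypothetical table named users with a column family info already exists:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write: rows are keyed, columns live inside column families.
                Put put = new Put(Bytes.toBytes("user-42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("Alice"));
                table.put(put);

                // Real-time read by row key.
                Result result = table.get(new Get(Bytes.toBytes("user-42")));
                byte[] name = result.getValue(Bytes.toBytes("info"),
                        Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }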

Why is Hive Required?

Hive provides analysis of data using a language similar to SQL, so it becomes very easy for SQL developers to learn and write Hive queries. It is used to process structured and semi-structured data in Hadoop. Hive supports analysis of large datasets stored in HDFS. Like SQL, Hive provides a query language, named HiveQL.
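A minimal sketch submitting a HiveQL query from Java over JDBC; the HiveServer2 URL and the sales table are hypothetical placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint.
            String url = "jdbc:hive2://hiveserver:10000/default";

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {

                // HiveQL reads like SQL; under the hood the query is compiled
                // into jobs that run over data stored in HDFS.
                ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }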

What is the importance of Pig?

Pig is a tool/platform used to analyze large data sets by representing them as data flows. It can perform all the common data manipulation operations. To write data analysis programs, Pig provides a high-level language known as Pig Latin. Scripts written in this language, together with user-defined functions for reading, writing, and processing data, are internally converted into Map and Reduce tasks. Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
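Pig Latin scripts can also be driven from Java through the PigServer class. A minimal word-count sketch, assuming a hypothetical input file input.txt with one word per line (LOCAL mode here; ExecType.MAPREDUCE would run on a cluster):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLatinSketch {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Each line of Pig Latin describes one step of the data flow.
            pig.registerQuery("words = LOAD 'input.txt' AS (word:chararray);");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery(
                "counts = FOREACH grouped GENERATE group, COUNT(words);");

            // store() triggers execution; the Pig Engine compiles the flow
            // into Map and Reduce tasks.
            pig.store("counts", "wordcount-out");
        }
    }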

What is MongoDB / NoSQL?

MongoDB is a non-relational database (often called a NoSQL database). It is a document-oriented data store: MongoDB stores data in flexible, JSON-like documents, meaning fields can vary from document to document and the data structure can be changed over time. MongoDB is a distributed database whose key benefits include horizontal scaling, geographic distribution, handling large volumes of rapidly changing structured, semi-structured, and unstructured data, a natural fit with object-oriented programming, a dynamic schema, and being open source.
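To show the flexible document model in practice, here is a minimal sketch using the official MongoDB Java driver; the local connection string and the demo database and users collection are hypothetical.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class MongoSketch {
        public static void main(String[] args) {
            // Hypothetical local MongoDB instance.
            try (MongoClient client =
                    MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("demo").getCollection("users");

                // Flexible, JSON-like documents: these two documents
                // do not need to share the same fields.
                users.insertOne(new Document("name", "Alice").append("age", 30));
                users.insertOne(new Document("name", "Bob").append("city", "Pune"));

                // Query by a field that only some documents have.
                Document found = users.find(new Document("city", "Pune")).first();
                System.out.println(found.toJson());
            }
        }
    }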