What is Hadoop



Hadoop is a Java-based open source programming framework that supports the storage and processing of large data sets in a distributed computing environment. Hadoop is designed to run on a large number of commodity hardware machines that don’t share any memory or disks, and it can scale up or down without system interruption. It is a top-level Apache project sponsored by the Apache Software Foundation.
Hadoop consists of three main functions: storage, processing and resource management. Storage is accomplished with the Hadoop Distributed File System (HDFS), a reliable distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers. Processing, or computation, in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated nodes or machines. YARN, introduced in Hadoop 2.0, performs the resource management function and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
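
To make the MapReduce paradigm concrete, here is the classic word-count job written against the Hadoop Java MapReduce API. This is a minimal sketch rather than a production job; the input and output HDFS paths are assumed to be supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split and emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for a given word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Between the two phases the framework shuffles and sorts the intermediate (word, 1) pairs by key, so each reducer sees all counts for a given word; fault tolerance comes from re-executing failed tasks on other nodes.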



Hadoop was created by Doug Cutting. The underlying technology was invented by Google to index all the rich textual and structural information it was collecting, and then present meaningful and actionable results to users. Google’s innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from it. Yahoo has played a key role in developing Hadoop for enterprise applications.
Although Hadoop is best known for MapReduce, HDFS and YARN, the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing. Below are some of the ASF projects included in a typical Hadoop distribution.
MapReduce: - MapReduce is a framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable and fault-tolerant manner. 
HDFS: - HDFS is a Java-based file system that provides scalable and reliable data storage that is designed to span large clusters of commodity servers.
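
Applications typically reach HDFS through the org.apache.hadoop.fs.FileSystem API. The sketch below writes a small file and reads it back; the path /tmp/hello.txt is illustrative, and the cluster address (fs.defaultFS) is assumed to come from a core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from the configuration.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file into HDFS (the path is illustrative).
    Path file = new Path("/tmp/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("Hello, HDFS!\n");
    }

    // Read it back; the blocks may live on any DataNode in the cluster.
    try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}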
Apache Hadoop YARN: - YARN is a next-generation framework for Hadoop data processing, extending MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models.
Apache Tez: - Tez generalizes the MapReduce paradigm to a more powerful framework for executing a complex Directed Acyclic Graph (DAG) of tasks for near real-time big data processing.
Apache Pig: - A platform for processing and analyzing large data sets. Pig consists of a high-level language (Pig Latin) for expressing data analysis programs, paired with the MapReduce framework for executing them.
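
As an illustration, the word count from earlier shrinks to a few lines of Pig Latin. The sketch below runs it from Java via PigServer in local mode; the input file input.txt is an assumption.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // LOCAL mode runs Pig Latin without a cluster; use MAPREDUCE on a real one.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Each registerQuery call adds one Pig Latin statement to the logical plan.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

    // openIterator triggers execution and streams back the results.
    Iterator<Tuple> it = pig.openIterator("counts");
    while (it.hasNext()) {
      System.out.println(it.next());
    }
  }
}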
Apache HCatalog: - A table and metadata management service that provides a centralized way for data processing systems to understand the structure and location of the data stored within Apache Hadoop.
Apache Hive: - Built on the MapReduce framework, Hive is a data warehouse that enables easy data summarization and ad-hoc queries via an SQL-like interface for large datasets stored in HDFS.
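
Because HiveServer2 speaks JDBC, a Hive query can be issued from Java like any other SQL query. In this sketch the host, port, credentials and the weblogs table are all hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Load the HiveServer2 JDBC driver (host, port and database are assumptions).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // An ad-hoc, SQL-like query over a (hypothetical) table stored in HDFS;
      // Hive compiles it into one or more cluster jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}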
Apache HBase: - A column-oriented NoSQL data storage system that provides random real-time read/write access to big data for user applications.
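
A short sketch of the HBase Java client showing the random real-time read/write access described above; the users table and its info column family are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      // Random real-time write: one cell in the "info" column family.
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("user42@example.com"));
      table.put(put);

      // Random real-time read of the same row by key.
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
    }
  }
}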
Apache Mahout: - Mahout provides scalable machine learning algorithms for Hadoop, supporting data science tasks such as clustering, classification and batch-based collaborative filtering.
Apache Accumulo: - Accumulo is a high-performance data storage and retrieval system with cell-level access control. It is a scalable implementation of Google’s Bigtable design that works on top of Apache Hadoop and Apache ZooKeeper.
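
The sketch below shows Accumulo’s distinguishing feature, cell-level access control: each cell is written with a visibility label that a reader’s authorizations must satisfy. The instance name, ZooKeeper quorum, credentials and records table are all assumptions.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloExample {
  public static void main(String[] args) throws Exception {
    // Instance name, ZooKeeper quorum and credentials are illustrative.
    Connector conn = new ZooKeeperInstance("accumulo", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));

    BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
    Mutation m = new Mutation("row1");
    // The third argument is the cell-level visibility label: only scanners
    // whose authorizations satisfy "public" can read this particular cell.
    m.put("attrs", "name", new ColumnVisibility("public"),
        new Value("Alice".getBytes()));
    writer.addMutation(m);
    writer.close(); // flushes buffered mutations to the tablet servers
  }
}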
Apache Flume: - Flume allows you to efficiently aggregate and move large amounts of log data from many different sources to Hadoop.
Apache Sqoop: - Sqoop is a tool that speeds and eases the movement of data between Hadoop and relational databases (RDBMS). It provides reliable parallel loading for various popular enterprise data sources.
Apache ZooKeeper: - A highly available system for coordinating distributed processes. Distributed applications use ZooKeeper to store and mediate updates to important configuration information.
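
A minimal sketch of storing and reading a piece of configuration through the ZooKeeper Java client; the connection string and the /app-config znode are illustrative.

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Block until the session with the ensemble is established.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a piece of configuration under a znode (path is illustrative).
    String path = "/app-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "jdbc:mysql://db:3306/app".getBytes(),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Any process connected to the ensemble reads the same consistent value.
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data));
    zk.close();
  }
}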
Apache Ambari: - An open source installation lifecycle management, administration and monitoring system for Apache Hadoop clusters.
Apache Oozie: - Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Apache Falcon: - Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows.
Apache Knox: - The Knox Gateway (“Knox”) is a system that provides a single point of authentication and access for Apache Hadoop services in a cluster. The goal of the project is to simplify Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
