Hadoop is a Java-based open source programming
framework that supports the storage and processing of large data sets in a
distributed computing environment.
Hadoop is designed to run on a large number of commodity hardware machines that don’t
share any memory or disks and can
scale up or down without system interruption. It is a top-level project of the Apache Software Foundation.
Hadoop provides three main functions: storage, processing and resource management. Storage is handled by the Hadoop Distributed File System (HDFS), a reliable distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers. Processing (computation) in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated nodes or machines. Resource management is handled by YARN, introduced in Hadoop 2.0, which extends MapReduce by supporting non-MapReduce workloads associated with other programming models.
Hadoop was created by Doug Cutting. The underlying technology was invented by Google to index all the rich textual and structural information it was collecting, and then present meaningful, actionable results to users. Google's innovations were incorporated into Nutch, an open source project, and Hadoop was later spun off from it. Yahoo has played a key role in developing Hadoop for enterprise applications.
Although Hadoop is best known
for MapReduce, HDFS and YARN, the term is also used for a family of related
projects that fall under the umbrella of infrastructure for distributed
computing and large-scale data processing. Below are some of the ASF projects included in a typical Hadoop distribution.
MapReduce: - MapReduce is a
framework for writing applications that process large amounts of structured and
unstructured data in parallel across a cluster of thousands of machines, in a
reliable and fault-tolerant manner.
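To make the model concrete, below is a minimal sketch of the classic word-count job written against the MapReduce Java API; the input and output paths are placeholders passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map tasks run in parallel against the input blocks, and the framework shuffles each word to a reducer that sums its counts; failed tasks are simply re-executed on another node.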
HDFS: - HDFS is a
Java-based file system that provides scalable and reliable data storage that is
designed to span large clusters of commodity servers.
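As a small sketch, the snippet below writes and then reads a file through the HDFS Java FileSystem API; the NameNode address and the path /user/demo/hello.txt are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; normally picked up from core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt");  // hypothetical path

    // Write a file; HDFS replicates its blocks across the cluster.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back and copy the bytes to stdout.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}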
Apache Hadoop YARN: - YARN is a
next-generation framework for Hadoop data processing extending MapReduce
capabilities by supporting non-MapReduce workloads associated with other
programming models.
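As a rough sketch of how a client talks to YARN, the snippet below uses the YarnClient API to ask the ResourceManager for the applications it is currently tracking; it assumes a yarn-site.xml with the ResourceManager address is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
  public static void main(String[] args) throws Exception {
    // Reads the ResourceManager address from yarn-site.xml on the classpath.
    YarnConfiguration conf = new YarnConfiguration();

    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the ResourceManager for every application it is tracking.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.printf("%s %s %s%n",
          app.getApplicationId(), app.getName(), app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}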
Apache Tez: - Tez generalizes the
MapReduce paradigm to a more powerful framework for executing a complex
Directed Acyclic Graph (DAG) of tasks for near real-time big data processing.
Apache Pig: - A platform for
processing and analyzing large data sets. Pig consists of a high-level language
(Pig Latin) for expressing data analysis programs paired with the MapReduce
framework for processing these programs.
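A hedged sketch of embedding Pig Latin in Java via the PigServer API; the input path, field layout and output path are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // Run Pig Latin statements on the cluster via MapReduce
    // (ExecType.LOCAL can be used for quick testing).
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Hypothetical tab-delimited log data in HDFS with a 'level' field.
    pig.registerQuery("logs = LOAD '/data/logs' AS (ts:chararray, level:chararray, msg:chararray);");
    pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
    pig.registerQuery("by_msg = GROUP errors BY msg;");
    pig.registerQuery("counts = FOREACH by_msg GENERATE group, COUNT(errors);");

    // Triggers compilation of the script into MapReduce jobs and writes the result.
    pig.store("counts", "/data/error_counts");
    pig.shutdown();
  }
}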
Apache HCatalog: - A table and
metadata management service that provides a centralized way for data processing
systems to understand the structure and location of the data stored within
Apache Hadoop.
Apache Hive: - Built on the
MapReduce framework, Hive is a data warehouse that enables easy data
summarization and ad-hoc queries via an SQL-like interface for large datasets
stored in HDFS.
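For illustration, here is a minimal sketch that runs an ad-hoc query through the HiveServer2 JDBC driver; the host, credentials and the page_views table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; host, port and credentials are placeholders.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {

      // Ad-hoc SQL-like query over a hypothetical 'page_views' table stored in HDFS;
      // Hive compiles it into cluster jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT country, COUNT(*) AS views FROM page_views GROUP BY country");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}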
Apache HBase: - A column-oriented
NoSQL data storage system that provides random real-time read/write access to
big data for user applications.
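A short sketch of random real-time reads and writes with the HBase 1.x+ Java client; the users table and its info column family are hypothetical, and the ZooKeeper quorum is assumed to come from hbase-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads the ZooKeeper quorum from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         // Hypothetical table 'users' with a column family 'info'.
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Random real-time write: one row keyed by user id.
      Put put = new Put(Bytes.toBytes("user-1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
          Bytes.toBytes("alice@example.com"));
      table.put(put);

      // Random real-time read of the same row.
      Get get = new Get(Bytes.toBytes("user-1001"));
      Result result = table.get(get);
      byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}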
Apache Mahout: - Mahout provides
scalable machine learning algorithms for Hadoop that support clustering,
classification and batch-based collaborative filtering.
Apache Accumulo: - Accumulo is a
high performance data storage and retrieval system with cell-level access
control. It is a scalable implementation of Google's BigTable design that
works on top of Apache Hadoop and Apache ZooKeeper.
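A rough sketch of writing a cell with a visibility expression through the Accumulo 1.x client API; the instance name, ZooKeeper quorum, credentials and the records table are all placeholders.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloExample {
  public static void main(String[] args) throws Exception {
    // Instance name, ZooKeeper quorum, credentials and table name are placeholders.
    ZooKeeperInstance instance = new ZooKeeperInstance("accumulo", "zk1:2181,zk2:2181");
    Connector connector = instance.getConnector("demo_user", new PasswordToken("demo_pass"));

    BatchWriter writer = connector.createBatchWriter("records", new BatchWriterConfig());
    Mutation mutation = new Mutation("row-1");
    // The third argument is the cell-level visibility expression; only scans
    // whose authorizations satisfy "admin|analyst" can read this cell.
    mutation.put("details", "status", new ColumnVisibility("admin|analyst"), "approved");
    writer.addMutation(mutation);
    writer.close();
  }
}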
Apache Flume: - Flume allows you
to efficiently aggregate and move large amounts of log data from many
different sources to Hadoop.
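As an illustration, here is a minimal Flume agent configuration (the agent name, log path and HDFS directory are placeholders) that tails a local log file, buffers events in a memory channel and delivers them to HDFS.

# Name the source, channel and sink for an agent called 'a1'.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail a (hypothetical) application log.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write the events into HDFS.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1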
Apache Sqoop: - Sqoop is a tool
that speeds and eases the movement of data between Hadoop and relational
databases (RDBMS). It provides reliable parallel loads for many popular
enterprise data sources.
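For example, an import of a hypothetical MySQL table into HDFS, split across four parallel map tasks, looks roughly like this (the database, credentials and paths are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/sales/orders \
  --num-mappers 4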
Apache ZooKeeper: - A highly
available system for coordinating distributed processes. Distributed
applications use ZooKeeper to store and mediate updates to important
configuration information.
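A small sketch using the ZooKeeper Java API to store and read back a piece of configuration under a znode; the ensemble addresses and the znode path are placeholders.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Connect to a (placeholder) ZooKeeper ensemble and wait for the session.
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a small piece of configuration under a znode.
    byte[] value = "max_connections=100".getBytes(StandardCharsets.UTF_8);
    zk.create("/demo-config", value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

    // Any process in the cluster can now read (and watch) the same znode.
    byte[] stored = zk.getData("/demo-config", false, null);
    System.out.println(new String(stored, StandardCharsets.UTF_8));

    zk.close();
  }
}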
Apache Ambari: - An open source
installation, lifecycle management, administration and monitoring system for
Apache Hadoop clusters.
Apache Oozie: - Oozie is a Java web
application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs
sequentially into one logical unit of work.
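As a rough sketch, a workflow definition (workflow.xml) that runs a single hypothetical Pig step might look like the following; the workflow name, node names and the script are placeholders.

<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
    <start to="clean-logs"/>

    <!-- Run a (hypothetical) Pig script that cleans the raw logs. -->
    <action name="clean-logs">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>clean_logs.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>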
Apache Falcon: - Falcon is a data
management framework for simplifying data lifecycle management and processing
pipelines on Apache Hadoop. It enables users to configure, manage and orchestrate
data motion, pipeline processing, disaster recovery, and data retention
workflows.
Apache Knox: - The Knox Gateway
(“Knox”) is a system that provides a single point of authentication and access
for Apache Hadoop services in a cluster. The goal of the project is to simplify
Hadoop security for users who access the cluster data and execute jobs, and for
operators who control access and manage the cluster.