Apache Hadoop and Its Distributions

To understand Hadoop deployment in industry, we first have to be aware of its distributors. Distributors of Hadoop are companies that provide Apache Hadoop-based software, support, services, and training to business customers. Multiple vendors exist in the market, among whom the two most widely used Apache Hadoop distributions come from
i.  Cloudera
ii. Hortonworks

Cloudera v. Hortonworks: Tale of the Tape 
Cloudera has plenty to boast about. It has in fact contributed significantly to the open source Apache Hadoop project and its Hadoop distribution is in production at high-profile Web companies like Groupon and Klout. It launched an innovative partner and certification program in September and Cloudera engineers continue to develop new features to help Hadoop meet enterprise-level uptime and security requirements.

In addition, Cloudera has a two-year head start over Hortonworks servicing a small but growing customer base. No question the Hortonworks team learned many valuable lessons working at Yahoo, but supporting an internal Hadoop deployment at one large technology company is a lot different than supporting a large and varied customer base of both technology and non-technology companies. In order for Hortonworks to become a self-sufficient Hadoop support juggernaut, Baldeschwieler’s stated goal, the company needs to prove it can deliver.

Finally, consider the competing Hadoop distributions themselves. Their cores are both based on the open source Apache Hadoop distribution and related sub-projects, with the real differentiation being the installation and administration management add-on tools. Cloudera Management Suite, while proprietary, includes important enterprise-level features such as automated, wizard-based Hadoop deployment capabilities, dashboards for configuration management and a resource management module for capacity and expansion planning. Ambari, Hortonworks' answer to Cloudera Management Suite, is open but is less mature and currently lacks advanced cluster management capabilities.

The reality is that Cloudera’s Hadoop distribution is largely open source and the risk of vendor lock-in due to its relatively few proprietary components is, in Wikibon’s opinion, lower than what Hortonworks marketing implies. Organizations that come to rely on Cloudera Enterprise for crucial parts of the business but later decide to move to a different Hadoop distribution or competing Big Data approach should be able to do so with little difficulty.

That said, Hortonworks’ 100% open approach means that updates and improvements to its distribution are likely to come quicker than those of Cloudera’s distribution and that partners may find it easier to integrate with HDP than with Cloudera Enterprise. These are not insignificant factors for potential customers to consider.


Industry Requirement for Hadoop

Before you dig deeper into Hadoop, you need to know why it is creating so much buzz in today's industries. I will try to explain the "Industry Requirement for Hadoop" in the simplest terms; since this blog is meant for beginners and novices, I won't drill down into the complex data-science and custom data-mining algorithms that give developers ample power to tweak their data warehouse on the fly.

Hadoop is the poster child for Big Data, so much so that the open source data platform has become practically synonymous with the wildly popular term for storing and analyzing huge sets of information.

While Hadoop is not the only Big Data game in town, the software has had a remarkable impact. But exactly why has Hadoop been such a major force in Big Data? What makes this software so damn special - and so important?

Sometimes the reasons behind something's success can be staring you right in the face. For Hadoop, the biggest motivator in the market is simple: before Hadoop, data storage was expensive.

Hadoop, however, lets you store as much data as you want in whatever form you need, simply by adding more servers to a Hadoop cluster. Each new server (which can be commodity x86 machines with relatively small price tags) adds more storage and more processing power to the overall cluster. This makes data storage with Hadoop far less costly than prior methods of data storage.


Spendy Storage Created The Need For Hadoop

We're not talking about data storage in terms of archiving… that's just putting data onto tape. Companies need to store increasingly large amounts of data and be able to easily get to it for a wide variety of purposes. That kind of data storage was, in the days before Hadoop, pricey.

And, oh what data there is to store. Enterprises and smaller businesses are trying to track a slew of data sets: emails, search results, sales data, inventory data, customer data, click-throughs on websites… all of this and more is coming in faster than ever before, and trying to manage it all in a relational database management system (RDBMS) is a very expensive proposition.

Historically, organizations trying to manage costs would sample that data down to a smaller subset. This down-sampled data would automatically carry certain assumptions, number one being that some data is more important than other data. For example, a company depending on e-commerce data might prioritize its data on the (reasonable) assumption that credit card data is more important than product data, which in turn would be more important than click-through data.

Assumptions Can Change

That's fine if your business is based on a single set of assumptions. But what happens if the assumptions change? Any new business scenarios would have to use the down-sampled data still in storage, the data retained based on the original assumptions. The raw data would be long gone, because it was too expensive to keep around. That's why it was down-sampled in the first place.

Expensive RDBMS-based storage also led to data being siloed within an organization. Sales had its data, marketing had its data, accounting had its own data and so on. Worse, each department may have down-sampled its data based on its own assumptions. That can make it very difficult (and misleading) to use the data for company-wide decisions.

Hadoop: Breaking Down The Silos

Hadoop's storage method uses a distributed filesystem that maps data wherever it sits in a cluster on Hadoop servers. The tools to process that data are also distributed, often located on the same servers where the data is housed, which makes for faster data processing.

Hadoop, then, allows companies to store data much more cheaply. How much more cheaply? In 2012, Rainstor estimated that running a 75-node, 300TB Hadoop cluster would cost $1.05 million over three years. In 2008, Oracle sold a database with a little over half the storage (168TB) for $2.33 million - and that's not including operating costs. Throw in the salary of an Oracle admin at around $95,000 per year, and you're talking an operational cost of $2.62 million over three years - 2.5 times the cost, for just over half of the storage capacity.
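
Spelled out with those figures, the comparison works out roughly like this:

  Oracle (168TB):  $2.33M purchase price + 3 years x $95K admin salary ≈ $2.62M over three years
  Hadoop (300TB):  ≈ $1.05M over three years
  Ratio:           $2.62M / $1.05M ≈ 2.5x the cost, for roughly 56% of the storage capacity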

This kind of price savings means Hadoop lets companies afford to hold all of their data, not just the down-sampled portions. Fixed assumptions don't need to be made in advance. All data becomes equal and equally available, so business scenarios can be run with raw data at any time as needed, without limitation or assumption. This is a very big deal, because if no data needs to be thrown away, any data model a company might want to try becomes fair game.

That scenario is the next step in Hadoop use, explained Doug Cutting, Chief Architect of Cloudera and an early Hadoop pioneer. "Now businesses can add more data sets to their collection," Cutting said. "They can break down the silos in their organization."

More Hadoop Benefits

Hadoop also lets companies store data as it comes in - structured or unstructured - so you don't have to spend money and time configuring data for relational databases and their rigid tables. Since Hadoop can scale so easily, it can also be the perfect platform to catch all the data coming from multiple sources at once.

Hadoop's most touted benefit is its ability to store data much more cheaply than can be done with RDBMS software. But that's only the first part of the story. The capability to catch and hold so much data so cheaply means businesses can use all of their data to make more informed decisions.

Hadoop Today

If you just pay attention to the world’s largest Hadoop users, you might think the platform is just a better technology for powering search engines or analyzing customer behavior for ad-serving. Of course that’s not the case, but finding those broader use cases can still be kind of difficult. That’s too bad, because the more we highlight what’s possible, the easier it will be to discover entirely new uses.
Cloudera COO Kirk Dunn as well as a few panelists noted some of those use cases. I’ve uncovered a few throughout my years of covering the Hadoop space, too. With that in mind, here are 10 use cases, but I know there are a lot more floating about — feel free to share them in the comments.

1. Online travel. Dunn noted that Cloudera’s Hadoop distribution currently powers about 80 percent of all online travel booked worldwide. He didn’t mention users by name, but one of those customers, Orbitz Worldwide, is known to use Hadoop.

2. Mobile data. This is another of Dunn’s anonymous statistics — that Cloudera powers “70 percent of all smartphones in the U.S.” I assume he’s talking about the storage and processing of mobile data by wireless providers, and a little market-share math probably could help one pinpoint the customers.

3. E-commerce. More anonymity, but Dunn says Cloudera powers more than 10 million online merchants in the United States. Dunn said one large retailer (I assume eBay, which is a major Hadoop user and manages a large marketplace of individual sellers that would help account for those 10-plus million merchants) added 3 percent to its net profits after using Hadoop for just 90 days.

4. Energy discovery. During a panel at Cloudera’s event, a Chevron representative explained just one of many ways his company uses Hadoop: to sort and process data from ships that troll the ocean collecting seismic data that might signify the presence of oil reserves.

5. Energy savings. At the other end of the spectrum from Chevron is Opower, which uses Hadoop to power its service that suggests ways for consumers to save money on energy bills. A representative on the panel noted that certain capabilities, such as accurate, long-term bill forecasting, were hardly feasible without Hadoop.

6. Infrastructure management. This is a rather common use case, actually, as more companies (including Etsy, which I profiled recently) are gathering and analyzing data from their servers, switches and other IT gear. At the Cloudera event, a NetApp rep noted how his company collects device logs (it has more than a petabyte’s worth at present) from its entire installed base and stores them in Hadoop.

7. Image processing. A startup called Skybox Imaging is using Hadoop to store and process the high-definition images its satellites will regularly capture as they attempt to detect patterns of geographic change. Skybox recently raised $70 million for its efforts.

8. Fraud detection. This is another oldie but goodie, used by both financial services organizations and intelligence agencies. One of those users, Zions Bancorporation, explained to me recently how a move to Hadoop lets it store all the data it can on customer transactions and spot anomalies that might suggest fraudulent behavior.

9. IT security. As with infrastructure management, companies also use Hadoop to process machine-generated data that can identify malware and cyber attack patterns. One such company is ipTrust, which uses Hadoop to assign reputation scores to IP addresses, letting other security products decide whether to accept traffic from those sources.

10. Health care. I suspect there are many ways Hadoop can benefit health care practitioners, but one of them goes back to its search roots. One example is Apixio, which uses Hadoop to power a service that leverages semantic analysis to give doctors, nurses and others more-relevant answers to their questions about patients’ health.


Hadoop Tomorrow

In some ways, Hadoop is like a fine wine: It gets better with age as rough edges (or flavor profiles) are smoothed out, and those who wait to consume it will probably have a better experience. The only problem with this is that Hadoop exists in a world that’s more about MD 20/20 than it is about Relentless Napa Valley 2008: Companies often want to drink their big data fast, get drunk on insights, and then have some more — maybe something even stronger. And with data — unlike technology and tannins — it turns out older isn’t always better.

That’s a crude analogy, of course, but it gets at the essence of what’s currently plaguing Hadoop adoption and what will propel it forward in the next couple years. The work being done by companies like Cloudera and Hortonworks at the distribution level is great and important, as is MapReduce as a processing framework for certain types of batch workloads. But not every company can afford to be concerned about managing Hadoop on a day-to-day basis. And not every analytic job pairs well with MapReduce.


If there’s one big Hadoop theme at our Structure: Data conference March 20-21 in New York, it’s the new realization that people shouldn’t be asking “What’s next after Hadoop?” but rather “What will Hadoop become next?”. Based on what’s transpiring today, the answer to that question is that Hadoop will become faster in all regards and more useful as a result.

Interactivity, big-data-style


SQL is what’s next for Hadoop, and that’s not because of familiarity alone or the types of queries permitted by SQL on relational data. It’s also because the types of massively parallel processing engines developed to analyze relational data over the years are very fast. That means analysts can ask questions and get answers at speeds much closer to the speed of their intuitions than is possible when querying entire data sets using standard MapReduce.

But just as SQL and its processing techniques bring something to Hadoop, Hadoop (the Hadoop Distributed File System, specifically) brings something to the table, too. Namely, it brings scale and flexibility that don’t exist in the traditional data warehouse world, where new hardware and licenses can be expensive; so only the “valuable” data makes its way inside and only after it has been fitted to a pre-defined structure. Hadoop, on the other hand, provides virtually unlimited scale and schema-free storage, so companies can store however much information they want in whatever format they want and worry later about what they’ll actually use it for. (Actually, though, most Hadoop jobs do require some sort of structure in order to run, and Hadoop co-creator Mike Cafarella is working on a project called RecordBreaker that aims to automate this process for certain data types.)

How hot is the SQL-on-Hadoop space? I profiled the companies and projects working on it on Feb. 21, and since then EMC Greenplum announced a completely rewritten Hadoop distribution that fuses its analytic database to Hadoop, and an entirely new player called JethroData emerged along with $4.5 million in funding. Even if there’s a major shakeout, there will be a few lucky companies left standing to capitalize on a shift to Hadoop as the center of data gravity that EMC Greenplum’s Scott Yara (albeit a biased source) thinks will be the data equivalent of the mainframe’s demise.
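
To make the SQL-on-Hadoop idea concrete, here is a minimal sketch of running an interactive-style query over data sitting in HDFS through Hive's JDBC interface. It assumes a running HiveServer2 endpoint; the host, user and clickstream table are hypothetical placeholders and are not tied to any particular vendor's engine discussed above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlOnHadoopSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver, shipped with the Hive client libraries.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical endpoint and database; adjust host, port and credentials for a real cluster.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoop-master:10000/default", "analyst", "");

        // An ad-hoc aggregate over raw click data stored in HDFS -- no warehouse
        // schema beyond the Hive table definition is needed up front.
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page ORDER BY hits DESC LIMIT 10");

        while (rs.next()) {
            System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
        }
        conn.close();
    }
}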

This is your database. This is your database on HDFS

The SQL versus NoSQL debate appears to be dying down as companies and developers begin to realize there’s definitely a place for both in most environments, but a new debate — with Hadoop at the center — might be about to start up. At its core is the concept of data gravity and the large, attractive (in a gravitational sense) entity that is HDFS. Here’s the underlying question that might be posed: If I’m already storing my unstructured data in HDFS and am expected to replace my data warehouse with it, too, why would I also run a handful of other databases that require a separate data store?

This is in part why HBase has attracted such a strong following despite its relative technical and commercial immaturity compared with comparable NoSQL database Cassandra. For applications that would benefit from a relational database, startups such as Drawn to Scale and Splice Machine have turned HBase into a transactional SQL system. Wibidata, the new startup from Cloudera Co-founder Christophe Bisciglia and Aaron Kimball, is pushing an open source framework called Kiji to make it easier to develop applications that use HBase.

“If you talk to anyone from Cloudera or any of the platform vendors, I think they will tell you that a large percentage of their customers use HBase,” Bisciglia said. “It’s something that I only expect to see increasing.”

MapR seems to think so, too: the Hadoop-distribution vendor is getting ahead of the game by selling an enterprise-grade version of HBase called M7. Should hot startups such as TempoDB and Ayasdi decide to take their HBase-reliant cloud services into the data center, they’ll tap into Hadoop clusters, too.

And the National Security Agency built Apache Accumulo, a key-value database similar to HBase but designed for fine-grained security and massive scale. It’s now being sold commercially by a startup called Sqrrl. There’s even a graph-processing project called Giraph that relies on HBase or Accumulo as the database layer.
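
For a feel of what running a database directly on HDFS looks like, here is a minimal sketch using the HTable-based HBase Java client API of that era. It assumes an HBase cluster is up, hbase-site.xml is on the classpath, and a hypothetical "users" table with a "profile" column family already exists.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath to locate ZooKeeper and the cluster.
        Configuration conf = HBaseConfiguration.create();

        // Hypothetical table "users" with a column family "profile".
        HTable table = new HTable(conf, "users");

        // Write one row: HBase persists it in HDFS-backed files, no fixed schema required.
        Put put = new Put(Bytes.toBytes("user-42"));
        put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                Bytes.toBytes("someone@example.com"));
        table.put(put);

        // Read it back with a point lookup -- the low-latency access pattern
        // that batch MapReduce alone cannot offer.
        Result result = table.get(new Get(Bytes.toBytes("user-42")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"))));

        table.close();
    }
}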

Whatever “real-time” means to you

Real-time is one of those terms that means different things to different people and different applications. The interactivity that SQL-on-Hadoop technologies promise is one definition, as is the type of stream processing enabled by technologies like Storm. When it comes to the latter, there’s a lot of excitement around YARN as the innovation that will make it happen.

YARN, aka MapReduce 2.0, is a resource scheduler and distributed application framework that allows Hadoop users to run processing paradigms other than MapReduce. This could mean many things, from traditional parallel-processing methods such as MPI to graph processing to newly developed stream-processing engines such as Storm and S4. Considering how many years Hadoop meant just HDFS and MapReduce, this type of flexibility is certainly a big deal.

Stream processing, of course, is the antithesis of batch processing, for which Hadoop is known, and which is inherently too slow for workloads such as serving real-time ads or monitoring sensor data. And even if Storm and other stream-processing platforms somehow don’t make their way onto Hadoop clusters, a startup called HStreaming has made it its mission to deliver stream processing to Hadoop, and it’s on other companies’ radars, as well.

For what it’s worth, though, Verti Cloud Founder and CEO and former Yahoo CTO Raymie Stata thinks we should do away with terms such as batch, real-time and interactive altogether. Instead, he prefers the terms synchronous and asynchronous to describe the human experience with the data rather than the speed of processing it. Synchronous computing happens at the speed of human activity, generally speaking, while asynchronous computing is largely decoupled from the idea of someone sitting in front of a computer screen awaiting a result.

The change in terms is associated with a change in how you manage SLAs for applications. Uploading photos to Flickr: synchronous. Running a MapReduce job: most likely asynchronous. Ironically, according to Stata, stream processing data with Storm is often asynchronous, too. That’s because there’s probably not someone on the other end waiting for a page to update or a query to return. And unless you’re doing something where guaranteed real-time latency is necessary, the occasional difference between milliseconds and 1 second probably isn’t critical.


Time to insight starts at the planning phase

Even when MapReduce is the answer, though, not everyone is game for a long Hadoop deployment process coupled with a consulting deal to identify uses and build applications or workflows. Sometimes, you just want to buy some software and get going.

Already, companies such as Wibidata and Continuuity are trying to make it easier for companies to build Hadoop applications specific to their own needs, and Wibidata’s Bisciglia said his company is doing less and less customization the more it deals with customers in the same vertical markets. “I think it’s still a couple years out before you can buy a generic application that runs on Hadoop,” he said, but he does see opportunity for billion-dollar businesses at this level, possibly selling the Hadoop equivalent of an ERP or CRM application.


And Cloudera CEO Mike Olson told the audience at our Structure: Data conference last year that he’ll connect startups trying to build Hadoop-based applications with funding opportunities. In fact, Cloudera backer Accel Partners launched a Big Data Fund in 2011 with the sole purpose of funding application-level big data startups.

But maybe Cloudera, like database vendor Oracle before it, will just get into the application space itself. According to Hadoop creator and Cloudera chief architect Doug Cutting:

“I wouldn’t be surprised if you see vendors, like Cloudera, starting to creep up the stack and sell some applications. You’ve seen that before from Red Hat, from Oracle. You could argue that the relational database is a platform for Oracle and they’ve sold a lot of applications on top. So I think that happens as the market matures. When it’s young, we don’t want to stomp on potential collaborators at this point, we want to open that up to other people to really enhance the platform.”
Cloud computing is proving to be a big help in getting Hadoop projects off the ground, too. Even low-level services such as Amazon Elastic MapReduce can ease the burden of managing a physical Hadoop cluster, and there are already a handful of cloud services exposing Hadoop as a SaaS application for business intelligence and analytics. The easier it gets to store, process and analyze data in the cloud, the more appealing Hadoop looks to potential users who can’t be bothered to invest in yet another IT project.

Google (and Microsoft): A guiding light

Lest we forget, Hadoop is based on a set of Google technologies, and it seems likely its future will also be influenced by what Google is doing. Already, improvements to HDFS seem to mirror changes to the Google File System a few years back, and YARN will enable some new types of non-MapReduce processing similar to what Google’s new Percolator framework does. (Google claims Percolator lets it “process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.”) The MapR-led Apache Drill project is a Hadoop-based version of Google’s Dremel tool; Giraph was likely inspired by Google’s Pregel graph-processing technology.

Cutting is particularly excited about Google Spanner, a database system that spans data geographies while still maintaining transactional consistency. “It’s a matter of time before somebody implements that in the Hadoop ecosystem,” he said. “That’s a huge change.”

It’s possible Microsoft could be an inspiration to the Hadoop community, too, especially if it begins to surface pieces of its Bing search infrastructure as products, as a couple of company executives have told me it will. Bing runs on a combination of tools called Cosmos, Tiger and Scope, and it’s part of the Online Services division run by former Yahoo VP and Hadoop backer Qi Lu. Lu said that Microsoft (like Google) is looking beyond just search — Hadoop’s original function — and into building an information fabric that changes how data is indexed, searched for and presented.

However it evolves, though, it’s becoming pretty obvious that Hadoop is no longer just a technology for doing cheap storage and some MapReduce processing. “I think there’s still some doubt in people’s minds about whether Hadoop is a flash in the pan … and I think they’re missing the point,” Cutting said. “I think that’s going to be proven to people in the next year.”

Hadoop Framework

The Apache Hadoop framework is composed of the following modules:

  • Hadoop Common - contains the libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
  • Hadoop MapReduce - a programming model for large-scale data processing.

All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework.
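
To make the MapReduce programming model concrete, here is the canonical WordCount example written against the org.apache.hadoop.mapreduce API; the input and output locations are assumed to be HDFS paths supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, this would typically be launched with something like hadoop jar wordcount.jar WordCount /input /output, where /input and /output are HDFS paths.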

Architecture

Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed to start Hadoop. The package also provides source code, documentation and a contribution section that includes projects from the Hadoop Community.
For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is. Hadoop applications can use this information to run work on the node where the data is, and, failing that, on the same rack/switch, reducing backbone traffic. HDFS uses this method when replicating data to try to keep different copies of the data on different racks. The goal is to reduce the impact of a rack power outage or switch failure, so that even if these events occur, the data may still be readable.

A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. These are normally used only in nonstandard applications. Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard start-up and shutdown scripts require Secure Shell (ssh) to be set up between nodes in the cluster.
In a larger cluster, the HDFS is managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thus preventing file-system corruption and reducing loss of data. Similarly, a standalone JobTracker server can manage job scheduling. In clusters where the Hadoop MapReduce engine is deployed against an alternate file system, the NameNode, secondary NameNode and DataNode architecture of HDFS is replaced by the file-system-specific equivalent.

Hadoop distributed file system
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode, plus a cluster of datanodes that together form the HDFS cluster; not every node in the cluster needs to run a datanode. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses the TCP/IP layer for communication. Clients use remote procedure calls (RPC) to communicate with one another.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (though some RAID configurations are still useful for increasing I/O performance). With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The tradeoff of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append.
HDFS added high-availability capabilities, as announced for release 2.0 in May 2012, allowing the main metadata server (the NameNode) to be failed over manually to a backup in the event of failure. The project has also started developing automatic fail-over.
The HDFS file system includes a so-called secondary namenode, which misleads some people into thinking that when the primary namenode goes offline, the secondary namenode takes over. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple name-spaces served by separate namenodes.
An advantage of using HDFS is data awareness between the job tracker and task tracker.
The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example: if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems this advantage is not always available. This can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.
HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write-operations.
Another limitation of HDFS is that it cannot be mounted directly by an existing operating system. Getting data into and out of the HDFS file system, an action that often needs to be performed before and after executing a job, can be inconvenient. A Filesystem in Userspace (FUSE) virtual file system has been developed to address this problem, at least for Linux and some other Unix systems.
File access can be achieved through the native Java API, the Thrift API to generate a client in the language of the users' choosing (C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, or browsed through the HDFS-UI webapp over HTTP.
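
As a small illustration of the native Java API mentioned above, the following sketch writes a file into HDFS and reads it back; the fs.defaultFS address and the path are placeholders for a real cluster (normally the namenode address is picked up from core-site.xml).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write: the client streams blocks to datanodes; replication
        // (default factor 3) is handled by HDFS itself.
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("Hello, HDFS");
        out.close();

        // Read: blocks are fetched from whichever datanodes hold replicas.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}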

JobTracker and TaskTracker: the MapReduce engine

Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns off a separate Java Virtual Machine process to prevent the TaskTracker itself from failing if the running job crashes the JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser.
If the JobTracker failed on Hadoop 0.20 or earlier, all ongoing work was lost. Hadoop version 0.21 added some checkpointing to this process; the JobTracker records what it is up to in the file system. When a JobTracker starts up, it looks for any such data, so that it can restart work from where it left off.
Known limitations of this approach are:
The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots. Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
If one TaskTracker is very slow, it can delay the entire MapReduce job - especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
Scheduling
By default Hadoop uses FIFO scheduling, with five optional scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, adding the ability to plug in an alternate scheduler (such as the Fair Scheduler or the Capacity Scheduler); a brief configuration sketch follows the two scheduler descriptions below.
Fair scheduler
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and QoS for production jobs. The fair scheduler has three basic concepts:
  • Jobs are grouped into pools.
  • Each pool is assigned a guaranteed minimum share.
  • Excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, and a limit on the number of running jobs.
Capacity scheduler
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features that are similar to those of the fair scheduler:
  • Jobs are submitted into queues.
  • Queues are allocated a fraction of the total resource capacity.
  • Free resources are allocated to queues beyond their total capacity.
  • Within a queue, a job with a high level of priority has access to the queue's resources.
  • There is no preemption once a job is running.
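
As a rough sketch of how these schedulers are wired in under the classic (MR1) engine: the scheduler implementation is named in the JobTracker's configuration, and each job can then name its pool or queue. The property names below are the MR1-era ones, and the pool and queue names are hypothetical.

import org.apache.hadoop.conf.Configuration;

public class SchedulerConfigSketch {
    public static void main(String[] args) {
        // Cluster side: equivalent to entries in the JobTracker's mapred-site.xml.
        Configuration jobTrackerConf = new Configuration();
        jobTrackerConf.set("mapred.jobtracker.taskScheduler",
                "org.apache.hadoop.mapred.FairScheduler");
        // ...or "org.apache.hadoop.mapred.CapacityTaskScheduler" for the capacity scheduler.

        // Job side: route a submitted job to a fair-scheduler pool or a capacity-scheduler queue.
        Configuration jobConf = new Configuration();
        jobConf.set("mapred.fairscheduler.pool", "production");  // hypothetical pool name
        jobConf.set("mapred.job.queue.name", "reporting");       // hypothetical queue name

        System.out.println("Scheduler: " + jobTrackerConf.get("mapred.jobtracker.taskScheduler"));
    }
}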


YARN Framework




Resource Manager :- The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. It has two main components: the Scheduler and the Applications Manager.
Scheduler :- The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues etc. It is a pure scheduler in the sense that it performs no monitoring or tracking of application status. It schedules based on the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk and network; in the first version, only memory is supported. The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current MapReduce schedulers, such as the Capacity Scheduler and the Fair Scheduler, are examples of such plug-ins.
Applications Manager :- The Applications Manager is responsible for accepting job submissions, negotiating the first container for executing the application-specific Application Master, and providing the service for restarting the Application Master container on failure.
Node Manager :- The Node Manager is a per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the Resource Manager/Scheduler.
Application Master :- The per-application Application Master is, in effect, a framework-specific library tasked with negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the tasks. It has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring progress.
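
To show how an application-side client talks to these components, here is a minimal sketch that connects to the Resource Manager through the YarnClient API and reports cluster state. A real application would go on to submit an application-specific Application Master via the Applications Manager; that part is only indicated in a comment.

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClientSketch {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath to locate the Resource Manager.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager about the Node Managers it knows of.
        System.out.println("Running NodeManagers: "
                + yarnClient.getYarnClusterMetrics().getNumNodeManagers());
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + "  capacity=" + node.getCapability());
        }

        // A full application would now call yarnClient.createApplication(),
        // fill in an ApplicationSubmissionContext (including the container that
        // launches the Application Master), and submit it -- omitted in this sketch.

        yarnClient.stop();
    }
}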