Organizations
that want to extract more intelligence from their Hadoop deployments
might find help from the relatively little known Tajo open source data
warehouse software, which the Apache Software Foundation has pronounced
as ready for commercial use.
The
new version of Tajo, Apache software for running a data warehouse over
Hadoop data sets, has been updated to provide greater connectivity to
Java programs and third party databases such as Oracle and PostGreSQL.
While less well-known than other Apache big data projects such as Spark or Hive, Tajo could
be a good fit for organizations outgrowing their commercial data
warehouses. It could also be a good fit for companies wishing to analyze
large sets of data stored on Hadoop data processing platforms using
familiar commercial business intelligence tools instead of Hadoop’s
MapReduce framework.
Tajo
performs the necessary ETL (extract-transform-load process) operations
to summarize large data sets stored on an HDFS (Hadoop Distributed File
System). Users and external programs can then query the data through
SQL.
The
latest version of the software, issued Monday, comes with a newly
improved JDBC (Java Database Connectivity) driver that its project
managers say makes Tajo as easy to use as a standard relational database
management system. The driver has been tested against a variety of
commercial business intelligence software packages and other SQL-based
tools.
Other new features include catalogs of built-in SQL commands from both Oracle and PostgreSQL systems.
Like a
growing number of database systems, Tajo now features full support for
JSON (JavaScript Object Notation), easing the process for Web developers
to work with Tajo. Tajo can also work directly with Amazon S3 (Simple
Storage Service)
Gruter, a big data infrastructure startup in South Korea, is leading the charge to develop Tajo. Engineers from Intel, Etsy, NASA, Cloudera and Hortonworks also contribute to the project.
Perhaps
because of its South Korean home base, the software is not very widely
known elsewhere in the world, compared to other open-source SQL-based
Hadoop packages such as Hive or Impala.
At least in one test of
the software, conducted in 2013, Tajo appeared to possess a speed
advantage, according to Gruter. Korea’s SK Telecom telecommunications
firm ran Tajo against 1.7 terabytes worth of data, and found it could complete queries with greater speed than either Hive or Impala, in most instances.
As
with most benchmarks, results may vary according to the specific
workload. New editions of Hive and Impala may have also closed the speed
gap as well.
SK
Telecom uses the software in production duties, as does Korea
University and NASA’s Jet Propulsion Laboratory. The Korean music
streaming service Melon uses the software for analytical processing, and
has found that Tajo executes ETL jobs 1.5 to 10 times faster than Hive.
The Apache Software Foundation provides
support and oversight for more than 350 open source projects, including
Hadoop, the Cassandra NoSQL database and the Apache HTTP server.
No comments:
Post a Comment