Comparing SQL databases and Hadoop
Given that Hadoop
is a framework for processing data, what makes it better than standard
relational databases, the workhorse of data processing in most of today’s
applications? One reason is that SQL (structured query language) is by design
targeted at structured data. Many of Hadoop’s initial applications deal with
unstructured data such as text. From this perspective Hadoop provides a more
general paradigm than SQL.
For working only with structured data, the comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a query language that can be implemented on top of Hadoop as the execution engine. But in practice, the term SQL databases tends to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch to the requirements that Hadoop targets.
With that in mind, let’s make a more detailed comparison of Hadoop with typical SQL databases on specific dimensions.
SCALE-OUT INSTEAD OF SCALE-UP
Scaling commercial relational databases is expensive. Their
design is friendlier to scaling up. To run a bigger database you need to buy a
bigger machine. In fact, it’s not unusual to see server vendors market their
expensive high-end machines as “database-class servers.” Unfortunately, at some
point there won’t be a big enough machine available for the larger data sets.
More importantly, the high-end machines are not cost effective for many
applications. For example, a machine with four times the power of a standard PC
costs a lot more than putting four such PCs in a cluster. Hadoop is designed to
be a scale-out architecture operating on a cluster of commodity PC machines.
Adding more resources means adding more machines to the Hadoop cluster. Hadoop
clusters of ten to hundreds of machines are standard. In fact, other than for
development purposes, there’s no reason to run Hadoop on a single server.
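As a concrete illustration (a sketch assuming a classic Hadoop 1.x-style installation; the hostnames are made up), growing a cluster is largely a matter of appending the new machines to the conf/slaves file, one hostname per line:

    worker01.example.com
    worker02.example.com
    worker03.example.com

and then starting daemons on them from the master node:

    $ bin/start-dfs.sh      # start the HDFS daemons on every listed worker
    $ bin/start-mapred.sh   # start the MapReduce daemons likewise

The new machines join the cluster as their daemons register with the master; no data reloading or application change is required.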
KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES
A fundamental tenet of relational databases is that data resides in tables having
relational structure defined by a schema. Although the relational model has
great formal properties, many modern applications deal with data types that
don’t fit well into this model. Text documents, images, and XML files are
popular examples. Also, large data sets are often unstructured or semistructured.
Hadoop uses key/value pairs as its basic data unit, which is flexible enough to
work with the less-structured data types. In Hadoop, data can originate in any
form, but it eventually transforms into key/value pairs for the processing
functions to work on.
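To make this concrete, here is a minimal mapper sketch in Java (the class name is made up; the types come from the standard org.apache.hadoop.mapreduce API). For plain text input, Hadoop hands each map call one record as a key/value pair: the key is the line’s byte offset in the file, the value is the line itself. The mapper then emits new pairs:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input pair: (byte offset of the line, text of the line).
// Output pairs: (word, 1) for every word found on the line.
public class WordMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit a (key, value) pair
            }
        }
    }
}

Everything downstream works the same way: each stage consumes pairs and produces pairs, with no schema required up front.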
FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)
SQL is fundamentally a high-level declarative language. You query data by stating
the result you want and let the database engine figure out how to derive it.
Under MapReduce you specify
the actual steps in processing the data, which is more analogous to an
execution plan for a SQL engine. Under SQL you have query statements; under
MapReduce you have scripts and code. MapReduce allows you to process data in a
more general fashion than SQL queries. For example, you can build complex
statistical models from your data or reformat your image data. SQL is not well
designed for such tasks.
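To see the difference between the two styles, consider counting word frequencies. In SQL it is a single declarative statement along the lines of SELECT word, COUNT(*) FROM documents GROUP BY word; the engine decides how to execute it. Under MapReduce you spell out the aggregation yourself. A reducer to pair with the mapper sketched earlier might look like this (again illustrative, against the standard org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The framework sorts and groups the mappers' output by key, so each
// reduce call sees one word together with all the 1s emitted for it.
public class WordReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts,
                          Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // (word, total)
    }
}

The procedural form is more work for a simple count, but the same two-function skeleton also accommodates jobs that SQL cannot express, such as the statistical modeling and image reformatting mentioned above.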
On the other
hand, when working with data that do fit well into relational structures, some
people may find MapReduce less natural to use. Those who are accustomed to the
SQL paradigm may find it challenging to think in the MapReduce way. I hope the
exercises and the examples in this book will help make MapReduce programming
more intuitive. But note that many extensions are available to allow one to
take advantage of the scalability of Hadoop while programming in more familiar
paradigms. In fact, some enable you to write queries in a SQL-like language,
and your query is automatically compiled into MapReduce code for execution.
We’ll cover some of these tools in chapters 10 and 11.
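For comparison, even wiring the two small classes above into a runnable job takes a separate driver; the sketch below uses made-up input and output paths, and Job.getInstance is the Hadoop 2-era form of the API. This is exactly the boilerplate that those SQL-like tools generate for you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures and submits the word-count job built from the
// WordMapper and WordReducer sketched earlier.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordMapper.class);
        job.setReducerClass(WordReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input/docs"));       // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/output/counts"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}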
OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS
Hadoop is designed for offline
processing and analysis of large-scale data. It doesn’t work for random reading
and writing of a few records, which is the type of load for online transaction
processing. In fact, as of this writing (and in the foreseeable future), Hadoop
is best used as a write-once, read-many-times type of data store. In this
aspect it’s similar to data warehouses in the SQL world.
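A short sketch of what write-once, read-many looks like against the standard HDFS FileSystem API (the path and the record are made up): you stream a file out and close it, and from then on you read it, typically in bulk scans, but you do not update individual records in place:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/events.log");  // hypothetical path

        // Write once: stream the data out and close the file.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("one record among millions");
        }

        // Read many times, typically in whole-file scans. There is no
        // API here for rewriting an individual record in place.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }
    }
}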
You have seen how Hadoop
relates to distributed systems and SQL databases at a high level. Let’s learn
how to program in it. For that, we need to understand Hadoop’s MapReduce
paradigm.