- Scaling
commercial relational databases is expensive because they scale up:
running a bigger database means buying a bigger machine.
- Hadoop
is designed as a scale-out architecture operating on a cluster of
commodity PC machines. Adding more resources means adding more
machines to the Hadoop cluster; clusters of tens to hundreds of
machines are standard.
- In
an RDBMS, data resides in tables with a relational structure
defined by a schema.
- Hadoop
uses key/value pairs as its basic data unit, which is flexible enough
to work with less-structured data types. In Hadoop, data can originate
in any form, but it is eventually transformed into key/value pairs for
the processing functions to work on.
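To make the key/value model concrete, here is a minimal sketch in plain Python (not the actual Hadoop API): input records are turned into pairs, a map function emits intermediate pairs, and a reduce function collapses all values for a key. The function names and the word-count task are illustrative assumptions, not taken from the text.

```python
def map_fn(_, line):
    # Input pair: (record offset, line of text); emit (word, 1) pairs.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # All intermediate values for one key are reduced to one output pair.
    return word, sum(counts)

def run(lines):
    # Group intermediate pairs by key, then reduce each group --
    # a single-process stand-in for Hadoop's shuffle phase.
    grouped = {}
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            grouped.setdefault(key, []).append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())
```

Calling `run(["Hadoop stores pairs", "pairs of keys and values"])` yields a count of 2 for "pairs" and 1 for each other word; in Hadoop the grouping step is distributed across the cluster rather than done in one dictionary.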
- Under
SQL you have query statements; under MapReduce you have scripts and
programs.
- MapReduce
allows you to process data in a more general fashion than SQL
queries. For example, you can build complex statistical models
from your data or reformat your image data. SQL is not well designed
for such tasks.
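As one example of this generality, the sketch below (plain Python, not Hadoop itself; all names are illustrative) fits a least-squares line y = a*x + b in MapReduce style: each mapper emits partial sums for its split of the data, and the reducer combines them and solves for the coefficients. This kind of free-form numeric computation is awkward to express as a SQL query.

```python
def map_partial(records):
    # Each mapper emits sufficient statistics for its share of (x, y) data.
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return n, sx, sy, sxx, sxy

def reduce_fit(partials):
    # The reducer combines partial sums and solves the normal equations.
    n = sum(p[0] for p in partials)
    sx = sum(p[1] for p in partials)
    sy = sum(p[2] for p in partials)
    sxx = sum(p[3] for p in partials)
    sxy = sum(p[4] for p in partials)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Two "splits" processed independently, then combined -- as Hadoop would
# run mappers on separate machines before a single reduce.
split1 = [(1.0, 2.0), (2.0, 4.0)]
split2 = [(3.0, 6.0), (4.0, 8.0)]
a, b = reduce_fit([map_partial(split1), map_partial(split2)])
```

Because the mappers only emit small fixed-size tuples, this pattern scales to data far larger than any one machine's memory.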
- Hadoop
is designed for offline processing and analysis of large-scale data.
It is not suited to random reads and writes of a few records, which
is the typical workload of online transaction processing.
- Hadoop is best used as a write-once, read-many-times
type of data store. In this aspect it is similar to data warehouses in
the SQL world.
Comparing SQL databases and Hadoop
For working only with structured data, the comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a query language that can be implemented on top of Hadoop as the execution engine. But in practice, "SQL databases" tends to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch for the requirements that Hadoop targets.
With that in mind, let’s make a more detailed comparison of Hadoop with typical SQL databases on specific dimensions.