Wednesday, May 6, 2015

What is the difference between Hadoop and RDBMS?


At a very high level, SQL (structured query language) is by design targeted at structured data, whereas most of Hadoop's initial applications deal with unstructured data such as text.

The following is a more detailed comparison of Hadoop with SQL databases on specific dimensions:

  • Scaling commercial relational databases is expensive because to run a bigger database you need to buy a bigger machine.
  • Hadoop is designed as a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster. Clusters of ten to hundreds of machines are standard.
  • In RDBMS, data resides in tables having relational structure defined by a schema.
  • Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with the less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.
  • Under SQL you have query statements; under MapReduce you have scripts and code.
  • MapReduce allows you to process data in a more general fashion than SQL queries. For example, you can build complex statistical models from your data or reformat your image data. SQL is not well designed for such tasks.
  • Hadoop is designed for offline processing and analysis of large-scale data. It does not work for random reading and writing of a few records, which is the type of load for online transaction processing.
  • Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it is similar to data warehouses in the SQL world.

Comparing SQL databases and Hadoop


Given that Hadoop is a framework for processing data, what makes it better than standard relational databases, the workhorse of data processing in most of today’s applications? One reason is that SQL (structured query language) is by design targeted at structured data. Many of Hadoop’s initial applications deal with unstructured data such as text. From this perspective Hadoop provides a more general paradigm than SQL.

For working only with structured data, the comparison is more nuanced. In principle, SQL and Hadoop can be complementary, as SQL is a query language that can be implemented on top of Hadoop as the execution engine. But in practice, SQL databases tend to refer to a whole set of legacy technologies, with several dominant vendors, optimized for a historical set of applications. Many of these existing commercial databases are a mismatch to the requirements that Hadoop targets.

With that in mind, let’s make a more detailed comparison of Hadoop with typical SQL databases on specific dimensions.
SCALE-OUT INSTEAD OF SCALE-UP

Scaling commercial relational databases is expensive. Their design is friendlier to scaling up. To run a bigger database you need to buy a bigger machine. In fact, it’s not unusual to see server vendors market their expensive high-end machines as “database-class servers.” Unfortunately, at some point there won’t be a big enough machine available for the larger data sets. More importantly, the high-end machines are not cost effective for many applications. For example, a machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster. Hadoop is designed to be a scale-out architecture operating on a cluster of commodity PC machines. Adding more resources means adding more machines to the Hadoop cluster. Hadoop clusters of ten to hundreds of machines are standard. In fact, other than for development purposes, there’s no reason to run Hadoop on a single server.

KEY/VALUE PAIRS INSTEAD OF RELATIONAL TABLES

A fundamental tenet of relational databases is that data resides in tables having relational structure defined by a schema. Although the relational model has great formal properties, many modern applications deal with data types that don’t fit well into this model. Text documents, images, and XML files are popular examples. Also, large data sets are often unstructured or semi-structured. Hadoop uses key/value pairs as its basic data unit, which is flexible enough to work with the less-structured data types. In Hadoop, data can originate in any form, but it eventually transforms into (key/value) pairs for the processing functions to work on.

FUNCTIONAL PROGRAMMING (MAPREDUCE) INSTEAD OF DECLARATIVE QUERIES (SQL)

SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code. MapReduce allows you to process data in a more general fashion than SQL queries. For example, you can build complex statistical models from your data or reformat your image data. SQL is not well designed for such tasks.
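To make the contrast concrete, below is a minimal word-count sketch. Hadoop's native MapReduce API is Java, but Hadoop Streaming lets any executable act as the mapper and reducer, so the sketch uses Python; the file names mapper.py and reducer.py are illustrative, not from the text above. Note how every record moves through the job as a key/value pair.

#!/usr/bin/env python3
# mapper.py -- reads raw text from stdin, emits one (word, 1) pair per word.
import sys

for line in sys.stdin:
    for word in line.split():
        # Hadoop Streaming treats tab-separated stdout lines as key/value pairs.
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- receives pairs grouped and sorted by key, sums the counts per word.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(n)
if current_word is not None:
    print(f"{current_word}\t{count}")

You would submit the pair with the hadoop-streaming JAR that ships with your distribution (its exact path varies), passing -mapper mapper.py and -reducer reducer.py along with the input and output directories. Notice there is no schema anywhere: the “query” is ordinary procedural code over key/value pairs.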

On the other hand, when working with data that do fit well into relational structures, some people may find MapReduce less natural to use. Those who are accustomed to the SQL paradigm may find it challenging to think in the MapReduce way. I hope the exercises and the examples in this book will help make MapReduce programming more intuitive. But note that many extensions are available to allow one to take advantage of the scalability of Hadoop while programming in more familiar paradigms. In fact, some enable you to write queries in a SQL-like language, and your query is automatically compiled into MapReduce code for execution. We’ll cover some of these tools in chapters 10 and 11.
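One well-known example of such an extension is Hive, which accepts a SQL-like language (HiveQL) and compiles it into MapReduce jobs. The sketch below is illustrative only: it assumes a running HiveServer2 at localhost:10000, the third-party PyHive package, and a pre-existing table named docs with a word column; none of these names come from the text above.

from pyhive import hive  # third-party package: pip install pyhive

# Connect to a (hypothetical) HiveServer2 instance.
conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# A declarative, SQL-like query; Hive compiles it into MapReduce jobs behind the scenes.
cursor.execute("SELECT word, COUNT(*) AS n FROM docs GROUP BY word")
for word, n in cursor.fetchall():
    print(word, n)

The point is the division of labor: you state the result you want, and the engine derives the MapReduce execution plan for you.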

OFFLINE BATCH PROCESSING INSTEAD OF ONLINE TRANSACTIONS

Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t work for random reading and writing of a few records, which is the type of load for online transaction processing. In fact, as of this writing (and in the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this aspect it’s similar to data warehouses in the SQL world.


You have seen how Hadoop relates to distributed systems and SQL databases at a high level. Let’s learn how to program in it. For that, we need to understand Hadoop’s MapReduce paradigm.

Thursday, February 25, 2010

About SAS

SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers improve performance and deliver value by making better decisions faster. SAS gives you THE POWER TO KNOW®.

What is the SAS System?

The SAS System, better known as the Statistical Analysis System, is one of the most widely used and flexible tools for data processing, reporting, and analysis.

SAS is a set of solutions for enterprise-wide business users as well as a powerful fourth-generation programming language for performing tasks or analyses in a variety of realms such as these:


Analytic Intelligence
General | Data Mining and Statistical Analysis | Forecasting & Econometrics | Operations Research | Quality Improvement

Business Intelligence
General | Applications Development | Content Delivery | Query and Reporting

Data Warehousing
General | ETL & Data Quality | Warehouse Management

The core of the SAS System is Base SAS software, which consists of the following:

SAS language

a programming language that you use to manage your data.

SAS procedures

software tools for data analysis and reporting.

macro facility

a tool for extending and customizing SAS software programs and for reducing text in your programs.

DATA step debugger

a programming tool that helps you find logic problems in DATA step programs.

Output Delivery System (ODS)

a system that delivers output in a variety of easy-to-access formats, such as SAS data sets, listing files, or Hypertext Markup Language (HTML).

SAS windowing environment

an interactive, graphical user interface that enables you to easily run and test your SAS programs.

(Source: SAS.com)

With Base SAS software as the foundation, you can integrate many SAS business solutions that enable you to perform large-scale business functions, such as data warehousing and data mining, human resources management and decision support, financial management and decision support, and others.

SAS Language Elements

The SAS language consists of statements, expressions, options, formats, and functions similar to those of many other programming languages. In SAS, you use these elements within one of two groups of SAS statements:

  • DATA steps
  • PROC steps.

A DATA step consists of a group of statements in the SAS language that can

  • read data from external files
  • write data to external files
  • read SAS data sets and data views
  • create SAS data sets and data views.

Once your data is accessible as a SAS data set, you can analyze the data and write reports by using a set of tools known as SAS procedures.

A group of procedure statements is called a PROC step. SAS procedures analyze data in SAS data sets to produce statistics, tables, reports, charts, and plots, to create SQL queries, and to perform other analyses and operations on your data. They also provide ways to manage and print SAS files.
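As a brief illustration of a DATA step followed by a PROC step, here is a hedged sketch. It is written in Python using the saspy package (an open-source Python interface to SAS) so the two steps can be submitted programmatically; it assumes a licensed, configured SAS installation, and the data set name work.scores and its variables are invented for the example.

import saspy  # third-party package: pip install saspy; needs a configured SAS installation

sas = saspy.SASsession()  # start a SAS session with the default configuration

result = sas.submit("""
data work.scores;               /* DATA step: read inline records into a data set */
   input name $ score;
   datalines;
Alice 90
Bob 85
Carol 78
;
run;

proc means data=work.scores;    /* PROC step: summary statistics for the data set */
   var score;
run;
""")

print(result["LST"])  # sas.submit returns a dict with the log ('LOG') and listing ('LST')
sas.endsas()

The DATA step builds the data set; the PROC step then analyzes it, which is exactly the two-group division of SAS statements described above.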

SAS Macro Facility

Base SAS software includes the SAS Macro Facility, a powerful programming tool for extending and customizing your SAS programs, and for reducing the amount of code that you must enter to do common tasks. Macros are SAS files that contain compiled macro program statements and stored text. You can use macros to automatically generate SAS statements and commands, write messages to the SAS log, accept input, or create and change the values of macro variables.
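To give the flavor of the macro facility, here is a similarly hedged sketch, again submitted through saspy under the same assumptions as the previous example; the macro name show and its parameter are invented for illustration.

import saspy  # see the assumptions noted in the previous sketch

sas = saspy.SASsession()

result = sas.submit("""
%macro show(dsname);            /* define a macro with one parameter           */
   proc print data=&dsname;     /* &dsname resolves to the caller's argument   */
   run;
%mend show;

%show(sashelp.class)            /* invoking the macro generates the PROC PRINT */
""")

print(result["LST"])
sas.endsas()

Because the macro expands into ordinary SAS statements before execution, it serves precisely the text-reducing, program-customizing role described above.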