

Tushar Jain


NO SQL = Not Only SQL

Table of Contents
1. Structured vs. Unstructured Data
2. Data Management in the Web 2.0 Era
3. Common features of NoSQL DBMS
4. Classification
5. Design Philosophy
6. Scalability Challenge
7. Key considerations in NOSQL adoption
8. Some Application Design Considerations


1. Structured vs. Unstructured Data

The need to store and retrieve data in some form has existed since the invention of computers and storage devices. Storage media have ranged from punch cards to today's ubiquitous hard drives, NAS and SAN solutions and, of late, virtualized storage options, but all of these relate only to the physical aspect of storage. The manner in which data gets to these devices has historically been categorized into two types: structured and unstructured. Structured data storage and retrieval was traditionally done through file access mechanisms such as sequential and random access files. Then came databases, hierarchical and relational, which added a whole new chapter to this story. Some are also experimenting with object and XML storage solutions. Of late, with the advent of Web 2.0, data storage and management have taken on a whole new dimension, and the demands of dealing with this data have necessitated innovative and radical departures from traditional mechanisms.

2. Data Management in the Web 2.0 Era

With the growing penetration of electronic networks, the management and architecture of structured and unstructured data is becoming challenging and is affecting the way data is stored and processed. In many of today's Web 2.0 businesses such as Google, Facebook, Twitter and others, it is not unheard of to process terabytes and even petabytes of data. The architectural challenge of handling such huge masses of data, along with real-time or near real-time access to it, has led to a broad movement to find alternatives to the RDBMS prevalent in enterprise business applications.

In any typical software application, data can be classified into two groups on the basis of its age. The first class of data is created in real time and accessed frequently at a given point of time. The second class is a collection of once real-time data along with master and configuration data. Apart from this, much of the data in today's software systems is hierarchical, key/value or graph-structured, which makes it difficult to store, retrieve, update and process in traditional relational databases (RDBMS). These challenges often result in reduced performance, increased hardware, manpower and license costs, and scalability pains.

To tackle such challenges, one of the emerging solutions is NOSQL, which is interpreted by the software community as Not Only SQL. NOSQL implementations focus on dynamic scalability, high availability, real-time access and storage virtualization, while pushing features like consistency and transaction management down a few rungs on the priority ladder. As such, the NOSQL approach is applicable where the data store's ACID guarantees can be relaxed, where join operations can be avoided, and where horizontal scaling is required.


NOSQL systems often provide weak consistency guarantees, such as eventual consistency, and transactions restricted to single data items. However, stronger ACID guarantees can be layered on top by adding a middleware layer to a NOSQL system. Not providing relational capabilities makes it much easier to scale data storage, because the system does not pay the costs associated with relational guarantees. Several NOSQL DBMS employ a distributed architecture, with the data held redundantly on several servers, often using a distributed hash table. In this way, the system can be scaled out easily by adding more servers, and the failure of a server can be tolerated. The CAP theorem still applies: such systems typically give up some consistency in exchange for availability and partition tolerance.
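The idea of holding data redundantly on several servers through a distributed hash table can be sketched with a small consistent-hash ring. This is only an illustration under stated assumptions: the class name `HashRing`, the `replicas` and `vnodes` parameters, and the use of MD5 to spread keys are hypothetical choices for this sketch, not the design of any particular NOSQL product.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only to spread keys)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; names and defaults are illustrative assumptions."""

    def __init__(self, servers, replicas=2, vnodes=64):
        self.replicas = replicas  # number of distinct servers holding each key
        self.vnodes = vnodes      # ring positions per server, to even out the load
        self._ring = []           # sorted list of (position, server)
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server):
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def servers_for(self, key):
        """Return the distinct servers that should hold copies of `key`."""
        if not self._ring:
            return []
        start = bisect.bisect(self._ring, (_hash(key), ""))
        owners = []
        for offset in range(len(self._ring)):
            server = self._ring[(start + offset) % len(self._ring)][1]
            if server not in owners:
                owners.append(server)
            if len(owners) == self.replicas:
                break
        return owners
```

Because each key is assigned by walking clockwise from its hash position, removing a server only hands that server's keys to the next servers on the ring; the surviving replica of a key stays where it was, which is what lets the cluster tolerate a failed node.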

3. Common features of NoSQL DBMS

- Easy to use in conventional load-balanced clusters
- Persist data (they are not just caches)
- Scale to available memory
- Have no fixed schemas and allow schema migration without downtime
- Have individual query systems rather than a standard query language
- Are ACID within a node of the cluster and eventually consistent across the cluster

Not every implementation in the NOSQL realm has every one of these properties, but the majority of these DBMS support most of them.

4. Classification
On the basis of storage medium, NOSQL systems can be classified as:
1. Data remains in memory: These DBMS are based on the premise that memory is no riskier than disk when it is run over distributed, redundant machines, and that it then provides a higher level of reliability and performance throughput. A few implementations of such systems are Memcached, GigaSpaces XAP, Scalaris, Redis, etc.
2. Data is stored on disk: These DBMS are typically based on the premises of both key/value pairs and distributed storage. Prominent implementations in this class are CouchDB, MongoDB, Riak, Voldemort, etc.
3. Configurable: These DBMS try to use the best of both memory and disk, allowing configuration of how large the Memtable (the in-memory write buffer) can get, which provides a lot of control. Examples of this genre are Cassandra, BigTable, Hypertable, HBase, etc.

On the basis of data model complexity, NOSQL systems can be classified as:
1. Key-Value DBMS: This is like an RDBMS in which there can only be a single, three-column entity-attribute-value table, and in which one cannot do self-joins. (In that analogy, the key part of the key-value pair may be thought of as an entity-attribute composite.)


Thus, any concept of "object" has to live in the application logic. Key-value stores can have performance advantages even over the more efficient implementations of other models.
Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB
Strengths: Fast lookups
Weaknesses: Stored data has no schema
Example application: A forum application where the home profile page gives the user's statistics (messages posted, etc.) and the last ten messages by the user. The page reads from a key that is based on the user's id and retrieves a string of JSON that represents all the relevant information. A background process fetches the information every 15 minutes and writes it to the DBMS independently.

2. Quasi Tabular DBMS: In this type of store one can store data in rows without worrying about the number of columns in a row, so each row can have its own number of columns. This means the schema is controlled by the application program rather than by a DBA. The original of this breed is Google BigTable; other examples are Cassandra and HBase. Well-known deployments of Cassandra or HBase include Facebook, Twitter, Digg and StumbleUpon.
Examples: Cassandra, HBase
Strengths: Fast lookups, good distributed storage of data
Weaknesses: Very low-level API
Example application: A news site where any piece of content (articles, comments, author profiles) can be voted on, with an optional comment supplied on the vote. One store is created per user and one per piece of content, using a UUID as the key (one is generated for each piece of content and each user). The user's store holds every vote the user has ever made, while the content "bucket" contains a copy of every vote that has been made on the piece of content. An overnight batch job identifies content that users have voted on and generates, for each user, a list of content that has high votes but which the user has not voted on. This list of recommended articles is then pushed into the user's "bucket".

3. Document/Object DBMS: This type of store keeps documents or objects (XML or JSON) as collections of name-value pairs. CouchDB and MongoDB are well-known examples of this breed; they support indexing, querying and/or updating individual "fields" within the document schema.
Document DBMS examples: CouchDB, MongoDB
Strengths: Tolerant of incomplete data
Weaknesses: Query performance, no standard query syntax
Example application: An application that creates profiles of refugees with the aim of reuniting them with their families. The details for each person vary tremendously with the circumstances of the event and are built up piecemeal. For example, a young child may know her first name and one can photograph her, but she may not know her parents' first names. Later a local may claim to recognize the child and provide additional information that must be recorded, but until it is verified one has to treat it skeptically.
Object DBMS examples: Coherence, db4o, ObjectStore, GemStone, Polar
Strengths: Matches the OO development paradigm, low-latency ACID, mature technology
Weaknesses: Limited querying or batch-update options
Example application: A global trading company has a monoculture of development and wants trades done on desks in Japan and New York to pass through a risk checking process in London. An object representing the trade is pushed into the object store, and the risk checker listens for the appearance or modification of trade objects. When the object is replicated into the local European space, the risk checker reads the trade and assesses the risk. It then rewrites the object to indicate that the trade is approved and generates an actual trade fulfilment request. The trader's client listens for changes to objects that contain the trader's id and updates the local detail of the trade in the client, indicating to the trader that the trade has been approved. The trading system consumes the trade fulfilment request and, when the trade elapses or is fulfilled, feeds the information back to the risk assessor.

4. Graph DBMS: DBMS based on graph data models are also suggested to be part of NoSQL, as are the file systems that underlie many MapReduce implementations. As a general rule, however, those data models are most effective for analytic use cases somewhat apart from the NoSQL mainstream. AllegroGraph is one example in this genre.
Typical applications: Social networking, recommendations
Strengths: Graph algorithms, e.g. shortest path, connectedness, n-degree relationships, etc.
Weaknesses: May have to traverse the entire graph to achieve a definitive answer. Not easy to cluster.
Example application: Social networking is best suited to a graph database. The same principles can be extended to any application where one needs to understand what people are doing, buying or enjoying in order to recommend further things for them to do, buy or like. Any time one needs to answer a question along the lines of "What restaurants do the sisters of boys who are over 30, enjoy skating and have visited Europe dislike?", a graph database will usually help.

On the basis of distribution of data, NOSQL DBMS can be classified as distributed and not distributed. Distributed DBMS assume the responsibility for data partitioning (for scalability) and replication (for availability) and do not leave that to the client. Examples of distributed DBMS are Amazon S3, Scalaris, Voldemort, CouchDB, Riak, MongoDB, BigTable, Cassandra, Hypertable, HBase, etc. Examples of not distributed DBMS are Tokyo Tyrant, Amazon SimpleDB, Redis, Memcached, etc.

Not all non-RDBMS are NOSQL, such as:

- StreamBase, Skyler: real-time stream processing
- MarkLogic: semi-structured data
- Vertica, Greenplum: mid-range data warehousing
- Aster: large-scale (aka big data) analytic data warehousing
- VoltDB: high-volume transaction processing
- MATLAB: scientific data management

Do not forget LDAP and other directory systems.
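To make the key-value model in the classification above more concrete, the forum profile example can be sketched as follows. This is a minimal sketch, not a real client: a plain dictionary stands in for the key-value DBMS, and the names `store`, `profile_key`, `refresh_profile` and `load_profile` are hypothetical.

```python
import json

# Stands in for a key-value DBMS: opaque string keys map to opaque string values.
store = {}


def profile_key(user_id):
    """Build the lookup key from the user's id."""
    return f"profile:{user_id}"


def refresh_profile(user_id, messages):
    """Background job: recompute the profile and overwrite the value in one write."""
    summary = {
        "user_id": user_id,
        "messages_posted": len(messages),
        "last_messages": messages[-10:],  # only the last ten messages are kept
    }
    store[profile_key(user_id)] = json.dumps(summary)


def load_profile(user_id):
    """Page request: a single lookup by key, decoded in application code."""
    raw = store.get(profile_key(user_id))
    return json.loads(raw) if raw is not None else None
```

The store never interprets the JSON string, so any notion of a "profile object" lives entirely in the application, which is the trade-off the key-value description points out.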

5. Design Philosophy
The design premises for NoSQL DBMS are:

- Transaction semantics are unimportant, and locking is annoying.
- Joins are also unimportant, especially complex joins.
- There are some benefits to having a DBMS even so.

NoSQL DBMS further incorporate one or more of the following assumptions:

- The database will be big enough that it should be scaled across multiple servers.
- The application should run well if the database is replicated across multiple geographically distributed data centers, even if the connection between them is temporarily lost.
- The application should run well if the database is replicated across a host server and a bunch of occasionally-connected mobile devices.

In addition, NoSQL advocates commonly favor the idea that a database should have no fixed schema, other than whatever emerges as a byproduct of the application-writing process.

Much of the innovation in the NoSQL arena revolves around "consistency," but that word does not mean the same thing as it does in ACID (Atomicity, Consistency, Isolation, Durability). If anything, consistency here is closer to "durability," in that it refers to the desirable property of getting a correct answer back from the DBMS even in a condition of (partial) failure. In essence, there are three reasonable approaches to consistency in a replicated data scenario:
1. Traditional/near-perfect consistency: Processing stops until the system is assured that an update has propagated to all replicas. (This is typically enforced via a two-phase commit protocol.) The downside of this model, of course, is that a single node failure can bring at least part of the system to a halt.
2. Eventual consistency: Inaccurate reads are permissible so long as the data is synchronized "eventually." With eventual consistency, the network is rarely a bottleneck, but data accuracy may be less than ideal.
3. Read-your-writes (RYW) consistency: Data from any single write is guaranteed to be read accurately, even in the face of a small number of network outages or node failures. However, a sequence of errors can conceivably produce inaccurate reads in ways that perfect consistency would forbid.


Some implementations allow tuning (for example, through configuration) of which consistency model is supported; others are more locked in to a particular choice.
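The difference between an eventually consistent read and a read-your-writes read can be illustrated with a toy replicated register. This is a sketch under simplifying assumptions: the classes `Replica` and `Cluster`, the single-writer version counter, and the explicit `propagate` step are all hypothetical and stand in for the asynchronous replication of a real system. Read-your-writes is modelled here in its session form, where a client remembers the version of its own last write.

```python
class Replica:
    """One copy of the data, tagged with the version of the last update it has seen."""

    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        if version > self.version:  # ignore updates older than what is stored
            self.version, self.value = version, value


class Cluster:
    """Toy cluster: writes land on replica 0 and reach the others only on propagate()."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # queued updates: (replica_index, version, value)
        self.clock = 0

    def write(self, value):
        """Apply to replica 0 immediately and queue the update for the rest."""
        self.clock += 1
        self.replicas[0].apply(self.clock, value)
        for i in range(1, len(self.replicas)):
            self.pending.append((i, self.clock, value))
        return self.clock  # the client keeps this version for later reads

    def propagate(self):
        """Deliver all queued updates, as background replication eventually would."""
        for i, version, value in self.pending:
            self.replicas[i].apply(version, value)
        self.pending.clear()

    def read_eventual(self, replica_index):
        """Eventual consistency: answer from whichever replica was asked."""
        return self.replicas[replica_index].value

    def read_your_writes(self, min_version):
        """Answer only from a replica that has seen at least `min_version`."""
        for replica in self.replicas:
            if replica.version >= min_version:
                return replica.value
        raise RuntimeError("no replica has caught up yet")
```

Before `propagate()` runs, `read_eventual` on a lagging replica can return stale data, while `read_your_writes` with the version returned by `write` still returns the client's own update.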

6. Scalability Challenge

The relative positions in the picture above are obviously debatable, but it serves the purpose: key-value DBMS and BigTable clones handle size really well. This is because they have data models that can easily be partitioned horizontally, which is great for scaling out. The drawback, however, is that by constraining themselves to simpler data models, key-value DBMS have pushed complexity up the stack. If one has data with a non-trivial structure, one has to compensate for the simple data model by adding more complex functionality in the upper layers. Document and graph databases, on the other hand, have opted for richer data models. They have more powerful abstractions that make it easy to model both simple and complex domains. But these richer data models introduce more coupling of data, and it is therefore more challenging to get them to scale to size.
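What "pushing complexity up the stack" means in practice can be sketched with a lookup that a relational database would answer from an index, but that the application has to maintain by hand on top of a key-value model. The class `UserStore` and its `city` field are hypothetical, and two dictionaries stand in for the key-value DBMS.

```python
class UserStore:
    """Key-value data plus a hand-maintained secondary index (illustrative only)."""

    def __init__(self):
        self._kv = {}       # primary data: user_id -> record
        self._by_city = {}  # secondary index kept by the application: city -> user_ids

    def put(self, user_id, record):
        old = self._kv.get(user_id)
        if old is not None:
            # The application, not the DBMS, must remove the stale index entry.
            self._by_city[old["city"]].discard(user_id)
        self._kv[user_id] = record
        self._by_city.setdefault(record["city"], set()).add(user_id)

    def get(self, user_id):
        return self._kv.get(user_id)

    def find_by_city(self, city):
        return sorted(self._by_city.get(city, set()))
```

In a real distributed store the two writes in `put` would not be atomic, so the application would also need a strategy for repairing the index after a partial failure.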

7. Key considerations in NOSQL adoption

One of the fundamental drivers for switching to NOSQL is that one faces business challenges that are difficult to solve using an RDBMS. If one has an excellent relational model running on a mature RDBMS that provides all the features one needs, then there is probably no need to change DBMS.


Immaturity: The term "NoSQL" in its current sense has only been around since 2009. Most NoSQL "products" are open source projects backed by tiny companies, though exceptions exist. For most RDBMS, maturity is insurance: they are stable and have rich functionality. In comparison, most NOSQL alternatives are in their infancy. Living on the technological leading edge is an exciting prospect for many developers, but enterprises should approach it with extreme caution.

Open source and support: Many NoSQL adopters are constrained, by money or ideology, to avoid closed-source products. Conversely, it is difficult to deal with NoSQL products' immaturity unless you are comfortable with the rough-and-tumble of open source software development. Getting support across geographies will be challenging.

Analytics and business intelligence: NoSQL databases have evolved to meet the scaling demands of Web 2.0 applications, so most of their features are oriented toward the demands of these applications. However, data in an application has value to the business that goes beyond the insert-read-update-delete cycle of a typical Web application. Businesses mine information in corporate databases to improve their efficiency and competitiveness, and business intelligence (BI) is a key IT issue for all medium to large companies. NoSQL databases offer few facilities for ad-hoc query and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL. Some products, like Hive or Pig, provide easier access to data in Hadoop clusters. Quest Software also offers Toad for Cloud Databases, which can provide ad-hoc query capabilities to a variety of NoSQL databases.

Expertise availability: There are literally millions of developers and DBAs throughout the world, and in every business segment, who are familiar with RDBMS concepts and programming. In contrast, almost every NoSQL developer is in a learning mode. This situation will resolve itself naturally over time, but for now it is far easier to find experienced RDBMS programmers or administrators than a NoSQL expert.

Project size: For a large (and suitable) project, the advantages of NoSQL technology may be large enough to outweigh its disadvantages. For a small, ultimately disposable project, the disadvantages of NoSQL may be minor. In between those extremes, one may be better off with SQL.

Choice: The choice of DBMS goes far beyond NOSQL and the Big Daddies (Oracle, IBM DB2, Microsoft SQL Server, and SAP/Sybase Adaptive Server). There are MySQL, PostgreSQL, and other mid-range SQL DBMS, open source or otherwise. If your needs are more analytic, there is a whole range of powerful and cost-effective specialized products from Aster Data, EMC/Greenplum, Teradata, and others.

Elastic scaling: For years, database administrators have relied on scaling up (buying bigger servers as database load increases) rather than scaling out (distributing the database across multiple hosts as load increases). However, as transaction rates and availability requirements increase, and as databases move into the cloud or onto virtualized environments, the economic advantages of scaling out on commodity hardware become irresistible. RDBMS might not scale out easily on commodity clusters, but the new breed of NoSQL databases is designed to expand transparently to take advantage of new nodes, usually with low-cost commodity hardware in mind.

Humungous data volume: Just as transaction rates have grown beyond RDBMS capability over the last decade, the volumes of data being stored have also increased massively. Today, the volumes of data that can be handled by NoSQL systems, such as Hadoop, outstrip what can be handled by the biggest RDBMS.

Administration: Despite the many manageability improvements claimed (as well as achieved) by RDBMS implementations over the years, high-end RDBMS can be maintained only with the assistance of expensive, highly trained DBAs. DBAs are intimately involved in the design, installation, and ongoing tuning of high-end RDBMS. NoSQL databases are generally designed from the ground up to require less management: automatic repair, data distribution, and simpler data models lead, in theory, to lower administration and tuning requirements. In practice, it is likely that rumors of the DBA's death have been slightly exaggerated. Someone will always be accountable for the performance and availability of any mission-critical DBMS, and even for its installation.

Economics: NoSQL databases typically use clusters of cheap commodity servers to manage exploding data and transaction volumes, while RDBMS tend to rely on expensive servers and storage systems. The result is that the cost per gigabyte or per transaction/second for NoSQL can be many times less than the cost for RDBMS, allowing one to store and process more data at a much lower price point.

Schema mutability: If you would like to have different schemas for different parts of the same "table," NoSQL may be for you. If you like the database reusability guarantees of the relational model, NoSQL is not for you.

Flexible data models: Change management is a big headache for large production RDBMS. Even minor changes to the data model of an RDBMS have to be carefully managed and may necessitate downtime or reduced service levels. NoSQL databases have far more relaxed, or even nonexistent, data model restrictions. NoSQL key-value stores and document databases allow the application to store virtually any structure it wants in a data element. Even the more rigidly defined BigTable-based NoSQL databases (Cassandra, HBase) typically allow new columns to be created without too much fuss. The result is that application changes and database schema changes do not have to be managed as one complicated change unit. In theory, this allows applications to iterate faster, though, clearly, there can be undesirable side effects if the application fails to manage data integrity.

Data life cycle: All data has a meaningful life cycle, and very little data is meaningful forever. Data used to run the business often needs to stick around for at least 7 years (a compliance requirement), but beyond that it has diminishing value. Shopping carts may only be meaningful for a few days or weeks. How long the data is meaningful is often one of the drivers in the NOSQL decision.

Data availability: Obviously, if you stored the data, you would like to get back to it. Again, it is important to understand the impact on your application should the data be temporarily unavailable. Data that is always available is costly to achieve and comes with other interesting challenges (i.e. the CAP theorem applies).

Transaction volume and mix: What volume of transactions will the data need to support? And what is the mix of reads to writes? The volume of data plays a role here as well, but in most scale problems it is the transaction rate, more than the data volume, that presents challenges in scaling.

Non-portability: Most NOSQL products are not interoperable (there is no standard comparable to SQL-92). An application developed on Hadoop cannot be moved to CouchDB without significant code changes.

Ad hoc data fixing, querying and reporting: With the non-distributed NoSQL stores, which do possess a query and manipulation language, ad hoc fixing is easier, while it is harder with distributed ones (Voldemort, Cassandra, etc.). The better the query capabilities (CouchDB, MongoDB), the easier ad hoc reporting becomes. For some of those reporting woes, Hadoop is a solution. But remember that if an application is considering NOSQL, its data store has probably already crossed the limits of ad hoc query.

Query language maturity: Unlike RDBMS, there is no common query language for NOSQL. SPARQL, a standard for querying RDF data, is gaining acceptance but is not yet as mature as SQL.

Here are some use cases where it is sub-optimal to use RDBMS:

- The relational database will not scale to the traffic at an acceptable cost.
- Data is supplied in small updates spread over time, so the number of tables required to maintain a normal form has grown disproportionately to the data being held.
- The business generates a lot of temporary data that does not really belong in the main data store. Common examples include shopping carts, retained searches, site personalization and incomplete user questionnaires.
- The RDBMS has already been de-normalized for reasons of performance or for convenience in manipulating the data in the application.
- The dataset consists of large quantities of text or images and the column definition is simply a Large Object (CLOB or BLOB).
- Queries against the data do not involve simple hierarchical relations; common examples are recommendations or business intelligence questions that involve an absence of data. For the latter, consider "all women in Paris who do have a dog and whose ex sister-in-laws have not yet purchased a paperback this year" as a contrived example; "all people in a social network who have not purchased a book this year who are once removed from people who have" is a real one if one wants to target advertising on a site that says "Fred bought X".
- Local data transactions do not have to be very durable. An example is "liking" items on websites: creating transactions for this kind of interaction is overkill, because if the action fails the user is likely to just repeat it until it works. AJAX-heavy websites tend to have a lot of these use cases.
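The social network query in the list above ("people who have not purchased a book this year who are once removed from people who have") can be sketched as a short traversal over an adjacency list. The sample data and the function name `ad_targets` are hypothetical, and "once removed" is assumed here to mean a direct friend.

```python
# Hypothetical sample data: an undirected friendship graph as an adjacency list.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann"},
    "dan": {"bob"},
}
bought_book_this_year = {"ann"}


def ad_targets(friends, buyers):
    """People who have not bought, but have at least one direct friend who has."""
    targets = set()
    for buyer in buyers:
        for friend in friends.get(buyer, set()):
            if friend not in buyers:
                targets.add(friend)
    return targets
```

With this data, `ad_targets(friends, bought_book_this_year)` yields bob and cat: the traversal starts from the buyers and follows edges outward, which is the access pattern a graph store is built for and which is awkward to express as relational joins over an absence of rows.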

8. Some Application Design Considerations

1. The NOSQL DBMS will not provide any validation logic (data types), correlated data, integrity constraints, or business or technical logic (triggers, procedures or functions).
2. Relaxed consistency is the norm in NOSQL. Deal with this in the data management layer of the application code.
3. NOSQL generally does not support transactions beyond single data items; deal with them in application code.
4. Database management needs to move to a two-layer architecture, separating the concerns of data modeling and data storage.
5. With this two-layered approach, the data storage server should be coupled to a particular data model manager that ensures consistency and integrity. All access must go through this data model manager to protect the invariants enforced by the managerial layer.
6. With a coupled management layer, storage servers are most efficiently accessed through a programmatic API, preferably keeping the storage system in-process to minimize communication overhead.

NOSQL is not "No To SQL". NOSQL means Not Only SQL, as in: in the future, the persistence layer will consist of Not Only SQL databases but also key-value stores, graph databases and more.
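The two-layer architecture described in the considerations above can be sketched as a storage layer that knows nothing about the data, wrapped by a data model manager that enforces the invariants. The classes `InMemoryStorage` and `AccountManager` and the account example are hypothetical; the in-process dictionary stands in for a real NOSQL store.

```python
class InMemoryStorage:
    """Storage layer: stores and returns values without interpreting them."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class AccountManager:
    """Data model layer: the only code that is allowed to touch the storage layer."""

    def __init__(self, storage):
        self._storage = storage

    def open_account(self, account_id, balance=0):
        # Validation the database would otherwise do (types, uniqueness).
        if not isinstance(balance, int) or balance < 0:
            raise ValueError("balance must be a non-negative integer")
        if self._storage.get(account_id) is not None:
            raise ValueError("account already exists")
        self._storage.put(account_id, {"balance": balance})

    def withdraw(self, account_id, amount):
        account = self._storage.get(account_id)
        if account is None:
            raise KeyError(account_id)
        # Invariant enforced here, not in the store: a balance never goes negative.
        if amount <= 0 or amount > account["balance"]:
            raise ValueError("invalid amount")
        self._storage.put(account_id, {"balance": account["balance"] - amount})

    def balance(self, account_id):
        account = self._storage.get(account_id)
        if account is None:
            raise KeyError(account_id)
        return account["balance"]
```

If any other code wrote to the storage layer directly, the non-negative balance rule could be bypassed, which is why all access has to go through the manager.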
