
QUICK REFERENCE GUIDE

TO OPEN-SOURCE NOSQL DATABASES


John Schulz, Principal Consultant Open-Source Databases
Derek Downey, Open-Source Database Practice Advocate

WHY OPEN-SOURCE NOSQL?


With the cost and complexity of many of the leading RDBMS solutions, many organizations are exploring new,
more cost-effective ways to manage and process large amounts of data. This often leads them to look at
alternatives to RDBMS technologies, such as open-source NoSQL solutions.

NoSQL databases are non-relational and are designed for storing unstructured data. Because NoSQL is built
specifically to meet the requirements of big data, mobile applications, and the Internet of Things (IoT), it
provides the flexibility, scalability, and performance many businesses need to drive new, innovative services
and revenue streams.

Most NoSQL solutions are open-source, a particularly attractive option if cost and vendor independence are
important to you.

Making a move to an open-source NoSQL technology could be right for your organization, depending on your
specific needs, your applications, and the type and volume of your data. To help you decide if it's the right
choice, either as a complementary technology to an existing RDBMS or as a complete replacement, we've
created this quick reference guide. It outlines five of the top open-source NoSQL databases, and provides
overviews, use cases, limitations, and support options.

www.pythian.com
CASSANDRA
OVERVIEW
Apache Cassandra is an eventually consistent distributed database designed
to accept very high write rates and to operate over a geographically
distributed environment.

Data is stored in partitions and rows. Partitions are used for sharding and rows
are very similar to RDBMS tuples or rows. For example, a partition could be a
user_id and rows could be a user's class grades and test scores.
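
The partition-and-row layout described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the names `partitions`, `add_row`, and `node_for` are hypothetical, and the CRC32-modulo placement stands in for Cassandra's actual token-based partitioner.

```python
import zlib
from collections import defaultdict

# Hypothetical in-memory model: partition key (user_id) -> list of rows.
partitions = defaultdict(list)


def add_row(user_id, course, test_name, score):
    """Append one row (a grade or test score) to the user's partition."""
    partitions[user_id].append(
        {"course": course, "test": test_name, "score": score}
    )


def node_for(user_id, node_count):
    """Choose a node from the partition key alone (simplified sharding).

    Assumption: CRC32 modulo node count is used only for illustration.
    """
    return zlib.crc32(user_id.encode("utf-8")) % node_count


add_row("user-42", "math", "midterm", 91)
add_row("user-42", "math", "final", 88)
add_row("user-7", "history", "quiz-1", 75)

# All rows for one user live in one partition, so they land on the same node.
print(len(partitions["user-42"]), node_for("user-42", 3))
```

Because placement depends only on the partition key, every row for a given user_id is stored and read together, which is why the choice of partition key drives the data model.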

Cassandra is well designed to support worldwide distributed operations where
eventual consistency and an SQL-like language are desired.

Cassandra does not support ACID (Atomicity, Consistency, Isolation, Durability)
operations.

REASONS TO USE INSTEAD OF AN RDBMS

Built-in geographic distribution.
Very high availability.
Scales linearly by adding nodes to the cluster.
Extremely good for time series data (IoT, transaction logging, customer history records).
Has no single point of failure; in fact, has no concept of failover because all nodes are equal.
Your application and data do not require strict ACID capabilities and are capable of accepting stale reads.
Your application can tolerate eventual consistency, especially at the geographically distributed level.
Your application typically modifies one document (row) at a time.
You want native high-availability capability.
You want native sharding capability.
You need or want true multi-master replication.
You can write to any node and read from any node.

LIMITATIONS

Does not natively support joins or true SQL. Note: Apache Spark and Apache Ignite can run on top of Cassandra and provide SQL support and joins. However, expect performance degradation.
Very limited support for aggregates.
Cassandra data models will include a degree of duplication of data.
Physical data models must closely match usage patterns for Cassandra to work at all.
Taking a consistent point-in-time backup from sharded clusters is non-trivial, and we recommend you have a file system that supports snapshots.
Encryption options are limited without an enterprise license. As a workaround, you can use a third-party provider such as Vormetric.

SUPPORT OPTIONS
Cassandra is available as open-source software via the Apache 2.0 license.
DataStax offers a licensed product called DataStax Enterprise Edition, which is based on open-source Cassandra.

COUCHBASE SERVER
OVERVIEW
Couchbase Server is a document-store NoSQL solution that provides fast reads and writes. A Couchbase
document is in JavaScript Object Notation (JSON) format and is similar to a row or tuple in traditional RDBMS
terminology. However, Couchbase allows much more complex data representations than traditional RDBMS rows.
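
As a rough illustration of how a JSON document can hold more structure than a flat RDBMS row, the sketch below builds a nested document with Python's standard `json` module. The field names and values are hypothetical, and no Couchbase API is used.

```python
import json

# Hypothetical user document: nested lists and objects live inside one record,
# where a relational design would typically need several joined tables.
user_doc = {
    "type": "user",
    "user_id": "u1001",
    "name": "Ada",
    "grades": [
        {"course": "math", "tests": [{"name": "midterm", "score": 91}]},
        {"course": "history", "tests": []},
    ],
}

# Round-trip through JSON text, the format in which the document is stored.
encoded = json.dumps(user_doc, sort_keys=True)
decoded = json.loads(encoded)

print(decoded["grades"][0]["tests"][0]["score"])
```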

Couchbase is a great tool for providing built-in high-availability and sharding capabilities for data sets that do
not require full transactional ACID compliance.

REASONS TO USE INSTEAD OF AN RDBMS

Couchbase N1QL is similar to SQL.
Very high read/write capability makes Couchbase ideal to act as a persistent cache in front of other databases.
It is attractive to many developers because it's relatively schemaless.
Couchbase can scale linearly beyond a single server across a cluster of geographically distributed servers.
Indexes are defined in terms of MapReduce and do not need to be based on actual column values.
Your application typically modifies one document (row) at a time.
You want native high-availability capability.
You want native sharding capability.

LIMITATIONS

Couchbase's asynchronous write back strategy can result in data loss on some shards/buckets that will eventually be recovered.
Couchbase N1QL is similar to SQL but it's much more limited in functionality.
Enabling durability via write concern across members of a bucket can drastically decrease performance.
Taking a consistent point-in-time backup from sharded clusters is non-trivial, and we recommend you have a file system that supports snapshots.
Encryption options are limited without an enterprise license. As a workaround, you can use a third-party provider such as Vormetric.

SUPPORT OPTIONS
Couchbase is available as open-source software via the Couchbase community license; it is available only in
object code form from the Couchbase website.
Couchbase is also available via the Couchbase enterprise license.

HBASE
OVERVIEW
Apache HBase is a distributed database designed to accept very high write rates and to operate on top of a
Hadoop Distributed File System (HDFS) cluster. Unlike Cassandra, HBase provides strongly consistent reads and
writes for a single row rather than eventual consistency, but it is otherwise conceptually very similar to Cassandra.

Data is stored in large rows very similar to Google Cloud Bigtable. Rows are similar to RDBMS tuples or rows
except there is no need to define each column before using it. For example, a row key could be a user_id and
columns in the row could be a user's class grades and test scores.
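
The wide-row idea can be sketched with a dictionary of dictionaries. This is a toy model under stated assumptions, not the HBase client API: `table`, `put`, and `get` are hypothetical names, and the `family:qualifier` column naming is used here only as an illustrative convention.

```python
# Hypothetical wide-row table: row key -> {column name -> value}.
table = {}


def put(row_key, column, value):
    """Write one cell; the column does not have to exist beforehand."""
    table.setdefault(row_key, {})[column] = value


def get(row_key):
    """Return all cells of a row, sorted by column name."""
    return dict(sorted(table.get(row_key, {}).items()))


put("user-42", "grades:math", "A")
put("user-42", "scores:midterm", 91)
put("user-7", "grades:history", "B")  # a different row with different columns

print(get("user-42"))
```

Each row carries only the columns that were actually written to it, so two rows in the same table can have entirely different column sets.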

HBase does not support ACID operations.

REASONS TO USE INSTEAD OF AN RDBMS

Scales linearly by adding nodes to the underlying HDFS cluster.
Requires a minimal schema.
Supports very high write rates.
Is accessible from Apache Hive, Apache Pig, Spark SQL, and MapReduce.
Is a good choice for storing data accessible by key if the data needs to be accessed by other Hadoop tools.

LIMITATIONS

No secondary index support.
Very limited schema features.
Very limited support for aggregates.
Requires HDFS and Apache ZooKeeper to operate; if access by other Hadoop ecosystem tools is not required, Cassandra is probably a better choice.
Access by programming languages outside of the Java virtual machine (JVM) ecosystem is limited to a poorly defined Thrift interface.

SUPPORT OPTIONS
Available as open-source software via the Apache 2.0 license.
Enterprise licenses available from Cloudera, Hortonworks, and MapR Technologies.

MONGODB
OVERVIEW
MongoDB is a cross-platform NoSQL solution that uses a document-oriented data model. A MongoDB collection
is analogous to a table in a traditional RDBMS, and a MongoDB document is analogous to a row in an SQL table.
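
To make the collection and document analogy concrete, here is a minimal sketch in plain Python. It does not use a MongoDB driver; `students`, `add_document`, and `find_documents` are hypothetical helpers that only model the idea of documents in one collection carrying different fields.

```python
# Hypothetical collection: an ordered list of documents (dictionaries).
students = []


def add_document(collection, document):
    """Store a copy of the document; no column definitions are required."""
    collection.append(dict(document))


def find_documents(collection, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


add_document(students, {"_id": 1, "name": "Ada", "major": "math"})
# The second document adds an "advisor" field without any schema change.
add_document(students, {"_id": 2, "name": "Grace", "major": "math", "advisor": "Dr. Lee"})

print(len(find_documents(students, major="math")))
```

A field that is missing from a document simply never matches, which is the behavior an application has to handle itself when no schema enforces the presence of columns.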

REASONS TO USE INSTEAD OF AN RDBMS

MongoDB is touted as being schemaless. This means that adding fields to and removing them from a collection can be done trivially, without traditional SQL ALTER TABLE and locking requirements.
Your data is denormalized and has no requirements for joins.
Your application and data do not require strict ACID capabilities and are capable of accepting stale reads.
Your application typically modifies one document (row) at a time.
You want native high-availability capability.
You want native sharding capability.

LIMITATIONS

Joins are not available until MongoDB 3.2, and even then are a questionable use case to keep reads and writes fast.
Even though MongoDB is schemaless, it still requires proper indexing to keep access fast.
Scaling reads to multiple members of a replica set is not recommended because reads are considered eventually consistent and could be out of date.
Enabling durability via write concern across members of a replica set can drastically decrease performance.
Taking a consistent point-in-time backup from sharded clusters is non-trivial, and we recommend you have a file system that supports snapshots.
Encryption options are limited without an enterprise license or using a third-party provider such as Percona Server for MongoDB.

SUPPORT OPTIONS
MongoDB is available as open-source software via the AGPL v3.0 license.
MongoDB, Inc. provides an enterprise license that includes monitoring, deployment, backups, and support.
Additional features such as encryption are only available with the enterprise license.
Percona Server for MongoDB is an open-source solution that provides features similar to MongoDB's
enterprise-only features, as well as additional storage engines not found in the upstream open-source version.

NEO4J
OVERVIEW
Neo Technology's Neo4j is a popular graph database. In a graph database, like a relational database, there are
entities and relationships between entities. However, instead of focusing on the entities, graph databases
focus on the relationships. There can be thousands or even millions of relationships between entities
represented by a graph database.
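
The relationship-first view can be sketched with a small adjacency structure. This is a toy model under stated assumptions and not Neo4j's storage engine or query language: `relationships`, `relate`, and `friends_of_friends` are hypothetical names, and relationships are treated as directed.

```python
from collections import defaultdict

# Hypothetical store keyed by (entity, relationship type) -> related entities.
relationships = defaultdict(set)


def relate(source, rel_type, target):
    """Record a directed relationship between two entities."""
    relationships[(source, rel_type)].add(target)


def friends_of_friends(person):
    """Follow FRIEND relationships two hops out, excluding direct friends."""
    direct = relationships.get((person, "FRIEND"), set())
    second_hop = set()
    for friend in direct:
        second_hop |= relationships.get((friend, "FRIEND"), set())
    return second_hop - direct - {person}


relate("alice", "FRIEND", "bob")
relate("alice", "FRIEND", "dave")
relate("bob", "FRIEND", "carol")

print(friends_of_friends("alice"))
```

The query walks relationships directly from one entity to the next instead of joining tables, which is the access pattern a graph database is built around.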

Neo4j is ACID compliant.

REASONS TO USE INSTEAD OF AN RDBMS

Relationships between entities are more important than the entities themselves. As a result, the number of relationships dwarfs the number of entities.
Access to the relationships is the principal goal of the database.
Social applications frequently use graph databases to represent complex relationships between users.

LIMITATIONS

Access to entities is limited to data retrieval, with minimal manipulation available.
Scaling beyond a single node is only available in the enterprise version.

SUPPORT OPTIONS
Neo4j is available in a free community edition under GPL v3, but it is limited to running on only one node due
to a lack of clustering. There are also no hot backups.
An enterprise version of Neo4j is available from Neo Technology.
A government edition extends the enterprise edition, adding extra government-specific services.

PYTHIAN CAN HELP


Pythian can help you determine if open-source NoSQL is the right choice for
your business. And if it is, we also offer managed services to provide 24x7
support for MongoDB and Cassandra databases.

We provide custom database solution services in a range of on-premises and
public cloud environments. We offer comprehensive, proactive, expert support
for all your database management requirements, including database design,
capacity planning, installation, health checks, upgrades, performance tuning,
and recovery. In addition, we deliver round-the-clock monitoring, problem
detection and resolution for both NoSQL and traditional RDBMS platforms.

ABOUT THE AUTHORS

John Schulz is passionate about open-source databases and is a trusted
specialist in data modelling. He has evolved in his career, refining and expanding
his knowledge base on MySQL, Cassandra, Couchbase, MongoDB, and much
more. One of John's core assets is his ability to communicate complicated
problems to all levels of technical expertise, from executives to service delivery
personnel, and most importantly, to clients. This culmination of experience and
interpersonal ability has earned him the reputation of being a technical translator.
Contact John at shulz@pythian.com

Derek Downey is the Practice Advocate for the Open-source Database practice
at Pythian, helping to align technical and business objectives for the company and
for our clients. Derek loves automating MySQL, implementing visualization
strategies, and creating repeatable training environments.
Follow Derek on Twitter @derek_downey

ABOUT PYTHIAN
Pythian is a global technology-enabled IT services company that helps businesses compete by
adopting disruptive technologies such as advanced analytics, big data, cloud, databases, DevOps
and infrastructure management to advance innovation and increase agility. Specializing in
designing, implementing, and managing systems that directly contribute to revenue growth and
business success, Pythian's highly skilled technical teams work as an integrated extension of our
clients' organizations to deliver solutions that enable strategic use of data, accelerate software
delivery, and ensure reliable scalable IT systems.

Pythian, The Pythian Group, love your data, pythian.com, and Adminiscope are trademarks of The Pythian Group Inc. Other
product and company names mentioned herein may be trademarks or registered trademarks of their respective owners. The
information presented is subject to change without notice. Copyright 2016. The Pythian Group Inc. All rights reserved.
