Beruflich Dokumente
Kultur Dokumente
http://nosql-database.org/
Next generation databases are: Non-relational, Distributed, Open-source, Horizontal scalable Often more characteristics: Schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount
Shortcomings of RDBMS
Transactions under heavy load Complexities of vertical scaling 2 phase commit (2PC) protocol
Sharding
If you cant split it, you cant scale it (Randy Shoup, distinguished architect, eBay) Sharging approach
Feature-based shard or functional segmentation Key-based sharding Lookup table
The real question is not Whats wrong with relational databases? but rather, What problem do you have?
Availability
Consistency
Partition Tolerance
Consistency
Neo4j, Google Big Table and its derivatives: MongoDB, Redis, Hypertable
Partition Tolerance
in 50 words or less
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazons Dynamo and its data model on Googles Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.
Cassandra outlines
BASE (Basically Available Soft-state Eventual consistency) and not ACID (Atomicity, Consistency, Isolation, Durability) Distributed and decentralized Elastic scalability High availability and fault tolerance Tunable consistency
Writes
Memtable No reads No seeks Fast Sequential disk access Atomic within a column family Any node Always writable (hinted hand-off) 0.2 ms
Commit log
Threshold
Write
SSTable
SSTable
Reads
Memtable Read Bloomfilter field to determine whether a provided key is in the SSTable Index field for quick read Any node Read repair 15 ms
Bf
Idx
Bf
Idx
SSTable
SSTable
Column Family\Column
Column A name value pair (contains also a time-stamp for conflict resolution on the server side)
column name : byte[] column value : byte[]
+ timestamp : long
Column Family A container for columns sorted by their names. Column Families are referenced and sorted by row keys.
column name 1 column value 1 column name n column value n
row key
column value 1
column value n
column value nm
row key
column value nm
Integrating
Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large amounts of data in a distributed way. Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data. Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters. Other Hadoop-related projects at Apache include: Cassandra: a scalable multi-master database with no single points of failure. Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying. Mahout: a Scalable machine learning and data mining library. Pig: a high-level data-flow language and execution framework for parallel computation.
The end
Questions?