Opening A Fabulous World of Cassandra

NoSQL: what's all the buzz about?
http://nosql-database.org/
Next generation databases are: non-relational, distributed, open-source, horizontally scalable.
Often more characteristics apply: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data.
List of NoSQL databases [122+]

Wide Column Store / Column Families: HBase, Cassandra, Hypertable, Cloudata, Cloudera, Amazon SimpleDB
Document Stores: CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB
Key Value / Tuple Store: Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, Keyspace, Berkeley DB, MemcacheDB, Faircom C-Tree, Mnesia, LightCloud, Hibari, HamsterDB, STSdb, Pincaster, RaptorDB
Eventually Consistent Key Value Stores: Amazon Dynamo, Voldemort, Dynomite, KAI
Graph Databases: Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink, Virtuoso, VertexDB, FlockDB
Object Databases: db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, Caching, ZODB, NEO, PicoLisp, Sterling
More and more databases
So what's wrong with relational databases?
Main principles of RDBMS

SQL
ACID
Atomic: all or nothing.
Consistent: data moves from one correct state to another correct state, with no possibility that readers could view different values that don't make sense together.
Isolated: transactions executing concurrently will not become entangled with each other.
Durable: once a transaction has succeeded, the changes will not be lost.
Shortcomings of RDBMS
Transactions under heavy load
Complexities of vertical scaling
Two-phase commit (2PC) protocol
Sharding
"If you can't split it, you can't scale it" (Randy Shoup, distinguished architect, eBay)
Sharding approaches
Feature-based sharding or functional segmentation
Key-based sharding
Lookup table
Shared-nothing or Cassandra-like sharding
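To make the difference between key-based sharding and shared-nothing placement concrete, here is a minimal Python sketch. It is an illustration under assumptions, not Cassandra's partitioner: `stable_hash`, `modulo_shard`, `HashRing`, and `moved_fraction` are hypothetical names, MD5 is used only as a convenient deterministic hash, and each node gets exactly one position on the ring.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def modulo_shard(key: str, shards: list[str]) -> str:
    """Key-based sharding: hash the key and take it modulo the shard count."""
    return shards[stable_hash(key) % len(shards)]


class HashRing:
    """Shared-nothing placement: each node owns one arc of a hash ring."""

    def __init__(self, nodes: list[str]) -> None:
        self._ring = sorted((stable_hash(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        i = bisect.bisect_right(self._tokens, stable_hash(key)) % len(self._ring)
        return self._ring[i][1]


def moved_fraction(before, after, keys) -> float:
    """Share of keys whose owner differs between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)
```

With the ring, adding a node only takes keys away from one neighbour; every other key keeps its owner. With modulo sharding, changing the shard count can reassign keys between any pair of shards. A lookup table would replace both functions with an explicit key-to-shard dictionary that has to be stored and kept available somewhere.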
The real question is not "What's wrong with relational databases?" but rather, "What problem do you have?"
Brewer's CAP Theorem
[Diagram: triangle with the corners Availability, Consistency, Partition Tolerance]
Brewer's CAP Theorem

Availability and Consistency: Relational: MySQL, Oracle, MSSQL
Availability and Partition Tolerance: Amazon Dynamo derivatives: Cassandra, Voldemort, Riak, CouchDB
Consistency and Partition Tolerance: Neo4j, Google Big Table and its derivatives: MongoDB, Redis, Hypertable
Cassandra in 50 words or less
Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.
Cassandra case studies
Cassandra outlines
BASE (Basically Available, Soft-state, Eventual consistency) and not ACID (Atomicity, Consistency, Isolation, Durability)
Distributed and decentralized
Elastic scalability
High availability and fault tolerance
Tunable consistency
Use cases for Cassandra

Large deployments
Lots of writes, statistics and analysis
Geographical distribution
Evolving applications
Writes
No reads
No seeks
Fast
Sequential disk access
Atomic within a column family
Any node
Always writable (hinted hand-off)
0.2 ms
[Diagram: a Write goes to the Commit log and the Memtable; when the Threshold is reached, the Memtable is flushed to an SSTable]
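The write path on this slide can be sketched in a few lines of Python. This is a toy model under assumptions, not Cassandra's storage engine: `ToyStore` and its fields are hypothetical names, the threshold counts keys rather than bytes, and everything lives in memory instead of on disk.

```python
class ToyStore:
    """Toy write path: commit log append, memtable update, flush at a threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.commit_log: list[tuple[str, str]] = []      # append-only, sequential
        self.memtable: dict[str, str] = {}               # in-memory, mutable
        self.sstables: list[list[tuple[str, str]]] = []  # immutable, sorted by key
        self.threshold = threshold

    def write(self, key: str, value: str) -> None:
        # Append to the log first for durability; no read or seek is needed.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        # Write the memtable out as one sorted, immutable SSTable.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # Flushed entries no longer need to be replayed after a crash.
        self.commit_log.clear()
```

A write never touches an existing SSTable, which is why the slide can claim "no reads, no seeks": the only disk work in this model is an append and, occasionally, one sequential flush.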
Reads
Bloom filter (Bf) to determine whether a provided key is in the SSTable
Index (Idx) for quick reads
Any node
Read repair
15 ms
[Diagram: a Read checks the Memtable and the SSTables; each SSTable has a Bloom filter (Bf) and an index (Idx)]
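The role of the Bloom filter and the index on the read path can be illustrated with a small Python sketch. It rests on assumptions: `BloomFilter`, `SSTable`, and `read` are hypothetical names, the filter size and hash count are arbitrary, and the newest table simply wins, whereas a real read would merge columns by timestamp.

```python
import bisect
import hashlib


class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3) -> None:
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key: str):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable sorted table with a Bloom filter and a sorted key index."""

    def __init__(self, rows: dict[str, str]) -> None:
        self.keys = sorted(rows)  # stands in for the index
        self.values = [rows[k] for k in self.keys]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key: str):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the index lookup
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None  # Bloom filter false positive


def read(key: str, memtable: dict[str, str], sstables: list[SSTable]):
    """Check the memtable first, then the SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):
        value = table.get(key)
        if value is not None:
            return value
    return None
```

Because a read may have to consult several SSTables while a write only appends, this model is consistent with the slide's figures of reads being slower than writes.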
The tenets of the column-oriented model

Keyspace: outer container that contains column families (sort of like a relational database)
Column Family: logical division that associates similar data (very roughly analogous to tables in the relational world)
Column: name/value pair (and a client-supplied timestamp of when it was last updated)
Super Column Family: container for super columns sorted by their names
Super Column: structure with a name and a set of dependent columns
Column Family\Column
Column: a name/value pair (also contains a timestamp for conflict resolution on the server side)
column name : byte[]
column value : byte[]
timestamp : long
Column Family: a container for columns sorted by their names. Column Families are referenced and sorted by row keys.
row key -> (column name 1: column value 1, ..., column name n: column value n)
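A column and its timestamp-based conflict resolution can be modelled in Python as follows. This is a sketch under assumptions: `Column` and `ColumnFamilyRow` are hypothetical names, and on equal timestamps the existing value is kept, which is a simplification chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client


class ColumnFamilyRow:
    """Columns of one row key, resolved by timestamp and listed sorted by name."""

    def __init__(self) -> None:
        self._columns: dict[bytes, Column] = {}

    def insert(self, column: Column) -> None:
        current = self._columns.get(column.name)
        # Last write wins: only a strictly newer timestamp replaces the value.
        if current is None or column.timestamp > current.timestamp:
            self._columns[column.name] = column

    def columns(self) -> list[Column]:
        return [self._columns[name] for name in sorted(self._columns)]
```

For example, inserting `Column(b"email", b"new@example.org", 20)` and then `Column(b"email", b"old@example.org", 10)` leaves the value with timestamp 20 in place, regardless of arrival order.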
Super Column Family\Super Column

Super Column: a sorted associative array of columns.
super column name -> (column name 1: column value 1, ..., column name n: column value n)
Super Column Family

A container for super columns sorted by their names. Like Column Families, Super Column Families are referenced and sorted by row keys.
row key -> super column name 1 -> (column name 1: column value 1, ..., column name n1: column value n1)
row key -> super column name m -> (column name 1: column value 1, ..., column name nm: column value nm)
Addressing Column Family

row key -> (column name 1: column value 1, ..., column name n: column value n)
Four-dimensional hash: [Keyspace][ColumnFamily][Key][Column]
Addressing Super Column Family

row key -> super column name 1 -> (column name 1: column value 1, ..., column name n1: column value n1)
row key -> super column name m -> (column name 1: column value 1, ..., column name nm: column value nm)
Five-dimensional hash: [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]
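The four- and five-dimensional hashes map naturally onto nested dictionaries. The Python sketch below only illustrates the addressing scheme; the keyspace, column family, row key, and column names in it are made up for the example, and `get_column` and `get_sub_column` are hypothetical helpers rather than a client API.

```python
# [Keyspace][ColumnFamily][Key][Column] and
# [Keyspace][ColumnFamily][Key][SuperColumn][SubColumn] as nested dicts.
store = {
    "Keyspace1": {
        "Users": {  # column family
            "jsmith": {"email": "jsmith@example.org", "age": "42"},
        },
        "UserPosts": {  # super column family
            "jsmith": {
                "post1": {"title": "Hello", "body": "First post"},
                "post2": {"title": "Again", "body": "Second post"},
            },
        },
    },
}


def get_column(data, keyspace, column_family, key, column):
    """Four-dimensional lookup."""
    return data[keyspace][column_family][key][column]


def get_sub_column(data, keyspace, column_family, key, super_column, sub_column):
    """Five-dimensional lookup: one extra level for the super column."""
    return data[keyspace][column_family][key][super_column][sub_column]


email = get_column(store, "Keyspace1", "Users", "jsmith", "email")
title = get_sub_column(store, "Keyspace1", "UserPosts", "jsmith", "post2", "title")
```

Unlike a plain Python dict, the columns inside a real row are kept sorted by name, which is what makes range queries over column names possible.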
Cassandra client options

Thrift (12 different languages)
Avro (data serialization system)
Java:
  Hector: http://github.com/rantav/hector (abstraction over Thrift)
  Pelops: http://github.com/s7/scale7-pelops (abstraction over Thrift)
  CQL: JDBC driver for Cassandra, starting from version 0.8 (SQL-like language)
  Hector JPA: https://github.com/riptano/hector-jpa (ORM client)
  Cassandrelle: http://demoiselle.sf.net/component/demoiselle-cassandra/ (documentation ???)
  Kundera: http://code.google.com/p/kundera/ (buggy ???)
Python: Pycassa, Telephus
Grails: grails-cassandra
.NET: Aquiles, FluentCassandra
Ruby: Cassandra
PHP: phpcassa, SimpleCassie
Cassandra\RDBMS query differences

No update query
Record-level atomicity on writes
No duplicate keys
Basic write properties: consistency level (ZERO, ANY, ONE, QUORUM, ALL)
Basic read properties: consistency level (ONE, QUORUM, ALL)
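How the consistency levels trade off can be checked with a little arithmetic: a read is guaranteed to see the latest write when the replicas contacted for the write and for the read must overlap, that is when W + R > N for replication factor N. The Python sketch below covers only ONE, QUORUM, and ALL; ZERO and ANY are left out because they do not wait for an acknowledgement from a replica of the row. `replicas_required` and `overlap_guaranteed` are hypothetical helper names.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond at a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a strict majority
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def overlap_guaranteed(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor
```

With a replication factor of 3, QUORUM writes plus QUORUM reads need 2 + 2 = 4 > 3 replicas and therefore overlap, while ONE plus ONE (1 + 1 = 2) does not and relies on eventual consistency.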
Integrating Hadoop
Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large amounts of data in a distributed way.
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
Cassandra: a scalable multi-master database with no single points of failure.
Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: a scalable machine learning and data mining library.
Pig: a high-level data-flow language and execution framework for parallel computation.
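For readers new to MapReduce, the programming model can be shown in a single Python process. This sketch does not use the Hadoop API and nothing in it is distributed; `map_phase`, `reduce_phase`, and `run_job` are hypothetical names, and word counting is just the customary example.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_phase(word: str, counts: list[int]) -> tuple[str, int]:
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines: list[str]) -> dict[str, int]:
    # Shuffle: group the mapped values by key before reducing.
    grouped: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; only the shape of the two functions carries over from this sketch.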
The end
Questions?
