August 2017
IoT is a New Database Workload
With the rise of IoT, we are entering the era of things data. In it, IoT applications process data
generated by millions of sensors. Data is analyzed in real time to monitor and control the
connected vehicles we drive, the machinery we operate, and the smart cities we inhabit.
Gartner Research suggests that IoT will pose new data volume, query complexity, and
integration challenges. And the TPC, the independent standards-setter for DBMS benchmarks,
is defining a new mixed workload benchmark for IoT.
At Crate.io, we engineered CrateDB to process IoT data. By building a distributed SQL engine
on a NoSQL storage and clustering foundation, we've made it easy and economical for
mainstream developers to meet data requirements like these:
Ingest millions of data points per second - sensor or GPS readings, network messages,
logs...
Query data in real-time
Handle a wide variety of data structures
Execute complex queries such as time series, geospatial, text search, and machine
learning
Process data at the edge and in the cloud
Clickdrive.io collects GPS and system data from fleet vehicles in order to
provide real-time location tracking and to inform dispatchers that repairs are
needed. As a result, Clickdrive has helped its customers reduce breakdowns
and accidents, and lower fleet maintenance costs by 20%.
Skyhigh Networks analyzes cloud network traffic in real time for nearly half of
the Fortune to help keep them safe from cyber-security threats. They manage
over 80TB of data, and replacing MySQL & Elasticsearch with CrateDB
reduced their database hosting costs by 75%.
CrateDB is compatible with much of the ANSI SQL 92 standard. It supports joins, aggregations,
indexes, BLOBs, sub-queries, user-defined functions, and so on. We juiced our SQL up with
some nice things commonly found with NoSQL, like full-text search, geospatial queries, and
nested JSON object columns.
CrateDB uses bits of the following open source projects to form its physical foundation:
Lucene - storage and indexing, including text search and geospatial
Elasticsearch - masterless clustering and transaction logging
Netty - asynchronous, event-driven, full-mesh networking between nodes
CrateDB is packaged into a single binary, which is simple to install and start.
Access to scaling and replication features is simple, via SQL. CREATE TABLE supports
additional storage and table parameters for sharding, replication and routing of the data. In the
example below, a table to hold sensor readings is partitioned by week; queries will execute on
relevant partitions only, which speeds up performance. And partitions can be dropped, which
makes data deletion or archival of aging data easier.
The example also creates shards, which contain subsets of the table data and are distributed
across the cluster. A rule of thumb is to have as many shards as there are CPUs in the cluster;
CrateDB will parallelize query execution across all of the shards for maximum throughput.
Replicas create redundant copies of the data, which are also distributed across the cluster for
high availability and query throughput.
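
A table like the one described above might be declared roughly as follows. This is a minimal sketch in Python that only assembles the DDL string. The table and column names (sensor_readings, sensor_id, ts, week, reading) and the helpers shard_count and sensor_table_ddl are hypothetical, and the clause syntax should be checked against the CrateDB documentation for the version in use.

```python
def shard_count(nodes: int, cpus_per_node: int) -> int:
    """Rule of thumb from the text: one shard per CPU in the cluster."""
    return nodes * cpus_per_node


def sensor_table_ddl(shards: int, replicas: int) -> str:
    """Build a CREATE TABLE statement for weekly-partitioned sensor readings.

    All identifiers are illustrative assumptions; verify the clause syntax
    against the CrateDB documentation before running the statement.
    """
    return (
        "CREATE TABLE sensor_readings (\n"
        "    sensor_id STRING,\n"
        "    ts TIMESTAMP,\n"
        # Generated column: the week each reading falls into.
        "    week TIMESTAMP GENERATED ALWAYS AS date_trunc('week', ts),\n"
        "    reading DOUBLE\n"
        f") CLUSTERED INTO {shards} SHARDS\n"
        "PARTITIONED BY (week)\n"
        f"WITH (number_of_replicas = {replicas});"
    )


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes with 4 CPUs each.
    print(sensor_table_ddl(shard_count(3, 4), replicas=1))
```

Partitioning on the generated week column lets a query that filters on a time range touch only the matching partitions, while the shard and replica counts control how the data is spread across nodes.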
CrateDB distributes shards and replicas intelligently and automatically. This helps avoid
performance bottlenecks and ensures that the database will continue to operate reliably, even if
node hardware failures occur.
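When an INSERT statement contains a column that is not in the table's schema, CrateDB can
respond in one of two ways:
a) Enforce the original schema by rejecting the INSERT and throwing an error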
b) Dynamically update the schema by adding the new column found in the INSERT
statement.
Internally, each relational record in CrateDB is actually stored as a JSON document, and those
can change structure on the fly. This gives CrateDB the flexibility to handle evolving data
structures.
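
The two schema behaviors can be illustrated with a toy model. This is a sketch under stated assumptions, not CrateDB's implementation: the ToyTable class is hypothetical, and its column_policy values ('strict' and 'dynamic') are chosen to mirror the CrateDB table parameter of the same name.

```python
import json


class ToyTable:
    """Toy model of a table whose rows are stored as JSON documents."""

    def __init__(self, columns, column_policy="strict"):
        if column_policy not in ("strict", "dynamic"):
            raise ValueError("column_policy must be 'strict' or 'dynamic'")
        self.columns = set(columns)
        self.column_policy = column_policy
        self.rows = []  # each row is kept as a JSON string

    def insert(self, record: dict) -> None:
        unknown = set(record) - self.columns
        if unknown:
            if self.column_policy == "strict":
                # Option (a): enforce the original schema and reject the row.
                raise KeyError(f"unknown column(s): {sorted(unknown)}")
            # Option (b): extend the schema with the new column(s).
            self.columns |= unknown
        self.rows.append(json.dumps(record))


if __name__ == "__main__":
    table = ToyTable(["sensor_id", "reading"], column_policy="dynamic")
    table.insert({"sensor_id": "s1", "reading": 21.5, "humidity": 40})
    print(sorted(table.columns))
```

Because each row is just a document, adding a column in the dynamic case does not require rewriting the rows that were stored earlier.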
For example: a global packaging manufacturer collects data from 900 different types of sensors
on each of its production lines. In SQL Server, they stored that data in 900 different tables, one
per sensor type. After moving to CrateDB, they stored all the readings in just one table. Much
simpler. And queries executed 40 times faster.
Distributed query processing also contributes to fast performance, as does a query planner that
makes smart decisions about which nodes are best suited to finalize the processing of
aggregations and joins.
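
The general idea behind distributed aggregation can be sketched as follows: each shard computes a small partial state locally, and one node merges those states into the final answer. This illustrates the generic scatter/gather technique only; the names PartialAvg, aggregate_shard, and distributed_avg are hypothetical and do not describe CrateDB's actual planner or execution engine.

```python
from dataclasses import dataclass


@dataclass
class PartialAvg:
    """Mergeable state for an average: a running sum and a row count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "PartialAvg") -> "PartialAvg":
        return PartialAvg(self.total + other.total, self.count + other.count)

    def result(self) -> float:
        if self.count == 0:
            raise ValueError("no rows to average")
        return self.total / self.count


def aggregate_shard(readings) -> PartialAvg:
    """Runs where the shard lives; returns a small partial state."""
    return PartialAvg(sum(readings), len(readings))


def distributed_avg(shards) -> float:
    """Merge the per-shard partial states on a single finalizing node."""
    merged = PartialAvg()
    for shard in shards:
        merged = merged.merge(aggregate_shard(shard))
    return merged.result()


if __name__ == "__main__":
    shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
    print(distributed_avg(shards))
```

Shipping a sum and a count from each shard, rather than a per-shard average, keeps the result exact when shards hold different numbers of rows, and it means only a few bytes per shard travel over the network.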
CrateDB can provide up to 33x better price-performance than traditional SQL databases when
executing complex time series and text search queries.
Experiences might differ based on use case, but here's how CrateDB generally compares to
other database categories for IoT workloads:
                     Traditional SQL    NoSQL    CrateDB
Fire hose of data    No                 Yes      Yes
Next Steps...
CrateDB is freely available under the Apache 2 open source license, or with a commercial
license for the CrateDB Enterprise Edition.