
CrateDB: The SQL DB for Machine Data

Designing a Real-time SQL DBMS for the Things Data Era

August 2017

IoT is a New Database Workload
With the rise of IoT, we are entering the era of things data. In it, IoT applications process data
generated by millions of sensors. Data is analyzed in real time to monitor and control the
connected vehicles we drive, the machinery we operate, and smart-cities we inhabit.

Gartner Research suggests that IoT will pose new data volume, query complexity, and
integration challenges. And the TPC, the independent standards-setter for DBMS benchmarks,
is defining a new mixed workload benchmark for IoT.

At Crate.io, we engineered CrateDB to process IoT data. By building a distributed SQL engine
on a NoSQL storage and clustering foundation, we've made it easy and economical for
mainstream developers to meet data requirements like these:

- Ingest millions of data points per second: sensor or GPS readings, network messages, logs...
- Query data in real time
- Handle a wide variety of data structures
- Execute complex queries such as time series, geospatial, text search, and machine learning
- Process data at the edge and in the cloud

CrateDB is a unique combination of SQL, NoSQL, and Container technology

2017 Crate.io, Inc. CrateDB for IoT - Technical Overview 1


CrateDB - Machine Data Customer Use Cases
Over 75% of CrateDB customers use it to manage machine-generated data in systems such as:
- Industrial IoT
- Connected cities and buildings
- Vehicle fleet tracking & management
- Network & IT security monitoring

Here are some examples of typical CrateDB customer projects...

Alpla, a $4B global manufacturer of packaging products, uses CrateDB to process
data from thousands of different sensors on each of its 1000+ production lines. The
data provides real-time insights that enable operators to optimize machinery
efficiency and reduce defects, downtime, and raw material waste.

Zumtobel, a $2B global producer of intelligent lighting systems, uses CrateDB to
monitor and control smart lighting in large retail chains and buildings. They migrated
from MySQL to CrateDB to enable better scaling and performance in a system that
provides real-time monitoring of system status, lighting & sensor outages, and energy
consumption.

Clickdrive.io collects GPS and system data from fleet vehicles in order to
provide real-time location tracking and to inform dispatchers that repairs are
needed. As a result, Clickdrive has helped its customers reduce breakdowns
and accidents, and lower fleet maintenance costs by 20%.

Skyhigh Networks analyzes cloud network traffic in real time for nearly half of
the Fortune to help keep them safe from cyber-security threats. They manage
over 80TB of data, and replacing MySQL & Elasticsearch with CrateDB
reduced their database hosting costs by 75%.



Designing a DBMS for the IoT Era - the CrateDB Architecture
CrateDB was started in 2014 to make database development and scaling simple. It was one of
the first databases to combine the familiarity of SQL with the scalability and data flexibility of
NoSQL. These are the design choices we made to build a database for the IoT era:

Architecture: Distributed, shared-nothing, container-native


CrateDB operates in a shared-nothing architecture as a cluster of identically configured
servers (nodes). The nodes coordinate seamlessly with each other, and the execution of write
and query operations is automatically distributed across the nodes in the cluster.

Increasing or decreasing database capacity is a simple matter of adding or removing nodes.
We worked hard on the simple part by automating the sharding, replication (for fault
tolerance), and rebalancing of data as the cluster changes size.

CrateDB was born in the container era and allows you to scale and administer it easily via
container orchestration platforms like Docker or Kubernetes in a microservices environment.



Access: SQL via PostgreSQL wire protocol, JDBC, ODBC, REST...
We chose SQL as the data access language to make CrateDB easy for mainstream developers
to adopt. Everyone knows SQL; it's powerful, and it makes integration easy. CrateDB is
compatible with most SQL tools, interfacing via the PostgreSQL wire protocol, JDBC, ODBC,
and a REST interface.
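
As a minimal sketch of the REST interface, the Python snippet below builds an HTTP request for the CrateDB `/_sql` endpoint using only the standard library. The host and port (`localhost:4200`), the table `t1` with its columns, and the helper name `build_sql_request` are assumptions for illustration. The request is only constructed here; sending it requires a running node.

```python
import json
from urllib.request import Request


def build_sql_request(stmt, args=(), base_url="http://localhost:4200"):
    """Build (but do not send) a POST request for the HTTP /_sql endpoint.

    base_url is an assumed local node address, not a value from this paper.
    """
    body = json.dumps({"stmt": stmt, "args": list(args)}).encode("utf-8")
    return Request(
        url=f"{base_url}/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Parameterized query against a hypothetical table t1.
request = build_sql_request(
    "SELECT sensor_id, v1 FROM t1 WHERE tenant_id = ? LIMIT 10",
    args=[42],
)

# To execute against a running node (assumed reachable at base_url):
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       result = json.load(response)
```

Keeping the statement and its arguments separate, rather than formatting values into the SQL string, avoids quoting problems and SQL injection.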

CrateDB is compatible with much of the ANSI SQL-92 standard. It supports joins, aggregations,
indexes, BLOBs, sub-queries, user-defined functions, and so on. We juiced our SQL up with
some nice things commonly found with NoSQL, like full-text search, geospatial queries, and
nested JSON object columns.

Open machine data stack


Another benefit of SQL is ease of integration. With CrateDB, you are free to choose your own
machine data stack rather than being locked into ETL and visualization and reporting tools
written for specific NoSQL engines like Splunk, Elasticsearch, or InfluxDB. CrateDB can be
accessed via SQL from most new and legacy ETL tools, BI and reporting tools, programming
frameworks, and so on.

Other machine data interfaces


CrateDB supports other access interfaces that are common in IoT and machine data:
- MQTT: CrateDB (Enterprise Edition) embeds an MQTT broker, which enables it to
  subscribe to and receive MQTT messages, parse them, and store them in a table. This
  simplifies application architectures by eliminating the need for message queueing
  middleware.
- Telegraf interface: CrateDB is a Telegraf target, which makes it easy to route time
  series data from various Telegraf-supported source systems into CrateDB.
- Prometheus remote reader/writer: Enables the Prometheus time series database to
  pass data and queries through to CrateDB for processing larger volumes of data or
  performing more complex analyses. It makes it easy to scale up software systems (such
  as Docker) that support Prometheus as an endpoint for time series metrics.
- Grafana: A Grafana plugin makes it easy to visualize and interact with time series data
  from CrateDB.
- Apache Kafka, Spark, Node-RED, StreamSets, et al.: CrateDB fits well into the IoT
  ecosystem, with customers using tools like Kafka and Spark and many others to build
  scalable, fault-tolerant systems. Contact Crate.io if you have questions about specific
  integrations.



Storage & Indexing: NoSQL-style
CrateDB was one of the first databases to combine the familiarity of SQL with the scalability and
data flexibility of NoSQL. This was accomplished by building a distributed SQL engine on a
foundation of our own and other open source NoSQL technologies instead of using traditional
relational DBMS techniques.

CrateDB uses bits of the following open source projects to form its physical foundation:
- Lucene: storage and indexing, including text search and geospatial
- Elasticsearch: masterless clustering and transaction logging
- Netty: asynchronous, event-driven, full-mesh networking between nodes

CrateDB is packaged into a single binary, which is simple to install and start.

Access to scaling and replication features is simple, via SQL. CREATE TABLE supports
additional storage and table parameters for sharding, replication and routing of the data. In the
example below, a table to hold sensor readings is partitioned by week; queries will execute on
relevant partitions only, which speeds up performance. And partitions can be dropped, which
makes data deletion or archival of aging data easier.

The example also creates shards, which contain subsets of the table data and are distributed
across the cluster. A rule of thumb is to have as many shards as there are CPUs in the cluster;
CrateDB will parallelize query execution across all of the shards for maximum throughput.
Replicas create redundant copies of the data, which are also distributed across the cluster for
high availability and query throughput.

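CREATE TABLE IF NOT EXISTS t1 (
    "ts" TIMESTAMP,
    "tenant_id" INTEGER,
    "sensor_id" STRING,
    "v1" INTEGER,
    "v3" FLOAT,
    "v5" BOOLEAN,
    -- generated column: start of the week that contains ts
    "week_generated" TIMESTAMP GENERATED ALWAYS AS date_trunc('week', ts)
)
PARTITIONED BY ("week_generated")           -- one partition per week
CLUSTERED BY ("tenant_id") INTO 3 SHARDS    -- rows are routed to shards by tenant_id
WITH (number_of_replicas = 2);              -- two redundant copies of each shard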

CrateDB distributes shards and replicas intelligently and automatically. This helps avoid
performance bottlenecks and ensures that the database will continue to operate reliably, even if
node hardware failures occur.
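
To make partitioning and routing concrete, here is an illustrative Python sketch of how a single reading could be mapped to a weekly partition and to one of the three shards. It is a simplified model, not CrateDB's implementation: the CRC32 hash is a stand-in for the internal routing hash, weeks are assumed to start on Monday at 00:00 UTC, and the function names are hypothetical.

```python
import zlib
from datetime import datetime, timedelta, timezone

NUM_SHARDS = 3  # mirrors INTO 3 SHARDS in the table definition


def week_partition(ts):
    """Truncate a timestamp to the start of its week (Monday 00:00 UTC assumed)."""
    day = ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return day - timedelta(days=day.weekday())


def shard_for(routing_value, num_shards=NUM_SHARDS):
    """Pick a shard from the routing column value.

    CRC32 is a stand-in hash for illustration, not the hash CrateDB uses.
    """
    return zlib.crc32(str(routing_value).encode("utf-8")) % num_shards


reading = {"ts": datetime(2017, 8, 16, 13, 45, tzinfo=timezone.utc), "tenant_id": 7}
partition = week_partition(reading["ts"])   # week starting Monday 2017-08-14
shard = shard_for(reading["tenant_id"])     # same tenant always maps to the same shard
```

In this model, all rows for one tenant land in the same shard of a given partition, and a filter on the week column narrows a query to the matching partitions.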



Schema: Dynamic
Another benefit of the CrateDB SQL-NoSQL architecture is schema flexibility. Traditional
relational schemas are rigid and changing them is a pain. As you saw before, tables are defined
using the CREATE TABLE statement. If an INSERT statement includes a column that wasn't
defined in the table, CrateDB can be configured to either:

a) Enforce the original schema by rejecting the INSERT and throwing an error
b) Dynamically update the schema by adding the new column found in the INSERT
statement.

Internally, each relational record in CrateDB is actually stored as a JSON document, and those
can change structure on the fly. This gives CrateDB the flexibility to handle evolving data
structures.
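
The two behaviors can be sketched with a toy in-memory table in Python. This models the idea only; the class name `Table` and the policy labels `strict` and `dynamic` are chosen for illustration and do not represent CrateDB internals.

```python
class Table:
    """Toy table with a 'strict' or 'dynamic' column policy (illustrative only)."""

    def __init__(self, columns, policy="dynamic"):
        if policy not in ("strict", "dynamic"):
            raise ValueError(f"unknown policy: {policy}")
        self.columns = set(columns)
        self.policy = policy
        self.rows = []

    def insert(self, row):
        unknown = set(row) - self.columns
        if unknown and self.policy == "strict":
            # Option (a): enforce the original schema and reject the row.
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        # Option (b): extend the schema with any new columns, then store the row.
        self.columns |= unknown
        self.rows.append(dict(row))


strict = Table(["ts", "sensor_id"], policy="strict")
dynamic = Table(["ts", "sensor_id"], policy="dynamic")

# The dynamic table accepts the row and gains a "humidity" column.
dynamic.insert({"ts": 1, "sensor_id": "a", "humidity": 0.4})
```

Inserting the same row into `strict` raises an error and leaves the table unchanged.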

For example: a global packaging manufacturer collects data from 900 different types of sensors
on each of its production lines. In SQL Server, they stored that data in 900 different tables, one
per sensor type. After moving to CrateDB, they stored all the readings in just one table. Much
simpler. And queries executed 40 times faster.

Writing: High Velocity INSERTs


IoT systems ingest streams of machine-generated data. We decided on an eventually
consistent, non-blocking, data insertion model. This allows CrateDB to insert tens of thousands
of data points per second per node, while querying the data at the same time.

The CrateDB distributed architecture provides linearly scalable INSERT performance. As the
customer benchmark here shows, CrateDB provided superior ingestion versus other
distributed databases as the customer increased the number of threads concurrently
connecting to CrateDB.

Data durability and consistency are also important, and we took steps to address those
with as little impact on performance as possible. To ensure data durability, we implemented
write-ahead logging. For consistency, CrateDB includes record versioning, optimistic
concurrency control, and a table-level refresh frequency setting, which forces CrateDB data to
become consistent on a periodic basis (every n milliseconds).
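
Record versioning and optimistic concurrency control work together roughly as in the Python sketch below. It is a generic model of the technique under simplifying assumptions (a single in-memory store, hypothetical class and method names), not a description of CrateDB's storage layer.

```python
class VersionConflict(Exception):
    """Raised when a record changed since the writer last read it."""


class VersionedStore:
    """Toy key-value store with a per-record version counter."""

    def __init__(self):
        self._records = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value) for a key."""
        return self._records[key]

    def write(self, key, value, expected_version=None):
        """Write unconditionally, or only if the stored version matches."""
        current_version = self._records.get(key, (0, None))[0]
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        self._records[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
store.write("sensor-1", {"v1": 10})             # creates version 1
version, _ = store.read("sensor-1")             # two writers both read version 1
store.write("sensor-1", {"v1": 11}, version)    # first writer succeeds, version 2
try:
    store.write("sensor-1", {"v1": 12}, version)  # second writer holds a stale version
    conflict = False
except VersionConflict:
    conflict = True
```

No locks are held between read and write. The stale writer detects the conflict at write time and can re-read and retry, which keeps writes non-blocking.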



Querying: Real-time via in-memory columnar indexing
Real-time databases usually require all data to fit in main memory, but that limits how much data
you can manage. Our solution for real-time performance without data volume limitations was to
implement memory-resident columnar field caches on each node. The caches tell the query
engine whether there are rows on that node that meet the query criteria and where the rows are
located; this is all performed at in-memory speed.
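
A rough Python sketch of the columnar idea: values are held per field rather than per row, so a filter scans a single in-memory column and returns the positions of matching rows. The class `ColumnCache` and its layout are assumptions for illustration and do not reflect the actual cache format.

```python
class ColumnCache:
    """Toy column store: one in-memory list per field, indexed by row position."""

    def __init__(self, rows, fields):
        self.columns = {field: [row[field] for row in rows] for field in fields}

    def matching_rows(self, field, predicate):
        """Return positions of rows whose value in `field` satisfies `predicate`."""
        return [
            position
            for position, value in enumerate(self.columns[field])
            if predicate(value)
        ]


rows = [
    {"sensor_id": "a", "v1": 10},
    {"sensor_id": "b", "v1": 55},
    {"sensor_id": "a", "v1": 70},
]
cache = ColumnCache(rows, ["sensor_id", "v1"])

hits = cache.matching_rows("v1", lambda value: value > 50)       # positions 1 and 2
no_hits = cache.matching_rows("v1", lambda value: value > 100)   # empty: nothing to fetch
```

An empty result tells the query engine that this node holds no qualifying rows, so no row data needs to be fetched from it.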

Distributed query processing also contributes to fast performance, as does a query planner that
makes smart decisions about which nodes are best suited to finalize processing of aggregations
and joins.
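
Distributed aggregation can be pictured as a two-phase computation: each node reduces its local rows to a small partial state, and one node merges those states and finalizes the result. The Python sketch below shows this for an average per sensor. The function names and the sample data are hypothetical, and the real planner makes more decisions than this model captures.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Per-node step: partial sum and count of v1 for each sensor_id."""
    partial = defaultdict(lambda: [0, 0])
    for row in rows:
        state = partial[row["sensor_id"]]
        state[0] += row["v1"]
        state[1] += 1
    return dict(partial)


def merge_partials(partials):
    """Finalizing step: combine partial states and compute the averages."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


# Rows as they might be spread across two nodes.
node_a = [{"sensor_id": "s1", "v1": 10}, {"sensor_id": "s2", "v1": 4}]
node_b = [{"sensor_id": "s1", "v1": 30}]

averages = merge_partials([partial_aggregate(node_a), partial_aggregate(node_b)])
```

Only the compact (sum, count) pairs travel between nodes, not the raw rows, which is why pushing the first phase to the data keeps network traffic low.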

Benchmarks show the CrateDB query architecture to be linearly scalable. They also show up to
33x better price-performance than traditional SQL databases when executing complex time
series and text search queries.



Benchmarks also show 10x higher time series query throughput under load than specialized time
series databases like InfluxDB.

Platform: Java, run at the edge or in the cloud


IoT data processing is often distributed, from cloud data centers to remote sites and even onto
devices. DBMS portability makes cloud and edge architectures easier to implement, so we
wrote CrateDB in Java. Thus, CrateDB can run anywhere a JVM runs: in the data center, or at
remote sites when internet latency is intolerable or when data needs to be aggregated
before being pipelined to a central cloud instance for wider-scale processing.



How does CrateDB Compare?
The CrateDB architecture combines the familiarity of SQL with the scalability and data flexibility
of NoSQL. CrateDB is oriented towards mixed analytic workloads with heavy querying and
ingestion.

Experiences might differ based on use case, but here's how CrateDB generally compares to
other database categories for IoT workloads:

                                            Traditional SQL   NoSQL   CrateDB
Fire hose of data                           No                Yes     Yes
Query versatility & real-time performance   No                No      Yes
Data versatility                            No                Yes     Yes
SQL access                                  Yes               No      Yes
Simple scalability                          No                Yes     Yes

Next Steps...
CrateDB is freely available under the Apache 2 open source license, or with a commercial
license for the CrateDB Enterprise Edition.

You can download CrateDB and find other resources at crate.io.

