A MySQL Whitepaper
Table of Contents

Introduction
Conclusion
Additional Resources
Introduction
Today the terms Big Data and Internet of Things draw a lot of attention, but behind the hype
there's a simple story. For decades, companies have been making business decisions based on
traditional enterprise data. Beyond that critical data, however, is a potential treasure trove of
additional data: weblogs, social media, email, sensors, photographs and much more that can be
mined for useful information. Decreases in the cost of both storage and compute power have made
it feasible to collect this data - which would have been thrown away only a few years ago. As a
result, more and more organizations are looking to include non-traditional yet potentially very
valuable data with their traditional enterprise data in their business intelligence analysis.
As the world's most popular open source database, and the leading open source database for
Web-based and Cloud-based applications, MySQL is a key component of numerous big data
platforms. This whitepaper explores how you can unlock extremely valuable insights using MySQL
with the Hadoop platform.
- Machine-generated/sensor data includes Call Detail Records (CDRs), weblogs, smart meters, manufacturing sensors, equipment logs (often referred to as digital exhaust) and trading systems data.

- Social data includes customer feedback streams, micro-blogging sites like Twitter, and social media platforms like Facebook.
The McKinsey Global Institute estimates that data volume is growing 40% per year[1]. But while it's often the most visible parameter, the volume of data is not the only characteristic that matters. We often refer to the "Vs" defining big data:

- Velocity. Social media data streams, while not as massive as machine-generated data, produce a large influx of opinions and relationships valuable to customer relationship management. Even at 140 characters per tweet, the high velocity (or frequency) of Twitter data ensures large volumes.

- Variety. Traditional data formats tend to be relatively well defined by a data schema and change slowly. In contrast, non-traditional data formats exhibit a dizzying rate of change. As new services are added, new sensors deployed, or new marketing campaigns executed, new data types are needed to capture the resultant information.

[1] Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011.
Typical big data use cases include:

- Sentiment analysis
- Marketing campaign analysis
- Customer churn modeling
- Fraud detection
- Research and development
- Risk modeling
- And more
Gartner estimates the total economic value-add from the Internet of Things across industries will reach US$1.9 trillion worldwide in 2020[2].
For example, just a few years from now, your morning routine might be a little different thanks to
Internet of Things technology. Your alarm goes off earlier than usual because your home smart
hub has detected traffic conditions suggesting an unusually slow commute. The weather sensor
warns of a continued high pollen count, so because of your allergies, you decide to wear your suit
with the sensors that track air quality and alert you to allergens that could trigger an attack.
You have time to check your messages at the kitchen e-screen. The test results from your recent
medical checkup are in, and there's a message from your doctor that reiterates his
recommendations for a healthier diet. You send this information on to your home smart hub. It
automatically displays a chart comparing your results with those of the general population in your
age range, and asks you to confirm the change to healthier options on your online grocery order.
The e-screen on the refrigerator door suggests yogurt and fresh fruit for breakfast.
Major Advances in Machine-to-Machine Interactions Mean Incredible Changes
The general understanding of how things work on the internet follows a familiar pattern: humans connect through a browser to get the information they need or perform the action they want.
The Internet of Things changes that model. In the Internet of Things, things talk to things, and
processes have two-way interconnectivity so they can interoperate both locally and globally.
Decisions can be made according to predetermined rules, and the resulting actions happen
automatically without the need for human intervention. These new interactions are driving
tremendous opportunities for new services.
[2] Peter Middleton, Peter Kjeldsen, and Jim Tully, Forecast: The Internet of Things, Worldwide, 2013 (G00259115), Gartner, Inc., November 18, 2013.
Oracle delivers an integrated, secure, comprehensive platform for the entire IoT architecture across all vertical markets. For more information on Oracle's Internet of Things platform, visit:
http://www.oracle.com/us/solutions/internetofthings/overview/index.html
We shall now consider the lifecycle of Big Data, and how to leverage the Hadoop platform to derive
added value from data acquired in MySQL solutions.
Acquire: Through NoSQL APIs, MySQL is able to ingest high volume, high velocity data, without
sacrificing ACID guarantees, thereby ensuring data quality. Real-time analytics can also be run
against newly acquired data, enabling immediate business insight, before data is loaded into
Hadoop. In addition, sensitive data can be pre-processed - for example, healthcare or financial services records can be anonymized - before transfer to Hadoop.
Organize: Data can be transferred in batches from MySQL tables to Hadoop using Apache Sqoop or the MySQL Hadoop Applier. With the Applier, users can also invoke real-time change data capture processes to stream new data from MySQL to HDFS as transactions are committed.
Analyze: Multi-structured data ingested from multiple sources is consolidated and processed within
the Hadoop platform.
Decide: The results of the analysis are loaded back to MySQL via Apache Sqoop where they
power real-time operational processes or provide analytics for BI tools.
Each of these stages and the associated technologies are discussed below.
Native Memcached API access is available for MySQL 5.6 and MySQL Cluster. By using this ubiquitous API for writing and reading data, developers can preserve their investments in Memcached infrastructure by re-using existing Memcached clients, while also eliminating the need for application changes.
As discussed later, MySQL Cluster also offers additional NoSQL APIs including Node.js, Java,
JPA, HTTP/REST and C++.
[Figure: Memcached plug-in architecture - Memcached protocol requests pass through the innodb_memcached plug-in (with an optional local cache) to the InnoDB API, while SQL requests go through the MySQL Server and Handler API to the same InnoDB data.]
With the Memcached code running in the same process space, users can insert and query data at
high speed. With simultaneous SQL access, users can maintain all the advanced functionality
offered by InnoDB including support for crash-safe transactional storage, Foreign Keys, complex
JOIN operations, etc.
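
To make this dual-interface pattern concrete, here is a minimal sketch in Python. It assumes the innodb_memcached plug-in is installed with its default demo container (which maps keys to column c1 and values to column c2 of the test.demo_test table) and that the third-party pymemcache and mysql-connector-python packages are available; host names and credentials are placeholders.

```python
# Sketch: write via the Memcached protocol, read the same row via SQL.
# Assumes the innodb_memcached plug-in is listening on its default port
# (11211) with the default demo container (test.demo_test); adjust the
# hosts and credentials for your environment.
from pymemcache.client.base import Client
import mysql.connector

# 1. Insert a key/value pair through the Memcached API (no SQL parsing).
memc = Client(("mysql-host.example.com", 11211))
memc.set("user:42:last_login", b"2015-06-01T08:15:00Z")

# 2. Read the same data back through the SQL interface. In the default
#    demo container, the key is stored in column c1 and the value in c2.
conn = mysql.connector.connect(
    host="mysql-host.example.com", user="app", password="secret", database="test"
)
cursor = conn.cursor()
cursor.execute("SELECT c2 FROM demo_test WHERE c1 = %s", ("user:42:last_login",))
print(cursor.fetchone())
conn.close()
```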
Benchmarks demonstrate that the NoSQL Memcached API for InnoDB delivers up to 9x higher performance than the SQL interface when inserting new key/value pairs, with a single low-end commodity server[3] supporting nearly 70,000 transactions per second.
[Figure: Benchmark results - transactions per second for the Memcached API vs. the SQL interface as client connections scale from 8 to 512.]
The delivered performance demonstrates that MySQL with the native Memcached NoSQL interface is well suited for high-speed inserts, with the added assurance of transactional guarantees.
MySQL Cluster
MySQL Cluster has many attributes that make it ideal for new generations of high volume, high
velocity applications that acquire data at high speed, including:
[3] The benchmark was run on an 8-core Intel server configured with 16GB of memory and the Oracle Linux operating system.
As MySQL Cluster stores tables in network-distributed data nodes, rather than in the MySQL
Server, there are multiple interfaces available to access the database.
The chart below shows all of the access methods available to the developer. The native API for
MySQL Cluster is the C++ based NDB API. All other interfaces access the data through the NDB
API.
At the extreme left hand side of the chart, an application has embedded the NDB API library
enabling it to make native C++ calls to the database, and therefore delivering the lowest possible
latency.
On the extreme right hand side of the chart, MySQL presents a standard SQL interface to the data
nodes, providing connectivity to all of the standard MySQL drivers.
[Figure: MySQL Cluster access methods - native memcached and JavaScript NoSQL interfaces alongside SQL via JDBC/ODBC, PHP/Perl and Python/Ruby drivers, all layered on the C++ NDB API.]
Whichever API is used to insert or query data, it is important to emphasize that all of these SQL
and NoSQL access methods can be used simultaneously, across the same data set, to provide the
ultimate in developer flexibility.
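
As a small illustration of this flexibility, the following sketch creates and populates a table in the NDB storage engine through the standard SQL interface; once created, the same rows are reachable through the memcached, JavaScript or native NDB API paths shown above. It assumes a running MySQL Cluster and the mysql-connector-python package, and the connection details are placeholders.

```python
# Sketch: create and populate an NDB (MySQL Cluster) table over plain SQL.
# The same rows can then be read or written through any of the NoSQL
# interfaces without further configuration.
import mysql.connector

conn = mysql.connector.connect(host="sqlnode.example.com", user="app",
                               password="secret", database="clusterdb")
cursor = conn.cursor()
# ENGINE=NDBCLUSTER stores the table in the data nodes, not the MySQL Server.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id INT NOT NULL,
        reading_ts TIMESTAMP NOT NULL,
        value DOUBLE,
        PRIMARY KEY (sensor_id, reading_ts)
    ) ENGINE=NDBCLUSTER
""")
cursor.execute(
    "INSERT INTO sensor_readings VALUES (%s, %s, %s)",
    (7, "2015-06-01 08:15:00", 21.4),
)
conn.commit()
conn.close()
```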
Benchmarks executed by Intel and Oracle demonstrate the performance advantages that can be realized by combining NoSQL APIs with the distributed, multi-master design of MySQL Cluster[4].

[4] http://mysql.com/why-mysql/benchmarks/mysql-cluster/
1.2 billion write operations per minute (19.5 million per second) were achieved, scaling linearly across a cluster of 30 commodity dual-socket (2.6GHz), 8-core Intel servers, each equipped with 64GB of RAM, running Linux and connected via InfiniBand.
Synchronous replication within node groups was configured, enabling both high performance and high availability without compromise. In this configuration, each node delivered 650,000 ACID-compliant write operations per second.
[Figure: Write operations per second (millions) scaling linearly from 2 to 30 MySQL Cluster data nodes.]
These results demonstrate how users can acquire transactional data at high volume and high
velocity on commodity hardware using MySQL Cluster.
To learn more about the NoSQL APIs for MySQL, and the architecture powering MySQL Cluster,
download the Guide to MySQL and NoSQL:
http://www.mysql.com/why-mysql/white-papers/mysql-wp-guide-to-nosql.php
MySQL Fabric
MySQL powers some of the most demanding Web applications, collecting enormous amounts of data that can add tremendous value to the businesses capable of harnessing it. MySQL Fabric makes it easier and safer to scale out MySQL databases in order to acquire large amounts of information:
Indeed, while MySQL Replication provides the mechanism to scale out reads (having one master
MySQL server handle all writes and then load balance reads across as many slave MySQL servers
as you need), a single server must handle all of the writes. As modern applications become more and more interactive, the proportion of writes will continue to increase. The ubiquity of social media means that the age of the "publish once and read a billion times" web site is over. Add to this the promise offered by Cloud platforms - massive, elastic scaling out of the underlying infrastructure - and you get a huge demand for scaling out to dozens, hundreds or even thousands of servers.
The most common way to scale out is by sharding the data between multiple MySQL Servers; this can be done vertically (each server holding a discrete subset of the tables - say, those for a specific set of features) or horizontally, where each server holds a subset of the rows for a given table. While effective, sharding has required developers and DBAs to invest a lot of effort in building and maintaining complex logic at the application and management layers - detracting from higher value activities.
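
To illustrate the kind of routing logic that teams have historically had to build and maintain by hand - and that MySQL Fabric is designed to take off their hands - here is a deliberately simplified, hypothetical hash-based shard router; the server names and shard count are invented for the example.

```python
# Sketch: naive application-side horizontal sharding by hash of the shard key.
# This is exactly the sort of hand-rolled logic MySQL Fabric aims to replace.
import hashlib

# Hypothetical pool of MySQL servers, one per shard.
SHARDS = [
    {"host": "mysql-shard0.example.com", "port": 3306},
    {"host": "mysql-shard1.example.com", "port": 3306},
    {"host": "mysql-shard2.example.com", "port": 3306},
]

def shard_for(user_id: int) -> dict:
    """Map a shard key to a server with a stable hash."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for user 42 must be routed to the same shard ...
print(shard_for(42))
# ... and the application must also handle resharding, failover and
# schema changes across shards - the complexity Fabric centralizes.
```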
The introduction of MySQL Fabric makes all of this far simpler. MySQL Fabric is designed to manage pools of MySQL Servers - whether just a pair for High Availability or many thousands to cope with scaling out huge web applications.
MySQL Fabric provides a simple and effective option for High Availability as well as the option of massive, incremental scale-out. It does this without sacrificing the robustness of MySQL and InnoDB, without requiring major application changes, and without needing your DevOps teams to move to unfamiliar technologies or abandon their favorite tools.
For more information about MySQL Fabric, get MySQL Fabric - A Guide to Managing MySQL High Availability & Scaling Out[5].

[5] http://www.mysql.com/why-mysql/white-papers/mysql-fabric-product-guide/
Apache Sqoop
Originally developed by Cloudera, Sqoop is now an Apache Top-Level Project[6]. Apache Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases. Sqoop can be used to:
1. Import data from MySQL into the Hadoop Distributed File System (HDFS), or related systems
such as Hive and HBase.
2. Extract data from Hadoop - typically the results from processing jobs - and export it back to MySQL tables. This will be discussed further in the Decide stage of the big data lifecycle.
3. Integrate with Oozie[7] to allow users to schedule and automate import/export tasks.
Sqoop uses a connector-based architecture that supports plugins providing connectivity between HDFS and external databases. By default, Sqoop includes connectors for most leading databases, including MySQL and Oracle Database, in addition to a generic JDBC connector that can be used to connect to any database that is accessible via JDBC. Sqoop also includes a specialized fast-path connector for MySQL that uses MySQL-specific batch tools to transfer data with high throughput.
When using Sqoop, the dataset being transferred is sliced up into different partitions and a map-only job is launched, with individual mappers responsible for transferring a slice of this dataset. Each record of the data is handled in a type-safe manner, since Sqoop uses the database metadata to infer the data types.
[6] http://sqoop.apache.org/
[7] http://oozie.apache.org/
When initiating the Sqoop import, the user provides a connect string for the database and the
name of the table to be imported.
As shown in the figure above, the import process is executed in two steps:
1. Sqoop analyzes the database to gather the necessary metadata for the data being imported.
2. Sqoop submits a map-only Hadoop job to the cluster. It is this job that performs the actual data
transfer using the metadata captured in the previous step.
The imported data is saved in a directory on HDFS based on the table being imported, though the
user can specify an alternative directory if they wish.
By default the data is formatted as CSV (Comma Separated Values), with new lines separating
different records. Users can override the format by explicitly specifying the field separator and
record terminator characters.
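
As a concrete sketch of such an import, the snippet below assembles and launches a Sqoop command from Python. The connect string, credentials, table name and target directory are placeholders, and it assumes Sqoop is installed on a Hadoop client node.

```python
# Sketch: launch a Sqoop import of a MySQL table into HDFS.
# All names below are placeholders for your own environment.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://mysql-host.example.com/webdb",
    "--username", "etl", "--password-file", "/user/etl/.mysql-pass",
    "--table", "weblogs",                 # MySQL table to import
    "--target-dir", "/data/raw/weblogs",  # HDFS output directory
    "--fields-terminated-by", ",",        # explicit CSV field separator
    "--num-mappers", "4",                 # four parallel map tasks
], check=True)
```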
You can see practical examples of importing and exporting data with Sqoop on the Apache blog.
Credit goes to the ASF for content and diagrams:
https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Replication via the Hadoop Applier is implemented by connecting to the MySQL master and reading events from the binary log[8] as soon as they are committed on the MySQL master, then writing them into a file in HDFS. Events describe database changes such as table creation operations or changes to table data.
The Hadoop Applier uses an API provided by libhdfs, a C library to manipulate files in HDFS. The
library comes precompiled with Hadoop distributions.
It connects to the MySQL master to read the binary log and then applies the events to HDFS as follows: databases are mapped as separate directories, with their tables mapped as sub-directories within a Hive data warehouse directory. Data inserted into each table is written into text files (named datafile1.txt) in Hive/HDFS. Data can be in comma-separated or other formats, configurable via command line arguments.
[8] http://dev.mysql.com/doc/refman/5.6/en/binary-log.html
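
The Hadoop Applier itself is a C++ program built on libhdfs, but the change data capture pattern it implements can be sketched in a few lines using the separate, community-maintained python-mysql-replication package. This is an illustration of the binlog-tailing concept, not the Applier's actual code; the connection settings are placeholders, and the MySQL server must have row-based binary logging enabled.

```python
# Sketch of binlog-based change data capture, conceptually similar to what
# the Hadoop Applier does (the Applier writes to HDFS via libhdfs instead).
# Requires binlog_format=ROW on the master and replication privileges.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import WriteRowsEvent

stream = BinLogStreamReader(
    connection_settings={"host": "mysql-host.example.com", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=4242,                # must be unique among replicas
    blocking=True,                 # keep waiting for new events
    only_events=[WriteRowsEvent],  # inserts only, for this sketch
)

with open("datafile1.txt", "a") as out:  # stand-in for an HDFS file
    for event in stream:
        for row in event.rows:
            # Append each inserted row as one comma-separated line,
            # mirroring the Applier's default text-file layout.
            out.write(",".join(str(v) for v in row["values"].values()) + "\n")
```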
The installation, configuration and implementation are discussed in detail in the Hadoop Applier blog[9]. Integration with Hive is documented as well.

You can download and evaluate Hadoop Applier code from MySQL Labs[10] (select the Hadoop Applier build from the drop-down menu).

Note that this code is currently a technology preview and not certified or supported for production deployment.

[9] http://innovating-technology.blogspot.fi/2013/04/mysql-hadoop-applier-part-2.html
[10] http://labs.mysql.com
As we have already seen, Sqoop and the Hadoop Applier are key technologies for connecting MySQL with Hadoop, and are available for use with multiple Hadoop distributions, e.g. Cloudera, Hortonworks and MapR.
Step 4: Decide
Result sets from Hadoop processing jobs are loaded back into MySQL tables using Apache Sqoop, where they become actionable for the organization.
As with the Import process, Export is performed in two steps as shown in the figure below:
1. Sqoop analyzes MySQL to gather the necessary metadata for the data being exported.
2. Sqoop divides the dataset into splits and then uses individual map tasks to push the splits to
MySQL. Each map task performs this transfer over many transactions in order to ensure
optimal throughput and minimal resource utilization.
The user provides connection parameters for the database when executing the Sqoop export process, along with the HDFS directory from which data will be exported and the name of the MySQL table to be populated.
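
A matching sketch of the export invocation is shown below, again with placeholder connection details; --export-dir names the HDFS directory holding the processed results and --table the MySQL table to populate.

```python
# Sketch: export Hadoop results from HDFS back into a MySQL table.
# Placeholder names; assumes the target table already exists in MySQL.
import subprocess

subprocess.run([
    "sqoop", "export",
    "--connect", "jdbc:mysql://mysql-host.example.com/webdb",
    "--username", "etl", "--password-file", "/user/etl/.mysql-pass",
    "--table", "user_recommendations",     # target MySQL table
    "--export-dir", "/data/output/recs",   # HDFS source directory
    "--input-fields-terminated-by", ",",   # how the HDFS files are delimited
], check=True)
```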
Once the data is in MySQL, it can be consumed by BI tools such as Oracle Business Intelligence
solutions, Pentaho, JasperSoft, Talend, etc. to populate dashboards and reporting software.
In many cases, the results can be used to control a real-time operational process that uses MySQL as its database. Continuing with the online retail example cited earlier, a Hadoop analysis can identify specific user preferences. Sqoop can be used to load this data back into MySQL, so when users access the site in the future, they will receive offers and recommendations based on their preferences and behavior during previous visits.
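
As a minimal sketch of that last step, the function below looks up precomputed recommendations at page-load time. The user_recommendations table and its columns are hypothetical, chosen to match the export sketch above.

```python
# Sketch: serve precomputed, Hadoop-derived recommendations at request time.
# The schema (user_recommendations) is hypothetical; credentials are placeholders.
import mysql.connector

def recommendations_for(user_id: int, limit: int = 5) -> list:
    conn = mysql.connector.connect(host="mysql-host.example.com", user="web",
                                   password="secret", database="webdb")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT product_id, score FROM user_recommendations "
        "WHERE user_id = %s ORDER BY score DESC LIMIT %s",
        (user_id, limit),
    )
    rows = cursor.fetchall()
    conn.close()
    return rows
```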
The following diagram shows the total workflow within a web architecture.
Figure 12: MySQL & Hadoop Integration Driving a Personalized Web Experience

[Figure: Web data is acquired in MySQL, organized with Oracle Big Data Appliance, analyzed with Oracle Exadata, and the Decide stage is delivered using Oracle Exalytics.]

You can learn more about Oracle Big Data solutions here:
http://www.oracle.com/us/technologies/big-data/index.html
MySQL Enterprise Edition is integrated and certified with the following products:
Full Text Search support for the InnoDB storage engine increases the range of queries
and workloads that MySQL can serve.
For more details on these capabilities and MySQL 5.6, get the following guide:
http://www.mysql.com/why-mysql/white-papers/whats-new-mysql-5-6/
The MySQL Enterprise Monitor is a web-based application that can manage MySQL within the
safety of a corporate firewall or remotely in a public cloud. MySQL Enterprise Monitor provides:
As noted earlier, it is also possible to monitor MySQL via Oracle Enterprise Manager.
The MySQL Query Analyzer
The MySQL Query Analyzer helps developers and DBAs improve application performance by
monitoring queries and accurately pinpointing SQL code that is causing a slowdown. Using the
Performance Schema with MySQL Server 5.6, data is gathered directly from the MySQL server
without the need for any additional software or configuration.
Queries are presented in an aggregated view across all MySQL servers so DBAs and developers
can filter for specific query problems and identify the code that consumes the most resources. With
the MySQL Query Analyzer, DBAs can improve the SQL code during active development and
continuously monitor and tune the queries in production.
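
While the Query Analyzer itself is part of MySQL Enterprise Monitor, the statement statistics it draws on are exposed by the Performance Schema in MySQL 5.6 and can be inspected by anyone. The sketch below pulls the five most expensive statement digests directly; connection details are placeholders.

```python
# Sketch: inspect the statement statistics the Query Analyzer builds on,
# straight from the MySQL 5.6 Performance Schema.
import mysql.connector

conn = mysql.connector.connect(host="mysql-host.example.com", user="dba",
                               password="secret")
cursor = conn.cursor()
cursor.execute("""
    SELECT DIGEST_TEXT, COUNT_STAR, SUM_TIMER_WAIT
    FROM performance_schema.events_statements_summary_by_digest
    ORDER BY SUM_TIMER_WAIT DESC
    LIMIT 5
""")
for digest_text, count, total_wait in cursor.fetchall():
    # Normalized statement text, execution count, cumulative wait time.
    print(count, total_wait, (digest_text or "")[:80])
conn.close()
```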
Design: MySQL Workbench includes everything a data modeler needs for creating complex
ER models, forward and reverse engineering, and also delivers key features for performing
difficult change management and documentation tasks that normally require much time and
effort.
Develop: MySQL Workbench delivers visual tools for creating, executing, and optimizing
SQL queries. The SQL Editor provides color syntax highlighting, reuse of SQL snippets, and
execution history of SQL. The Database Connections Panel enables developers to easily
manage database connections. The Object Browser provides instant access to database
schema and objects.
Migrate: MySQL Workbench now provides a complete, easy to use solution for migrating
Microsoft SQL Server, Microsoft Access, Sybase ASE, PostgreSQL, and other RDBMS
tables, objects and data to MySQL. Developers and DBAs can quickly and easily convert
existing applications to run on MySQL. The migration tool also supports moving from earlier versions of MySQL to the latest releases.
MySQL Enterprise Firewall

MySQL Enterprise Firewall monitors for database threats, automatically creates a whitelist of approved SQL statements and blocks unauthorized database activity.
MySQL Enterprise Audit
MySQL Enterprise Audit enables you to quickly and seamlessly add policy-based auditing
compliance to new and existing applications. You can dynamically enable user level activity
logging, implement activity-based policies, manage audit log files and integrate MySQL auditing
with Oracle and third-party solutions.
MySQL Enterprise High Availability
MySQL Enterprise High Availability enables you to make your database infrastructure highly
available. MySQL provides you with certified and supported solutions.
Oracle Premier Support for MySQL
MySQL Enterprise Edition provides 24x7x365 access to Oracle's MySQL Support team, staffed by database experts ready to help with the most complex technical issues, and backed by the MySQL developers. Oracle's Premier Support for MySQL provides you with:
In addition to MySQL Enterprise Edition, the following services may also be of interest to Big Data
professionals:
Oracle University
Oracle University offers an extensive range of MySQL training, from introductory courses (e.g. MySQL Essentials, MySQL DBA) through to advanced certifications such as MySQL Performance Tuning and MySQL Cluster Administration. It is also possible to define custom training plans for delivery on-site. You can learn more about MySQL training from Oracle University here: http://www.mysql.com/training/
MySQL Consulting
To ensure best practices are leveraged from the initial design phase of a project through to implementation and ongoing operations, users can engage Professional Services consultants. Delivered remotely or onsite, these engagements help in optimizing the architecture for scalability, high availability and performance. You can learn more at http://www.mysql.com/consulting/
Conclusion
Big Data and the Internet of Things are generating significant transformations in the way
organizations capture and analyze new and diverse data streams. As this paper has discussed,
MySQL can be seamlessly integrated within a Big Data lifecycle. Using MySQL solutions with the Hadoop platform and following the best practices outlined in this document can enable you to unlock more insight than was previously imaginable.
Additional Resources
MySQL Whitepapers
http://www.mysql.com/why-mysql/white-papers/
MySQL Webinars:
Live: http://www.mysql.com/news-and-events/web-seminars/index.html
On Demand: http://www.mysql.com/news-and-events/on-demand-webinars/
MySQL Enterprise Edition Demo:
http://www.youtube.com/watch?v=guFOVCOaaF0
MySQL Cluster Demo:
https://www.youtube.com/watch?v=A7dBB8_yNJI
MySQL Enterprise Edition Trial:
http://www.mysql.com/trials/
MySQL Case Studies:
http://www.mysql.com/why-mysql/case-studies/
MySQL TCO Savings Calculator:
http://mysql.com/tco
To contact an Oracle MySQL Representative:
http://www.mysql.com/about/contact/