
Chap 3.

Application Architectures For Big Data And Analytics

Big Data Warehouse & Analytics


Big Data Warehouse System Requirements & Hybrid Architectures
Enterprise Data Platform Ecosystem
Big Data and Master Data Management

Big Data Warehouse & Analytics

What is Big Data / Data Warehousing?

In order to examine the truth (or lack thereof) in this line of thinking, we need to start with
the basics. First, what is big data? There are actually many different forms of big data. But
the most widely understood form of big data is the form found in Hadoop, Cloudera, et al.

A good working definition of big data solutions is:

 Technology capable of holding very large amounts of data.

 Technology that can hold the data in inexpensive storage devices.

 Technology where processing is done by the “Roman census” method.

 Technology where the data is stored in an unstructured format.

There are probably other ramifications and features, but these basic characteristics are a
good working description of what most people mean when they talk about a big data
solution.

What is a Data Warehouse?

There are different interpretations of what is meant by big data, and there are different
interpretations of what is meant by data warehousing. In principle, there is the Kimball
approach to data warehousing, and there is the Inmon approach to data warehousing. For
the purposes of this article, the Inmon approach to data warehousing will be discussed. The
Inmon approach to data warehousing centers around the definition of a data warehouse,
which was given many years ago. A data warehouse is a subject-oriented, nonvolatile,
integrated, time variant collection of data created for the purpose of management’s decision
making. Another way of saying the same thing is that a data warehouse provides a “single
version of the truth” for decision making in the corporation. With a data warehouse there is
an integrated, granular, historical single point of reference for data in the corporation.

So why do people want a big data solution? People want a big data solution because in a lot
of corporations there is a lot of data. And in those corporations that data – if unlocked
properly – can contain much valuable information that can lead to better decisions that, in
turn, can lead to more revenue, more profitability and more customers. And that is what
most corporations want.

And why do people need a data warehouse? People need a data warehouse in order to make
informed decisions. In order to really know what is going on in your corporation, you need
data that is reliable, believable and accessible to everyone.

Comparing Big Data Solutions to a Data Warehouse

So when we compare a big data solution to a data warehouse, what do we find? We find that
a big data solution is a technology and that data warehousing is an architecture. They are
two very different things. A technology is just that – a means to store and manage large
amounts of data. A data warehouse is a way of organizing data so that there is corporate
credibility and integrity. When someone takes data from a data warehouse, that person
knows that other people are using the same data for other purposes. There is a basis for
reconcilability of data when there is a data warehouse.

Big Data and the Roman Census Approach

One of the cornerstones of Big Data architecture is processing referred to as the “Roman
Census approach”. By using the Roman census approach a Big Data architecture can
accommodate the processing of almost unlimited amounts of data.

When people first hear the “Roman census approach” it appears to be counter-intuitive and
unfamiliar. The reaction most people have is – “and just exactly what is a Roman census
approach?” Yet the approach – architecturally – is at the core of the functioning of Big Data.
And – surprisingly – it turns out that many people are much more familiar with the Roman
census approach than they ever realized.

Once upon a time the Romans decided that they wanted to tax everyone in the Roman
empire. But in order to tax the citizens of the Roman empire the Romans first had to have a
census. The Romans quickly figured out that trying to get every person in the Roman empire
to march through the gates of Rome in order to be counted was an impossibility. There were
people in North Africa, in Spain, in Germany, in Greece, in Persia, in Israel, and so forth. Not
only were there a lot of people in faraway places, but trying to transport everyone on ships and
carts and donkeys to and from the city of Rome was simply an impossibility.

So the Romans realized that creating a census where the processing (i.e., the counting, the
taking of the census) was done centrally was an impossibility. The Romans solved the
problem by creating a body of “census takers”. The census takers were organized in Rome,
sent out all over the Roman empire, and on the appointed day a census was taken.
Then, after taking the census, the census takers headed back to Rome where the results were
tabulated centrally.

In such a fashion the work being done was sent to the data, rather than trying to send the
data to a central location and doing the work in one place. By distributing the processing, the
Romans solved the problem of creating a census over a large diverse population.

Many people don’t realize that they are already familiar with the Roman census method.
There once was a story about two people – Mary and Joseph – who had to travel to a small
city – Bethlehem – for the taking of a Roman census. On the way there Mary had a little baby
boy – named Jesus – in a manger. And the shepherds flocked to see this baby boy. And Magi
came and delivered gifts. Thus was born the religion many people are familiar with –
Christianity. The Roman census approach is intimately entwined with the birth of Christianity.

The Roman census method then says that you don’t centralize processing if you have a large
amount of data to process. Instead you send the processing to the data. You distribute the
processing. In doing so you can scale the processing over an effectively unlimited amount
of data.
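
To make the idea concrete, here is a minimal, hypothetical sketch of the same principle in code: instead of shipping all records to one place, each partition is counted where it lives and only the small per-partition tallies travel back to be merged, which is essentially the map-and-reduce pattern that Hadoop popularized. The partition data and function names below are illustrative, not drawn from any particular product.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

# Hypothetical "provinces": each list stands in for data that lives on a remote node.
PROVINCES = {
    "hispania": ["merchant", "farmer", "farmer", "soldier"],
    "gallia":   ["farmer", "farmer", "farmer"],
    "aegyptus": ["scribe", "farmer", "merchant"],
}

def count_locally(records):
    """Map step: tally records where they are stored, returning only a small summary."""
    return Counter(records)

def merge_counts(partial_counts):
    """Reduce step: combine the small per-partition summaries centrally."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total

if __name__ == "__main__":
    # Each worker processes one partition; only the counts travel back, not the raw data.
    with ProcessPoolExecutor() as pool:
        partials = pool.map(count_locally, PROVINCES.values())
        census = merge_counts(partials)
    print(dict(census))  # e.g. {'farmer': 6, 'merchant': 2, 'soldier': 1, 'scribe': 1}
```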

Big Data Warehouse System Requirements & Hybrid Architectures

In spite of what you may have heard, Hadoop is not the sum total of big data. Another big
data "H"—hybrid—is becoming dominant, and Hadoop is an important (but not all-
encompassing) component of it. In the larger evolutionary perspective, big data is
evolving into a hybridized paradigm under which Hadoop, massively parallel processing
(MPP) enterprise data warehouses (EDW), in-memory columnar, stream computing,
NoSQL, document databases, and other approaches support extreme analytics in the
cloud.

Hybrid architectures address the heterogeneous reality of big data environments and
respond to the need to incorporate both established and new analytic database
approaches into a common architecture. The fundamental principle of hybrid
architectures is that each constituent big data platform is fit-for-purpose to the role for
which it's best suited. These big data deployment roles may include any or all of the
following:

1. Data acquisition
2. Collection
3. Transformation
4. Movement
5. Cleansing
6. Staging
7. Sandboxing
8. Modeling
9. Governance
10. Access
11. Delivery
12. Interactive exploration
13. Archiving

In any role, a fit-for-purpose big data platform often supports specific data sources,
workloads, applications, and users.

Hybrid is the future of big data because users increasingly realize that no single
type of analytic platform is always best for all requirements. Also, platform churn—
plus the heterogeneity it usually produces—will make hybrid architectures more
common in big data deployments. The inexorable trend is toward hybrid
environments that address the following enterprise big data imperatives:

 Extreme scalability and speed: The emerging hybrid big data platform will
support scale-out, shared-nothing massively parallel processing, optimized appliances,
optimized storage, dynamic query optimization, and mixed workload management.
 Extreme agility and elasticity: The hybrid big data environment will persist data
in diverse physical and logical formats across a virtualized cloud of interconnected
memory and disk that can be elastically scaled up and out at a moment's notice.

 Extreme affordability and manageability: The hybrid environment will
incorporate flexible packaging/pricing, including licensed software, modular appliances,
and subscription-based cloud approaches.

Hybrid architectures are already widespread in many real-world big data deployments.
The most typical are the three-tier—also called "hub-and-spoke"—architectures. These
environments may have, for example, Hadoop (e.g., IBM InfoSphere BigInsights) in the
data acquisition, collection, staging, preprocessing, and transformation layer; relational-
based MPP EDWs (e.g., IBM PureData System for Analytics) in the hub/governance
layer; and in-memory databases (e.g., IBM Cognos TM1) in the access and interaction
layer.
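
As a hedged illustration of how data might move between such tiers, the following PySpark sketch reads raw events from a Hadoop staging layer, applies a light transformation, and loads the result into a relational EDW over JDBC. The paths, column names, table name, and connection URL are placeholders, and any Hadoop-compatible engine and JDBC-accessible warehouse could stand in for the specific IBM products named above.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("staging-to-edw").getOrCreate()

# Tier 1 (acquisition/staging): raw, semi-structured events landed in Hadoop.
raw = spark.read.json("hdfs:///staging/clickstream/2024/*.json")  # placeholder path

# Light cleansing/transformation before the data enters the governed hub tier.
cleaned = (
    raw.filter(F.col("customer_id").isNotNull())
       .withColumn("event_date", F.to_date("event_ts"))
       .select("customer_id", "event_date", "page", "revenue")
)

# Tier 2 (hub/governance): append into the MPP enterprise data warehouse via JDBC.
(cleaned.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://edw-host:5432/warehouse")  # placeholder EDW
        .option("dbtable", "analytics.clickstream_facts")
        .option("user", "etl_user")
        .option("password", "***")
        .mode("append")
        .save())

spark.stop()
```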

The complexity of hybrid architectures depends on the range of sources, workloads, and
applications you're trying to support. In the back-end staging tier, you might need
different preprocessing clusters for each of the disparate sources: structured, semi-
structured, and unstructured. In the hub tier, you may need disparate clusters configured
with different underlying data platforms—RDBMS, stream computing, HDFS, HBase,
Cassandra, other NoSQL stores, and so on—and corresponding metadata, governance, and in-
database execution components. And in the front-end access tier, you might require
various combinations of in-memory, columnar, OLAP, dimensional, and other database
technologies to deliver the requisite performance on diverse analytic applications, ranging
from operational BI to advanced analytics and complex event processing.

Ensuring that hybrid big data architectures stay cost-effective demands the following
multipronged approach to optimization of distributed storage:

 Apply fit-for-purpose databases to particular big data use cases: Hybrid
architectures spring from the principle that no single data storage, persistence, or
structuring approach is optimal for all deployment roles and workloads. For example, no
matter how well-designed the dimensional data model is within an OLAP environment,
users eventually outgrow these constraints and demand more flexible decision support.
Other database architectures—such as columnar, in-memory, key-value, graph, and
inverted indexing—may be more appropriate for such applications, but not generic
enough to address other broader deployment roles.
 Align data models with underlying structures and applications: Hybrid
architectures leverage the principle that no fixed big data modeling approach—physical
and logical—can do justice to the ever-shifting mix of queries, loads, and other
operations. As you implement hybrid big data architectures, make sure you adopt tools
that let you focus on logical data models, while the infrastructure automatically
reconfigures the underlying big data physical data models, schemas, joins, partitions,
indexes, and other artifacts for optimal query and data load performance.
 Intelligently compress and manage the data: Hybrid architectures should allow
you to apply intelligent compression to big data sets to reduce their footprint and make
optimal use of storage resources. Also, some physical data models are more inherently
compact than others (e.g., tokenized and columnar storage are more efficient than row-
based storage), just as some logical data models are more storage-efficient (e.g., third-
normal-form relational is typically more compact than large denormalized tables stored in a
dimensional star schema). A minimal sketch comparing row-based and columnar storage
follows this list.
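
As a small illustration of the storage point above, the sketch below (assuming pandas and pyarrow are installed) writes the same synthetic table as uncompressed, row-oriented CSV and as snappy-compressed, columnar Parquet, then compares the on-disk footprint. Exact ratios will vary with the data, but repetitive columns typically compress far better in the columnar form.

```python
import os
import numpy as np
import pandas as pd

# Synthetic, repetitive data of the kind that compresses well in columnar formats.
n = 200_000
df = pd.DataFrame({
    "customer_id": np.random.randint(1, 5_000, n),
    "region": np.random.choice(["EMEA", "AMER", "APAC"], n),
    "product": np.random.choice(["basic", "plus", "premium"], n),
    "revenue": np.random.gamma(2.0, 50.0, n).round(2),
})

df.to_csv("facts.csv", index=False)                    # row-oriented, uncompressed
df.to_parquet("facts.parquet", compression="snappy")   # columnar, compressed (needs pyarrow)

csv_mb = os.path.getsize("facts.csv") / 1e6
parquet_mb = os.path.getsize("facts.parquet") / 1e6
print(f"CSV: {csv_mb:.1f} MB  Parquet: {parquet_mb:.1f} MB  "
      f"ratio: {csv_mb / parquet_mb:.1f}x")
```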

Yes, more storage tiers can easily mean more tears. The complexities, costs, and
headaches of these multi-tier hybridized architectures will drive you toward greater
consolidation, where it's feasible.

But it may not be as feasible as you wish.

The hybrid big data environment will continue the long-term trend away from centralized
and hub-and-spoke topologies toward the new worlds of cloud-oriented and federated
architectures. The hybrid platform is evolving away from a single master “schema” and
more toward database virtualization behind a semantic abstraction layer. Under this new
paradigm, the hybrid big data environment will require virtualized access to the disparate
schemas of the relational, dimensional, and other constituent DBMSs and repositories
that together constitute a logically unified cloud-oriented resource.

Our best hope is that the abstraction/virtualization layer of the hybrid environment will
reduce tears, even as tiers proliferate, by providing your big data professionals with
logically unified access, modeling, deployment, optimization, and management of this
heterogeneous resource. The architectural centerpiece of this new hybridized landscape
must be a standard query-virtualization or abstraction layer that supports transparent
SQL access to any and all back-end platforms. SQL will continue to be the lingua franca
for all analytics and transactional database applications. Consequently, big data solution
providers absolutely must allow SQL developers to transparently tap into the full range of
big data platforms, current and future, without modifying their code.
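
A minimal sketch of that idea, using SQLAlchemy as a stand-in for a query-virtualization layer: the analytic SQL stays the same while only the connection URL, and therefore the back-end engine, changes. The dialect URLs are illustrative assumptions and depend on which drivers (PostgreSQL, Hive, Trino, etc.) are actually installed in your environment.

```python
from sqlalchemy import create_engine, text

# One analytic query, expressed once in SQL.
REVENUE_BY_REGION = text("""
    SELECT region, SUM(revenue) AS total_revenue
    FROM sales_facts
    GROUP BY region
""")

# Illustrative back ends; real URLs depend on the dialects/drivers you have installed.
BACKENDS = {
    "edw":    "postgresql://etl_user:***@edw-host:5432/warehouse",
    "hadoop": "hive://hadoop-gateway:10000/default",
    "lake":   "trino://analyst@trino-host:8080/hive/analytics",
}

def run_everywhere(query):
    """Run the same SQL against each configured back end and collect the results."""
    results = {}
    for name, url in BACKENDS.items():
        engine = create_engine(url)
        with engine.connect() as conn:
            results[name] = conn.execute(query).fetchall()
        engine.dispose()
    return results

if __name__ == "__main__":
    for backend, rows in run_everywhere(REVENUE_BY_REGION).items():
        print(backend, rows)
```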

Enterprise Data Platform Ecosystem:

The Big Data ecosystem is the broad view of the variety of data stores for different data
structures, different methods of search and query, and different algorithms and approaches
to analyze, store, and recombine both structured and unstructured data, and to report
results that are both descriptive and predictive and that open new possibilities. We will
discuss the main capabilities required for a more scoped view of Big Data: a reference
architecture which shows the broad set of capabilities of the Big Data ecosystem.

Big Data and Master Data Management

Update Information Strategy and Architecture

Many organizations have had success leveraging big data insight around specific
business operations, but typically it’s limited to a single business unit or use case. Few
firms have explored how to make big data insights actionable across the entire
organization, by linking big data sources with trusted master data.

For example, many marketing organizations use data from social sources — such as
Twitter and Facebook — to inform their campaigns, but they don’t reconcile this with
trusted data in customer/prospect repositories that are used by customer services or
sales. This can lead to incoherent customer communication that can actually undermine
the sales or customer service process.
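
A hedged sketch of the kind of reconciliation that closes this gap: social mentions are matched to governed customer master records on a normalized key (an email address here, purely for illustration) so that campaign and service teams see the same linked view. The field names and the matching rule are assumptions, not a prescribed model.

```python
import pandas as pd

# Trusted customer master data (MDM hub extract) - illustrative fields.
master = pd.DataFrame({
    "mdm_id": ["C001", "C002", "C003"],
    "name": ["Ana Ruiz", "Ben Cole", "Dana Wu"],
    "email": ["ana.ruiz@example.com", "ben.cole@example.com", "dana.wu@example.com"],
    "segment": ["premium", "standard", "premium"],
})

# Raw social/campaign data - handles and self-reported emails, not yet governed.
social = pd.DataFrame({
    "handle": ["@ana_r", "@bcole", "@unknown_user"],
    "email": ["Ana.Ruiz@Example.com", "ben.cole@example.com", None],
    "sentiment": [0.9, -0.4, 0.1],
})

def normalize(series):
    """Normalize the match key so trivial formatting differences do not block the join."""
    return series.str.strip().str.lower()

master["email_key"] = normalize(master["email"])
social["email_key"] = normalize(social["email"])

linked = social.merge(master[["mdm_id", "segment", "email_key"]],
                      on="email_key", how="left")

# Mentions without a trusted match are flagged for stewardship rather than acted on blindly.
print(linked[["handle", "mdm_id", "segment", "sentiment"]])
print("unmatched mentions:", linked["mdm_id"].isna().sum())
```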

Become More Agile

Effective use of big data requires a mixture of old and new technologies and practices.
This necessitates an agile approach that applies a bimodal IT framework to information
governance (see “Why Digital Business Needs Bimodal IT”). MDM traditionally uses a
Mode 1 approach which is policy-driven and approval-based. Big data typically uses a
Mode 2 approach with little or no predefined processes or controls. Tactical and
exploratory initiatives are much better suited to the “faster” Mode 2.

Move to Limit Risk Exposure

When an organization executes actions based on information sources outside the
curation of MDM — as is the case in many big data implementations — exposure to
certain types of business risk increases. Factors such as poor data quality, loss of critical
information, and access to unauthorized information become more likely. Gartner
recommends appointing a lead information steward role in relevant business units to
assist in creating and executing risk controls with regards to data use in business
operations.

Identify Medium and Long-Term Requirements

The scale of the challenges facing organizations and their information infrastructure is
rapidly shifting. This doesn’t necessarily mean that current MDM implementations are no
longer fit for purpose. Immediate assessment and planning in light of new and future
requirements is, however, essential to ensure that both investment and hiring keep up
with the information demands of digital business. Data modelling, quality management,
integration and synchronization are all areas that may soon require additional tools and
skills for a business to remain competitive.

Master Data Management (MDM) systems and the content they contain may seem
counterintuitive or even diametrically opposed to Big Data systems. Some of the
considerable differences between Master Data and Big Data include:

 Volume: Comparatively, Master Data sets are much smaller than those for Big
Data. One of the pivotal attractions for Big Data is that it encompasses enormous
volumes; a person could argue that one of the points of attraction for Master Data is the
opposite.
 Structure: Master Data tends to contain structured data, while the majority of Big
Data is either unstructured or semi-structured.
 Relationship to the enterprise: Typically, MDM systems contain an organization’s
most trusted data, which tends to be internal, while Big Data platforms quarter
enormous amounts of external data from any number of cloud, social media, mobile,
and other sources beyond the enterprise’s firewall. As indicated by Gartner, “MDM is
more oriented around internal, enterprise-centric data; in an environment the
organization feels it has a chance to effect change, and so formal information
governance.”
Despite these differences, there are numerous ways in which Master Data Management
can enhance Big Data applications, and in which the latter can do so for the former. The
basic paradigm for the relationship between these two types of data pertains to the
context offered by Big Data and the trust gleaned from Master Data. These virtues can
inform one another equally. According to Forrester, MDM can be:
“…a hub for context in customer experience – sitting between systems of record and
systems of engagement to translate, manage and evolve dynamically the full fidelity of
customer identity through interactions directly or as viewed through indirect business
processes and supporting activities.”

Input: Providing Context to MDM


Organizations can expand their Master Data Management with Big Data by applying the
context of data from the external world to their trusted internal data. In this respect, MDM
can not only take advantage of relatively new sources of (Big) Data, but also help provide
the proverbial 360 degree, comprehensive view of customers.

Although there are numerous domains for MDM, the customer domain is perhaps most
readily enhanced by Big Data. The incorporation of mobile, social, and cloud data can
provide numerous points of reference about a customer and his or her experience with an
organization’s products that can greatly inform data traditionally stored in MDM. Such
data includes customer interactions and relevant transactional data. Thus, Big Data can
substantially enrich Master Data, supplying the sort of context that is one of Big Data’s
critical boons and leading to greater customer understanding. Furthermore, this approach
can result in Big Data augmenting Master Data to the point where that external data is
actually aggregated in an MDM hub. Additionally, it is possible to position one’s MDM in the
cloud and enable applications to access it as part of a Service-Oriented Architecture.

Input: Facilitating Big Data Context to MDM


The challenge with applying Big Data to MDM systems lies in distinguishing relevant
unstructured data that relates to Master Data from data that does not. A few options exist
for this purpose: vendors have recently implemented NoSQL offerings to attain this end.
The difference in the sheer quantities of data between Master Data and Big Data
generally rules out utilizing Hadoop as a means of integrating relevant data, although there
are vendors who are working in this vein as well.

A third alternative is the deployment of analytics options (such as those specializing in
sentiment data incorporating Natural Language Processing (NLP) and other semantic
technologies) to first ascertain which data have bearing on germane MDM fields. Aside
from recently released MDM solutions that utilize NoSQL methods, it is typically not
advantageous to merely add Big Data to an MDM hub without first filtering it. The
aforementioned analytics approach can provide that preliminary point of distinction so
that organizations can discern which Big Data can add context to their Master Data.
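
As a simplified stand-in for the NLP-driven filtering described above, the sketch below screens unstructured posts for mentions of governed master product names before any of that content is allowed to add context to the MDM hub. A production approach would use real entity extraction and sentiment models rather than plain substring matching, and all names here are hypothetical.

```python
import re

# Governed reference: product names managed as master data (illustrative).
MASTER_PRODUCTS = {"Aurora 9 Router", "Nimbus Firewall", "Helix Switch"}

posts = [
    "My Aurora 9 Router keeps dropping the connection after the update.",
    "Great weekend hiking, no gadgets involved.",
    "Support fixed my Nimbus Firewall config in ten minutes, impressed.",
]

def relevant_mentions(text, products):
    """Return the master products mentioned in a post (case-insensitive whole-name match)."""
    found = []
    for product in products:
        if re.search(re.escape(product), text, flags=re.IGNORECASE):
            found.append(product)
    return found

# Only posts that reference a governed entity are forwarded as context for MDM.
for post in posts:
    mentions = relevant_mentions(post, MASTER_PRODUCTS)
    if mentions:
        print("KEEP   ", mentions, "->", post)
    else:
        print("DISCARD", "->", post)
```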

Output: Providing Trust for Big Data

The degree of governance that is bestowed upon Master Data and regulated within MDM
systems is designed for ready incorporation into any variety of applications, including
those for Big Data. Organizations can leverage their Master Data to effectively gauge the
trustworthiness of Big Data—and of whatever governance mechanisms are in place at
the application level. For instance, incorporating Master Data with Big Data sets can
enable organizations to identify the names of customers and products in their Big Data. In
such a way, Master Data can influence a number of operational systems, including those
that pertain to Big Data and those that do not. As indicated here, MDM can feed Big
Data by “providing the data model backbone to bind the Big Data facts” (a minimal sketch
follows the list below). Viewed from this
perspective, Master Data Management is a critical prerequisite for Big Data Governance
—particularly when one considers the various facets of governance that are a part of any
competitive MDM system. Those include aspects of:

 Lifecycle Management
 Data Quality (Deduplication)
 Data Cleansing
 Metadata Management
 Reference Data Management
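
To illustrate the “data model backbone” idea quoted above, here is a minimal, assumption-laden sketch in which a governed customer master table acts as the conformed dimension that big data events must bind to; facts that cannot be bound are routed to stewardship rather than analytics. Table and column names are invented for the example, and a real MDM hub would apply far richer match and survivorship rules.

```python
import pandas as pd

# Master data: the governed backbone (deduplicated, cleansed customer dimension).
customers = pd.DataFrame({
    "customer_key": [1, 2, 3],
    "customer_name": ["Ana Ruiz", "Ben Cole", "Dana Wu"],
})

# Big data facts: high-volume events referencing customers by name, not by key.
events = pd.DataFrame({
    "event_id": [101, 102, 103, 104],
    "customer_name": ["Ana Ruiz", "ben cole", "Ana Ruiz", "Unknown Buyer"],
    "amount": [120.0, 45.5, 80.0, 10.0],
})

# Bind facts to the backbone with a normalized name key; this only shows the shape of the idea.
customers["name_key"] = customers["customer_name"].str.lower()
events["name_key"] = events["customer_name"].str.lower()

bound = events.merge(customers[["customer_key", "name_key"]], on="name_key", how="left")

governed_facts = bound[bound["customer_key"].notna()]     # safe to analyze and report on
stewardship_queue = bound[bound["customer_key"].isna()]   # needs review before use

print(governed_facts[["event_id", "customer_key", "amount"]])
print("facts needing stewardship:", len(stewardship_queue))
```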

