
Harnessing Big Data

OLTP: Online Transaction Processing (DBMSs)


OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
What’s driving Big Data

Big data analytics:
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time nature

versus traditional analytics:
- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets
The Evolution of Business Intelligence

(Timeline figure, tracking the speed and scale of analytics across three decades:)
- 1990’s: BI Reporting, OLAP & Data Warehouse -- Business Objects, SAS, Informatica, Cognos and other SQL reporting tools
- 2000’s: Big Data, batch processing & distributed data stores -- Hadoop/Spark; HBase/Cassandra
- 2010’s: Interactive Business Intelligence & Big Data, real time & single view -- in-memory RDBMS (QlikView, Tableau, HANA) and graph databases


Big Data Analytics

• Big data is more real-time in nature than traditional DW applications.
• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps.
• Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps.

Big Data Tech Stack
Redundant Physical Infrastructure
• Redundant physical infrastructure is fundamental to the operation
and scalability of a big data architecture.
• To support an unanticipated or unpredictable volume of data, a
physical infrastructure for big data has to be different than that for
traditional data.
• The physical infrastructure is based on a distributed computing
model. This means that data may be physically stored in many
different locations and can be linked together through networks, the
use of a distributed file system, and various big data analytic tools
and applications.
• Redundancy is important because we are dealing with so much data
from so many different sources. Redundancy comes in many forms. It
may take the form of an internal (private) cloud used for load balancing.
• Redundancy may also come from external cloud services used to augment an
organization's internal resources. In some cases, this redundancy may come in the
form of Software as a Service (SaaS).
Security Infrastructure

• As big data analysis becomes more important to companies, it
becomes more important to secure that data (for example, you may
have to protect the privacy of clients, patients, etc.).
• There should be layered access control to sensitive data.
• Individual data should be protected when deriving trends and
predictions from group data, to comply with the law of the land.
• Data encryption
• Threat detection
Operational databases
• One has to incorporate all the data sources that give a
complete picture of the business and see how the data impacts the way
the business operates. Traditionally, an operational data source
consisted of highly structured data managed by a relational
database. But now data has to encompass a broader set of
sources, including unstructured sources such as customer and
social media data in all its forms. New approaches to data
management have emerged in the big data world, including document, graph,
columnar, and geospatial database architectures. Collectively,
these are referred to as NoSQL, or “not only SQL”, databases.
• One needs data architectures that support complex unstructured
content.
• One needs to include both relational databases and non-relational
databases in the approach to harnessing big data. It is also necessary
to include unstructured data sources, such as content management
systems, to get the complete business view.
Operational databases
• Performance might also determine the kind of database you would
use. In some situations, you may want to understand how two very
distinct data elements are related. What is the relationship between
buzz on a social network and the growth in sales? This is not the
typical query you could ask of a structured, relational database. A
graph database might be a better choice, and using the right database
will also improve performance. Typically, graph databases are
used in scientific and technical applications.
• Other important operational database approaches include columnar
databases that store information efficiently in columns rather than
rows. This approach leads to faster performance because analytical
queries read only the columns they need, so input/output is very fast.
When geographic data storage is part of the equation, a spatial
database is optimized to store and query data based on how objects
are related in space.
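As a rough illustration of the column-versus-row idea, here is a minimal sketch in plain Python with made-up data (not any particular database engine): an analytic aggregate over one field touches only that field's column in a column store, rather than every field of every row.

row_store = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 75.5},
    {"order_id": 3, "region": "EU", "amount": 42.0},
]

column_store = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 42.0],
}

# Row layout: every whole record is read just to pick out one field.
total_from_rows = sum(r["amount"] for r in row_store)

# Column layout: only the contiguous "amount" column is scanned.
total_from_columns = sum(column_store["amount"])

assert total_from_rows == total_from_columns == 237.5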
Organizing data services and tools
• A growing amount of data comes from a variety of sources that
aren’t quite as organized or straightforward, including data that
comes from machines or sensors, and massive public and
private data sources.
• In the past, it was simply too expensive or too overwhelming to manage
this data. Even if companies were able to capture the data,
they did not have the tools to do anything with it. Either there
were no tools, or the tools that did exist were complex to use
and did not produce results in a reasonable time frame.
• Most were forced to work with snapshots of data. This has the
undesirable effect of missing important events because they
were not captured in a particular snapshot.
Organizing data services and tools

• With the evolution of computing technology, it is now
possible to manage immense volumes of data that
previously could only have been handled by expensive,
specialized computers.
• New technologies like MapReduce, Hadoop, and Bigtable
make it possible to harness massive computing power to
analyze massive amounts of data.
Traditional and advanced analytics
• Big data requires many different approaches to analysis, depending on the
problem being solved. Some analyses will use a traditional data
warehouse, while other analyses will take advantage of advanced
predictive analytics. Managing big data holistically requires many different
approaches to help the business to successfully plan for the future.
• Analytical data warehouses and data marts: Sometimes it may be
pragmatic to take the subset of data that reveals patterns and put it into a
form that’s available to the business. These warehouses and marts
provide compression, multilevel partitioning, and a massively parallel
processing architecture.
• Big data analytics: The capability to manage and analyze petabytes of
data helps companies gain insight that could have an impact on the
business. This requires analytical engines that can manage this highly
distributed data and provide results that can be optimized to solve a
business problem. Analytics can get quite complex with big data. For
example, some organizations are using predictive models that couple
structured and unstructured data together to predict fraud. Social media
analytics, text analytics, and new kinds of analytics are being utilized by
organizations looking to gain insight into big data.
Reporting and visualization
• Organizations have long relied on reports to give them an understanding
of the present and future state of the business.
• Big data changes the way that data is managed and used. If a
company can collect, manage, and analyze enough data, it can
use a new generation of tools to help management truly
understand the impact not just of a collection of data elements
but also how these data elements offer context based on the
business problem being addressed. With big data, reporting
and data visualization become tools for looking at the context
of how data is related and the impact of those relationships on
the future.
Big Data Applications
• Some of the emerging applications are in areas such as
healthcare, manufacturing management, traffic
management, and so on. All of them rely on huge
volumes, velocities, and varieties of data to transform the
behavior of a market. In healthcare, a big data
application might monitor premature infants to
determine when the data indicates that intervention is
needed. In manufacturing, a big data application can be
used to prevent a machine from shutting down during a
production run. A big data traffic management application
can reduce the number of traffic jams on busy city
highways to decrease accidents, save fuel, and reduce
pollution.
Big data processes

The overall process of extracting insights from big data can be broken down into five stages.
These five stages form the two main sub-processes: data management and analytics. Data
management involves processes and supporting technologies to acquire and store data and
to prepare and retrieve it for analysis. Analytics, on the other hand, refers to techniques
used to analyze and acquire intelligence from big data. Thus, big data analytics can be
viewed as a sub-process in the overall process of ‘insight extraction’ from big data.
Big data management
MapReduce model
Distributed grep (figure): a very big dataset is split into chunks; each chunk is
grep'ed in parallel to produce matches, and the matches are concatenated (cat)
into all matches.

Distributed word count (figure): a very big dataset is split into chunks; each chunk
is counted in parallel, and the partial counts are merged into a merged count.
First Map, then Reduce

(Figure: very big data flows through Map, then a partitioning function, then
Reduce, producing the result.)

Map:
– Accepts an input key/value pair
– Emits intermediate key/value pairs

Reduce:
– Accepts an intermediate key/value* pair (a key with all of its values)
– Emits output key/value pairs
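A minimal, single-machine sketch of this word-count pattern in plain Python (no Hadoop involved; the names map_fn, reduce_fn, and run_job are illustrative, not a real framework API): map emits intermediate key/value pairs, a shuffle step groups them by key, and reduce emits the output pairs.

from collections import defaultdict

def map_fn(_, line):                       # accepts an input key/value pair
    for word in line.split():
        yield word, 1                      # emits intermediate key/value pairs

def reduce_fn(word, counts):               # accepts an intermediate key/value* pair
    yield word, sum(counts)                # emits an output key/value pair

def run_job(lines):
    groups = defaultdict(list)             # the "shuffle" between map and reduce
    for key, line in enumerate(lines):
        for k, v in map_fn(key, line):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

print(run_job(["big data is big", "data is everywhere"]))
# {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}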
Directed Acyclic Graph model
The MapReduce model simply states that distributed computation on a large dataset can
be boiled down to two kinds of computation steps - a map step and a reduce step.
One pair of map and reduce does one level of aggregation over the data. Complex
computations typically require multiple such steps. When you have multiple such
steps, they essentially form a DAG of operations. So a DAG execution model is
essentially a generalization of the MapReduce model.
Computations expressed in MapReduce boil down to multiple iterations of:
• Read data from HDFS,
• Apply map and reduce,
• Write back to HDFS.
Each map-reduce round is completely independent of the others, and Hadoop does
not have any global knowledge of what MR steps are going to come after each one.
For many iterative algorithms this is inefficient, as the data between each map-reduce
pair gets written to and read from the filesystem. Newer systems like Spark and Tez
improve performance over Hadoop by considering the whole DAG of map-reduce
steps and optimizing it globally (e.g., pipelining consecutive map steps into one, not
writing intermediate data to HDFS). This avoids writing data back and forth after
every reduce.
Directed Acyclic Graph model

"RDD" - Resilient Distributed Dataset. DAG in Apache Spark is a set of


Vertices and Edges, where vertices represent the RDDs and the edges
represent the Operation to be applied on RDD. In Spark DAG, every edge is
directed from earlier to later in the sequence. On calling of Action, the created
DAG is submitted to DAG Scheduler which further splits the graph into the
stages of the task.
Directed Acyclic Graph model
A DAG is a finite directed graph with no directed cycles. There are finitely many
vertices and edges, where each edge is directed from one vertex to another, and the
vertices can be ordered in a sequence such that every edge is directed from earlier
to later in the sequence. The DAG execution model is a strict generalization of the
MapReduce model, and DAG-based engines can do better global optimization than
systems like MapReduce.
Apache Spark's DAG view allows the user to dive into any stage and expand its
details. In the stage view, the details of all RDDs belonging to that
stage are shown. The scheduler splits the work on a Spark RDD into stages based
on the various transformations applied. Each stage is composed of tasks, one per
partition of the RDD, which perform the same computation in parallel.
The graph here refers to the flow of the computation, and directed and acyclic refer
to how that flow proceeds.
RDDs are great if you want to keep holding a dataset in memory and fire a
series of queries at it - this works better than fetching the data from disk every time.
Another important RDD concept is that there are two types of operations that can
be performed on an RDD: 1) transformations, like map and filter, that result in
another RDD, and 2) actions, like count, that result in an output. A Spark job
comprises a DAG of tasks executing transformations and actions on RDDs.
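A minimal PySpark sketch of that transformation/action split (assuming a local Spark installation; the file name events.log is hypothetical): filter and map are lazy and only extend the DAG, while the count actions trigger the DAG Scheduler to split the graph into stages and run them.

from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-sketch")

lines = sc.textFile("events.log")                 # an RDD; nothing is read yet
errors = lines.filter(lambda l: "ERROR" in l)     # transformation -> new RDD
codes = errors.map(lambda l: l.split()[0])        # transformation -> new RDD
codes.cache()                                     # keep this RDD in memory for reuse

print(codes.count())                              # action: the DAG is submitted and executed
print(codes.distinct().count())                   # second action reuses the cached RDD

sc.stop()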
Graph Model
Consider a relationship between two people interacting via
Facebook. There are characteristics of the relationship that are
not necessarily attributes of either individual, such as the nature
of their relationship (personal or professional), the duration,
where they met, or how frequently they correspond.
The relational model can’t optimally capture all the valuable
information associated with the connections. These limitations
have become more acute as the domain of data sets extends
beyond traditional structured data models and encompasses
unstructured data and data streams continuously fed by Internet-
connected devices and sensors and human-generated content
from Internet communities.
Graph databases provide an alternative approach to data
representation that not only captures information about entities
and their attributes but also elevates relationships among the
entities to be first-class objects.
Graph Model
Graphs consist of a collection of vertices (also called nodes or points) that represent the
modeled entities; vertices are connected by edges (also called links, connections, or
relationships) that capture the way that two entities are related.

(Figure: an example graph.) This model can represent attributes of each entity (such as
the manufacturer’s address) as well as attributes of each relationship, such as the dates
of employment associated with each “employed-by” edge in the graph.
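A minimal sketch of this property-graph idea using Python's networkx library (the names, dates, and address are hypothetical): both the vertices and the “employed-by” edge carry their own attributes.

import networkx as nx

g = nx.DiGraph()  # directed, since "employed-by" flows from person to company

g.add_node("Larry Brown", kind="person", title="Field Engineer")
g.add_node("Millipede Electronics", kind="company", address="12 Example Way")

g.add_edge("Larry Brown", "Millipede Electronics",
           relation="employed-by", start="2016-03", end="2019-07")

# Attributes of the relationship are first-class, queryable data.
print(g.edges["Larry Brown", "Millipede Electronics"]["relation"])   # employed-by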
Graph Model
Graph data processing engines can ingest and represent the qualitative characteristics of
both the entities and the links among them. The captured information is embedded in the
connections between things, not just the characteristics of the things themselves.
Business environments suited to a graph data processing solution share these general
features:
-- Connectivity: First and foremost, the environment involves documenting and
understanding connected entities.
-- Entity volume: There are a large number of entities that can possibly be connected, such
as the number of e-commerce website visitors and the products they view.
-- Entity variety: There are entities with different characteristics, such as individuals with
different job skills using a recruiting application.
-- Link attribution: There are relevant characteristics associated with the connections
between entities. For example, a person may have an employment relationship with a
company, and that relationship may have a title, a duration, a location, and a salary.
Both vertices and edges can have descriptive attributes, and these attribute values are
used for specialized analyses and searches within the graph. Edges can be undirected
when the relationship is mutual (such as “Joe is the sibling of Jane”). Alternatively, the
edge can be directed, which means that the relationship flows differently according to its
direction. An example is “Joe is the brother of Jane” and “Jane is the sister of Joe.”
Graph Model

One method for representing graphs uses the W3C’s Resource Description Framework, called RDF. In RDF, you specify
triples that follow a subject-predicate-object format, such as “Larry Brown is-employed-by Millipede Electronics.” In this
example, “Larry Brown” is the subject, “Millipede Electronics” is the object, and “is-employed-by” is the predicate that
relates the subject to the object.
In effect, the RDF triple defines a directed edge between the subject and the object. A collection of RDF triples can
capture the format of a graph. Within the graph database, various data structures represent the graph. Some are
straightforward, such as Java objects linked with pointers, and others are optimized using different types of data
structures, such as a sparse matrix data structure.
Graphs are queried using a graph query language such as SPARQL -- a recursive acronym for SPARQL Protocol and
RDF Query Language. SPARQL is a semantic query language, allowing queries based on the attributes of the vertices,
the attributes of the edges, and the structure of the connections. For example, you can query the graph for “all individuals
who have three outgoing edges that connect to companies that have more than 500 employees.”
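A minimal sketch of the triple-and-query idea using Python's rdflib library (the example.org namespace and resource names are hypothetical): the “Larry Brown is-employed-by Millipede Electronics” triple is added as a directed edge and then queried back with SPARQL.

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# subject - predicate - object, i.e. one directed edge of the graph
g.add((EX.LarryBrown, EX.isEmployedBy, EX.MillipedeElectronics))

query = """
PREFIX ex: <http://example.org/>
SELECT ?person WHERE { ?person ex:isEmployedBy ex:MillipedeElectronics . }
"""

for row in g.query(query):
    print(row.person)   # -> http://example.org/LarryBrown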
Search is just one analytics algorithm that you can apply to graphs. Others include:
-- Partitioning isolates portions of the graph into smaller pieces with particular properties. Example uses include
partitioning a telecommunications network based on serving particular geographic regions or organizing sales
territories by the location of the sales staff.
-- Shortest path analysis seeks the most efficient path between two nodes. A good example is examining all the delivery
points for a parcel delivery company each day to determine routes that require the least amount of fuel.
-- Analysis can locate connected components, subgraphs in which all the vertices can be reached from any other
member of the connected component. Connected components often represent distinct clusters of entities, and you
can use them for segmentation.
-- PageRank characterizes the importance of a vertex within the graph. The algorithm, named after Google co-founder Larry
   Page, is part of how search engines rank websites based on their content and connections to other websites.
-- Centrality is another algorithm used to identify the most important or most influential entities within the graph.
BSP Model
The Bulk Synchronous Parallel (BSP) abstract computer is a model for designing parallel
algorithms. An important part of analysing a BSP algorithm rests on quantifying the
synchronization and communication needed.
A BSP computer consists of
• Components capable of processing and/or local memory transactions (i.e., processors),
• A network that routes messages between pairs of such components, and
• A hardware facility that allows for the synchronization of all or a subset of components.
This is commonly interpreted as a set of processors which may follow different threads of
computation, with each processor equipped with fast local memory and interconnected by a
communication network. A BSP algorithm relies heavily on the third feature: a computation
proceeds in a series of global supersteps, each of which consists of three components (a minimal sketch follows the list below):
Concurrent computation: Every participating processor may perform local computations,
i.e., each process can only make use of values stored in the local fast memory of the
processor. The computations occur asynchronously with respect to the others but may overlap with
communication.
Communication: The processes exchange data between themselves to facilitate remote
data storage capabilities.
Barrier synchronization: When a process reaches this point (the barrier), it waits until all
other processes have reached the same barrier.
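A minimal sketch of one superstep in plain Python, with threads standing in for BSP processors (the mailbox scheme and worker roles are illustrative assumptions, not a real BSP library): each worker computes locally, communicates by delivering a message, and then waits at the barrier before the next superstep consumes those messages.

import threading

P = 4                                  # number of simulated processors
barrier = threading.Barrier(P)
inbox = [[] for _ in range(P)]         # one mailbox per processor

def worker(rank, data):
    local = sum(data)                  # superstep 1: concurrent local computation
    inbox[0].append(local)             # communication: send partial result to processor 0
    barrier.wait()                     # barrier synchronization: all must arrive before anyone continues
    if rank == 0:                      # superstep 2: messages are now safely delivered
        print("global sum:", sum(inbox[0]))

threads = [threading.Thread(target=worker, args=(r, [r] * 10)) for r in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()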
BSP Model
The computation and communication actions do not have to be ordered in time.
Communication typically takes the form of the one-sided put and get Direct
Remote Memory Access (DRMA) calls, rather than paired two-sided send and
receive message passing calls. The barrier synchronization concludes the
superstep. It ensures that all one-sided communications are properly
concluded. Systems based on two-sided communication include this
synchronization cost implicitly for every message sent. The method for barrier
synchronization relies on the hardware facility of the BSP computer. This facility
periodically checks if the end of the current superstep is reached globally.
The BSP model is also well-suited to enable automatic memory management
for distributed-memory computing through overdecomposition of the problem
and oversubscription of the processors. The computation is divided into more
logical processes than there are physical processors, and processes are
randomly assigned to processors. This strategy leads to almost perfect load
balancing, both of work and communication.
Big Data Technology
Thank you

The End
