VISUALIZATION
Topics: MapReduce (Hadoop, Hive, MapR), Sharding, NoSQL Databases, S3, Hadoop Distributed File System, Visualizations, visual data analysis techniques, interaction techniques
Systems and applications:
MapReduce
[Figure: MapReduce pipeline. Very big data is partitioned, passed through the Map function, and combined by Reduce into the result.]
Map:
• Accepts an input key/value pair
• Emits intermediate key/value pairs
Reduce:
• Accepts an intermediate key/value* pair (a key with its list of values)
• Emits output key/value pairs
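The Map/Reduce contract above can be sketched in plain Python. This is a toy, single-process illustration (word count); real frameworks distribute these phases across a cluster:

```python
from collections import defaultdict

def map_fn(key, value):
    # Accepts an input key/value pair (here: document name, text)
    # and emits intermediate key/value pairs.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # Accepts an intermediate key with its list of values,
    # and emits an output key/value pair.
    yield key, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k, v in inputs:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)   # shuffle/partition phase
    result = {}
    for ik, ivs in intermediate.items():
        for ok, ov in reduce_fn(ik, ivs):
            result[ok] = ov
    return result

counts = run_mapreduce([("doc1", "big data big result")], map_fn, reduce_fn)
# counts == {"big": 2, "data": 1, "result": 1}
```

The grouping of intermediate values by key between the two phases is exactly the partitioning step shown in the figure above.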
EX: PageRank
• Simulates a random surfer
• Begins with pairs (URL, list-of-URLs)
• Maps to (URL, (PR, list-of-URLs))
• Maps again, taking the above data and, for each u in list-of-URLs, emitting (u, PR/|list-of-URLs|) as well as (u, new-list-of-URLs)
• Reduce receives (URL, list-of-URLs) and many (URL, value) pairs, and calculates (URL, (new-PR, list-of-URLs))
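One iteration of the scheme above can be sketched in Python. This is a toy, single-process illustration with a damping factor of 0.85 (an assumption on my part; the slide does not specify one), and the variable names are mine:

```python
# Toy sketch of one PageRank iteration expressed as map + reduce.
# State: {url: (pr, list_of_urls)}
def pagerank_iteration(state, damping=0.85):
    # Map phase: for each (URL, (PR, list-of-URLs)), emit
    # (u, PR/|list-of-URLs|) for every outgoing link u,
    # plus the link list itself so the graph survives the iteration.
    contributions = {}   # url -> list of PR shares received
    links = {}           # url -> list-of-URLs, passed through unchanged
    for url, (pr, out) in state.items():
        links[url] = out
        for u in out:
            contributions.setdefault(u, []).append(pr / len(out))
    # Reduce phase: for each URL, combine the link list and the
    # incoming shares into (URL, (new-PR, list-of-URLs)).
    new_state = {}
    for url in state:
        shares = contributions.get(url, [])
        new_pr = (1 - damping) + damping * sum(shares)
        new_state[url] = (new_pr, links[url])
    return new_state

# Two pages linking to each other: each keeps PR 1.0 at the fixed point.
graph = {"a": (1.0, ["b"]), "b": (1.0, ["a"])}
graph = pagerank_iteration(graph)
```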
Hadoop
• Open source
• Java-based implementation of MapReduce
• Uses HDFS as its underlying file system
• The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
• Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes.
• Why use Hadoop?
Cheaper, faster, better.
• Characteristics
Reliable, economical, scalable, flexible
Hadoop
[Figure: Map and Reduce running on top of HDFS]
Architecture and Core components
• Core Components of Hadoop Cluster:
Hadoop cluster has 3 components:
• Client
• Master
• Slave
• Client:
It is neither master nor slave; rather, it loads data into the cluster, submits MapReduce jobs describing how the data should be processed, and then retrieves the results after job completion.
• Masters:
The Masters consist of 3 components: NameNode, Secondary NameNode, and JobTracker.
• NameNode:
The NameNode does NOT store the files themselves, only each file's metadata. In a later section we will see that it is actually the DataNodes which store the files.
The NameNode oversees the health of the DataNodes and coordinates access to the data stored in them.
The NameNode keeps track of all file-system-related information, such as:
• Which section of a file is saved in which part of the cluster
• Last access time for the files
• User permissions, i.e., which users have access to a file
• JobTracker:
JobTracker coordinates the parallel
processing of data using MapReduce.
Slave nodes are the majority of machines in a Hadoop cluster and are responsible for:
• Storing the data
• Processing the computation
• Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
Hadoop Ecosystem
Hadoop Limitations
1. Security Concerns
Just managing a complex application such as Hadoop can be challenging. A classic example can be seen in the Hadoop security model, which is
disabled by default due to sheer complexity. If whoever’s managing the platform lacks the knowhow to enable it, your data could be at huge risk.
Hadoop is also missing encryption at the storage and network levels, which is a major selling point for government agencies and others that prefer
to keep their data under wraps.
2. Vulnerable By Nature
Speaking of security, the very makeup of Hadoop makes running it a risky proposition. The framework is written almost entirely in Java, one of the
most widely used yet controversial programming languages in existence. Java has been heavily exploited by cybercriminals and as a result,
implicated in numerous security breaches. For this reason, several experts have suggested dumping it in favor of safer, more efficient alternatives.
3. Not Fit for Small Data
While big data isn’t exclusively made for big businesses, not all big data platforms are suited for small data needs. Unfortunately, Hadoop happens
to be one of them. Due to its high capacity design, the Hadoop Distributed File System or HDFS, lacks the ability to efficiently support the random
reading of small files. As a result, it is not recommended for organizations with small quantities of data.
4. Potential Stability Issues
Hadoop is an open source platform. That essentially means it is created by the contributions of the many developers who continue to work on the
project. While improvements are constantly being made,
like all open source software, Hadoop has had its fair share of stability issues. To avoid these issues, organizations are strongly recommended to
make sure they are running the latest stable version, or run it under a third-party vendor equipped to handle such problems.
5. General Limitations
One of the most interesting highlights of the Google article referenced earlier mentions that when it comes to making the most of big data,
Hadoop may not be the only answer. The article introduces Apache Flume, MillWheel, and Google’s own Cloud Dataflow as possible solutions.
What each of these platforms has in common is the ability to improve the efficiency and reliability of data collection, aggregation, and integration. The main point the article stresses is that companies could be missing out on big benefits by using Hadoop alone.
Apache Hive is a data warehouse infrastructure built on
top of Hadoop for providing data summarization, query,
and analysis.
Major components of the Hive architecture are:
•Metastore: Stores metadata for each of the tables, such as their schema and location. It also includes the partition metadata, which helps the driver track the progress of the various data sets distributed over the cluster. The data is stored in a traditional RDBMS format. The metadata helps the driver keep track of the data and is critical; hence, a backup server regularly replicates it so it can be retrieved in case of data loss.
•Driver: Acts like a controller which receives the HiveQL statements. It starts the execution of statement by creating sessions and
monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of an
HiveQL statement. The driver also acts as a collection point of data or query result obtained after the Reduce operation.
•Compiler: Compiles the HiveQL query, converting it to an execution plan. This plan contains the tasks and steps that Hadoop MapReduce needs to perform to produce the output the query describes. The compiler first converts the query to an abstract syntax tree (AST). After checking for compatibility and compile-time errors, it converts the AST to a directed acyclic graph (DAG). The DAG divides operators into MapReduce stages and tasks based on the input query and data.
•Optimizer: Performs various transformations on the execution plan to get an optimized DAG. Transformations can be aggregated together, such as converting a pipeline of joins into a single join, for better performance. It can also split tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability. The transformation logic used for optimization can also be modified or pipelined using another optimizer.
•Executor: After compilation and optimization, the Executor runs the tasks according to the DAG. It interacts with the Hadoop JobTracker to schedule tasks. It takes care of pipelining the tasks by making sure that a task with dependencies is executed only after all of its prerequisites have run.
•CLI, UI, and Thrift Server: Command Line Interface and UI (User Interface) allow an external user to interact with Hive by
submitting queries, instructions and monitoring the process status. Thrift server allows external clients to interact with Hive just
like how JDBC/ODBC servers do.
MapR
• MapR is a commercial distribution of Hadoop
aimed at enterprises.
• It includes its own file system, a replacement for HDFS, along with other tweaks to the framework, such as distributed name nodes for improved reliability.
• The new file system aims to offer increased
performance, as well as easier backups and
compatibility with NFS to make it simpler to
transfer data in and out.
• The programming model is still the standard Hadoop one; the focus is on improving the infrastructure surrounding the core framework to make it more appealing to corporate customers.
Sharding
• Any database that’s spread across multiple machines needs some scheme to
decide which machines a given piece of data should be stored on.
• A sharding system makes this decision for each row in a table, using its key. In
the simplest case, the application programmer will specify an explicit rule to use
for sharding.
• The biggest problems with sharding are splitting the data evenly across machines
and dealing with changes in the size of the cluster.
• To ease the pain of these problems, more complex schemes are used to split up
the data.
• Another popular approach is the use of consistent hashing for the sharding. This
technique uses a small table splitting the possible range of hash values into
ranges, with one assigned to each shard.
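The consistent-hashing approach described above can be sketched as follows. This is a minimal illustration; production systems add virtual nodes and replication, which are omitted here:

```python
import bisect
import hashlib

# Minimal consistent-hash ring: each shard owns a range of hash values.
class HashRing:
    def __init__(self, shards):
        # Place each shard at a point on the ring determined by its hash.
        self.ring = sorted((self._hash(s), s) for s in shards)
        self.points = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, key):
        # A key belongs to the first shard clockwise from its hash value;
        # the modulo wraps around the end of the ring.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["shard-0", "shard-1", "shard-2"])
```

The payoff is that adding or removing one shard only remaps the keys in that shard's range, rather than reshuffling the whole keyspace as a naive hash(key) % n rule would.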
NoSQL Databases
• If "NoSQL" seems too combative, think of it as "Not Only SQL". These are all tools designed to trade the reliability and ease of use of traditional databases for the flexibility and performance required by the new problems developers are encountering.
Should I be using NoSQL Databases?
NoSQL data storage systems make sense for applications that need to deal with very large amounts of semi-structured data, e.g. log analysis.
Riak
• Like Voldemort, Riak was inspired by Amazon's Dynamo database; it offers a key/value interface and is designed to run on large distributed clusters.
• It also uses consistent hashing and a gossip protocol to avoid the need for the kind of
centralized index server that BigTable requires, along with versioning to handle update conflicts.
• Querying is handled using MapReduce functions written in either Erlang or JavaScript.
• It’s open source under an Apache license, but there’s also a closed source commercial version with
some special features designed for enterprise customers.
ZooKeeper
• The ZooKeeper framework was originally built at Yahoo! to make it easy for the company's applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer many features that help coordinate work across distributed clusters.
• One way to think of it is as a very specialized key/value store, with an interface
that looks a lot like a filesystem and supports operations like watching callbacks,
write consensus, and transaction IDs that are often needed for coordinating
distributed algorithms.
• This has allowed it to act as a foundation layer for services like LinkedIn's Norbert, a flexible framework for managing clusters of machines.
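The idea of a "specialized key/value store with watch callbacks and transaction IDs" can be illustrated with a toy in-memory sketch. This is not the ZooKeeper API (real ZooKeeper exposes a filesystem-like namespace over a replicated quorum); the class and names here are made up for illustration:

```python
# Toy in-memory key/value store with one-shot watch callbacks,
# illustrating the coordination primitive ZooKeeper provides.
class TinyCoordinator:
    def __init__(self):
        self.data = {}
        self.watches = {}   # key -> list of callbacks
        self.zxid = 0       # monotonically increasing transaction id

    def set(self, key, value):
        self.zxid += 1      # every write gets a transaction id
        self.data[key] = value
        # Fire and clear watches; ZooKeeper watches are one-shot too.
        for cb in self.watches.pop(key, []):
            cb(key, value, self.zxid)
        return self.zxid

    def get(self, key, watch=None):
        if watch is not None:
            self.watches.setdefault(key, []).append(watch)
        return self.data.get(key)

events = []
zk = TinyCoordinator()
zk.get("/config/leader", watch=lambda k, v, z: events.append((k, v, z)))
zk.set("/config/leader", "node-3")
# events == [("/config/leader", "node-3", 1)]
```

A client that registers a watch is notified when the key changes, which is the building block for leader election and cluster-membership schemes.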
S3
• Amazon’s S3 service lets you store large chunks of data on an online service, with an interface that makes
it easy to retrieve the data over the standard web protocol, HTTP.
• (For example, you could store the data for a .png image in the system using the API provided by Amazon. You'd first have to create a bucket, which is a bit like a global top-level directory that's owned by one user and must have a unique name. You'd then supply the bucket name, a file name (which can contain slashes, and so may appear like a file in a subdirectory), the data itself, and any metadata to create the object. The object can then be fetched by a web browser at an address like http://yourbucket.s3.amazonaws.com/your/file/name.png.)
• It’s cheap
• well-documented
• reliable
• fast
• copes with large amounts of traffic
• It is very easy to access from almost any environment
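The bucket/key addressing in the example above can be made concrete with a small helper that builds an object's public HTTP URL. The bucket and key names are the placeholder ones from the slide, not real resources:

```python
# Build the bucket-style public URL for an S3 object, matching the
# address pattern http://yourbucket.s3.amazonaws.com/your/file/name.png
def s3_object_url(bucket, key, scheme="http"):
    # Keys may contain slashes, so they look like paths in subdirectories.
    return f"{scheme}://{bucket}.s3.amazonaws.com/{key}"

url = s3_object_url("yourbucket", "your/file/name.png")
# url == "http://yourbucket.s3.amazonaws.com/your/file/name.png"
```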
[Figure: InfoSphere Guardium Amazon S3 architecture on the Amazon cloud]
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is
designed to support applications like MapReduce jobs
that read and write large amounts of data in batches,
rather than more randomly accessing lots of small
files.
• Unlike S3, it supports renaming and moving files, which makes coherency problems easier to handle.
• The client software buffers written data in a temporary local file until there is enough to fill a complete HDFS block. (All files are stored in these blocks, with a default size of 64 MB.)
• To simplify the architecture, HDFS uses a single name
node to keep track of which files are stored where. This
does mean there’s a single point of failure and
potential performance bottleneck.
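The block-based layout can be made concrete with a little arithmetic, assuming the default 64 MB block size mentioned above:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size: 64 MB

def num_blocks(file_size_bytes):
    # Files are split into fixed-size blocks; the last block may be partial,
    # and even a tiny file still occupies one block entry in the name node.
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

# A 1 GB file occupies 16 blocks of 64 MB each.
blocks = num_blocks(1024 * 1024 * 1024)
# blocks == 16
```

This is also why HDFS handles small files poorly: a million 1 KB files mean a million block entries for the single name node to track.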
Visualizations
• One of the best ways to communicate the meaning of data is by extracting the
important parts and presenting them graphically.
• This is helpful both for internal use, as an exploration technique to spot patterns that aren't obvious from the raw values, and as a way to succinctly present end users with understandable results.
Gephi
• Gephi is an open source Java application that creates network visualizations from
raw edge and node graph data. It’s very useful for understanding social network
information; one of the project’s founders was hired by LinkedIn, and Gephi is now
used for LinkedIn visualizations. There are several different layout algorithms,
each with multiple parameters you can tweak to arrange the positions of the
nodes in your data. If there are any manual changes you want to make, to either
the input data or the positioning, you can do that through the data laboratory, and
once you’ve got your basic graph laid out, the preview tab lets you customize the
exact appearance of the rendered result. Though Gephi is best known for its
window interface, you can also script a lot of its functions from automated
backend tools, using its toolkit library.
GraphViz
• GraphViz is a command-line network graph visualization tool. It’s mostly used for
general purpose flowchart and tree diagrams rather than the less structured
graphs that Gephi’s known for. It also produces comparatively ugly results by
default, though there are options to pretty-up the fonts, line rendering, and drop
shadows. Despite those cosmetic drawbacks, GraphViz is still a very powerful tool
for producing diagrams from data. Its DOT file specification has been adopted as
an interchange format by a lot of programs, making it easy to plug into many
tools, and it has sophisticated algorithms for laying out even massive numbers of
nodes.
Processing
• Initially best known as a graphics programming language that was accessible to
designers, Processing has become a popular general-purpose tool for creating
interactive web visualizations. It has accumulated a rich ecosystem of libraries,
examples, and documentation, so you may well be able to find an existing
template for the kind of information display you need for your data
Protovis
• Protovis is a JavaScript framework packed full of ready-to-use visualization
components like bar and line graphs, force-directed layouts of networks, and
other common building blocks. It’s great as a high-level interface to a toolkit of
existing visualization templates, but compared to Processing, it’s not as easy to
build completely new components. Its developers have recently announced that
Protovis will no longer be under active development, as they focus their efforts on
the D3 library, which offers similar functionality but in a style heavily influenced by
the new generation of JavaScript frameworks like jQuery.
Fusion Tables
• Google has created an integrated online system that lets you store large amounts
of data in spreadsheet-like tables and gives you tools to process and visualize the
information. It’s particularly good at turning geographic data into compelling
maps, with the ability to upload your own custom KML outlines for areas like
political constituencies. There is also a full set of traditional graphing tools, as well
as a wide variety of options to perform calculations on your data. Fusion Tables is a
powerful system, but it’s definitely aimed at fairly technical users; the sheer
variety of controls can be intimidating at first. If you’re looking for a flexible tool to
make sense of large amounts of data, it’s worth making the effort.
Tableau
• Originally a traditional desktop application for drawing graphs and visualizations,
Tableau has been adding a lot of support for online publishing and content
creation. Its embedded graphs have become very popular with news organizations
on the Web, illustrating a lot of stories. The support for geographic data isn’t as
extensive as Fusion’s, but Tableau is capable of creating some map styles that
Google’s product can’t produce. If you want the power user features of a desktop
interface or are focused on creating graphics for professional publication, Tableau
is a good choice.
Basic Interaction Techniques
Interaction techniques are used in information visualization to overcome limitations in screen real estate and bounded cognition.
• Selecting
• Mouse click
• Mouseover / hover / tooltip
• Lasso / drag
• Rearrange
• Move
• Sort
• Delete
Selecting
Advanced Interaction Techniques
• No distortion
• Shows both overview and
details simultaneously
• Drawback: requires the viewer to consciously shift their focus of attention.
Focus + Context
• A single view shows information in context
• Contextual info is near the focal point
• Distortion may make some parts hard to interpret
• Distortion may obscure structure in data
• We’ll have a lecture on distortion later
• Standard Zooming
• Get close in to see information in more detail
• Example: Google earth zooming in
• Intelligent Zooming
• Show semantically relevant information out of proportion
• Smart speed up and slow down
• Example: speed-dependent automatic zooming, Igarashi & Hinckley
• Semantic Zooming
• Zooming can be conceptual as opposed to simply reducing pixels
• Example tool: Pad++ and Piccolo projects
Standard vs. Semantic Zooming
• Geometric (standard) zooming:
• The view depends on the physical properties of what is being viewed
• Semantic Zooming:
• When zooming away, instead of seeing a scaled-down version of an object, see a
different representation
• The representation shown depends on the meaning to be imparted.