
DATA ANALYSIS

An Introduction to Big Data Concepts

PREPARED BY
ASST. PROF. SANTOSH KUMAR RATH
GOVERNMENT COLLEGE OF ENGINEERING KALAHANDI,
BHAWANIPATNA

Big Data Analysis


Big data is a term for data sets that are so large or complex that traditional data processing
applications are inadequate. Challenges include analysis, capture, data curation, search, sharing,
storage, transfer, visualization, querying and information privacy.
Big Data has become one of the most talked-about technology trends. The real challenge for large
organizations is to get the maximum value out of the data they already have and to predict what kind
of data to collect in the future. How to take existing data and make it meaningful, so that it provides
accurate insight into the past, is one of the key discussion points in many executive meetings in
organizations. With the explosion of data the challenge has moved to the next level, and Big Data is
now becoming a reality in many organizations.
3 Vs of Big Data: Volume, Velocity and Variety
As organizations have grown, the data associated with them has grown exponentially, and today
there are many complexities to that data. Most large organizations have data in multiple
applications and in different formats. The data is also spread out so widely that it is hard to
categorize with a single algorithm or piece of logic. The mobile revolution we are experiencing right
now has completely changed how we capture data and build intelligent systems. Large
organizations are indeed facing challenges in keeping all their data on a platform that gives them
a single, consistent view of their data. This unique challenge, to make sense of all the data coming in
from different sources and to derive useful, actionable information from it, is the revolution the Big
Data world is facing.
Big data can be described by the following characteristics:
Volume: The quantity of generated and stored data. The size of the data determines the value and
potential insight, and whether it can actually be considered big data or not.
Variety: The type and nature of the data. This helps people who analyze it to effectively use the
resulting insight.
Velocity: In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.


3 Vs of Big Data

Volume
We currently see exponential growth in data storage, as data is now much more than text.
We find data in the form of videos, music and large images on our social media channels. It is
very common for enterprises to have storage systems of terabytes and even petabytes. As the
database grows, the applications and architecture built to support the data need to be re-evaluated
quite often. Sometimes the same data is evaluated from multiple angles, and even though the
original data is the same, the newfound intelligence creates an explosion of data. This big volume
indeed represents Big Data.
Velocity
The growth of data and the explosion of social media have changed how we look at data. There was a
time when we used to believe that yesterday's data is recent. As a matter of fact, newspapers still
follow that logic. However, news channels and radio have changed how fast we receive the
news. Today, people rely on social media to keep them updated with the latest happenings. On social
media, messages that are just a few seconds old (a tweet, a status update, etc.) are often no longer of
interest to users. They discard old messages and pay attention to recent updates. Data
movement is now almost real time and the update window has reduced to fractions of a second.
This high-velocity data represents Big Data.
Variety
Data can be stored in multiple formats, for example in a database, Excel, CSV, Access or, for that
matter, in a simple text file. Sometimes the data is not even in a traditional format; it may be in the
form of video, SMS, PDF or something we might not have thought about. It is the organization's job
to arrange it and make it meaningful. That would be easy if all the data were in the same format,
but that is rarely the case. The real world has data in many different formats, and that is the
challenge we need to overcome with Big Data. This variety of data represents Big Data.

Evolution of Big Data


In earlier days, data was stored in flat files, and a flat file has no structure. If any data
had to be retrieved from a flat file, it was a project in itself. There was no way to retrieve
the data efficiently, and data integrity was just a term discussed without any modeling or
structure around it. Databases residing in flat files had more issues than we would like to discuss in
today's world. It was more like a nightmare when any data processing was involved in the
application. Though applications developed at that time were also not that advanced, the need for
data was always there, and there was always a need for proper data management.
Relational Database Management Systems
Edgar Frank Codd was a British computer scientist who, while working for IBM, invented the
relational model for database management, the theoretical basis for relational databases. He
presented 12 rules for the relational database, and suddenly the chaotic world of databases
seemed to find discipline in those rules. The relational database was a promised land for all the
unstructured-database users. The relational database introduced relationships between data and
also improved the performance of data retrieval.
Data Warehousing
The enormous growth of data then presented a big challenge for organizations that wanted to
build intelligent systems based on the data and provide a near-real-time, superior user experience to
their customers. Various organizations immediately started building data warehousing solutions
where the data was stored and processed. Business intelligence became an everyday need.
Data was received from transaction systems and processed overnight to build
intelligent reports from it. Though this was a great solution, it had its own set of challenges. The
relational database model and data warehousing concepts were all built with traditional
relational database modeling in mind, and they still faced many challenges when unstructured data
was present.
Interesting Challenge
Every organization had expertise in managing structured data, but the world had already moved to
unstructured data. There was intelligence in videos, photos, SMS, text, social media messages
and various other data sources. All of these now needed to be brought onto a single platform to build a
uniform system that does what businesses need. The way we do business has also changed.
There was a time when users only got the features that technology supported; now users
ask for a feature and the technology is built to support it. The need for real-time
intelligence from fast-paced data flows is now becoming a necessity.
A large amount (Volume) of diverse (Variety), high-speed (Velocity) data: these are the properties of the
data. Traditional database systems have limits in resolving the challenges this new kind of data
presents. Hence the need for Big Data science. We need innovation in how we handle and
manage data. We need creative ways to capture data and present it to users.

Basics of Big Data Architecture


Big Data Cycle
Just like every other database-related application, a Big Data project has its own development cycle.
The three Vs certainly play an important role in deciding the architecture of Big Data
projects. Just like every other project, a Big Data project goes through similar phases of data
capturing, transforming, integrating, analyzing and building actionable reporting on top of the
data.
While the process looks almost the same, due to the nature of the data the architecture is often
totally different. Here are a few of the questions everyone should ask before going ahead
with a Big Data architecture.
How big is your total database?
What is your reporting requirement in terms of time: real time, semi real time or at
frequent intervals?
How important is data availability, and what is the plan for disaster recovery?
What are the plans for network and physical security of the data?
What platform will be the driving force behind the data, and what are the different service level
agreements for the infrastructure?
Building Blocks of Big Data Architecture

The image above gives a good overview of how the various components of a Big Data architecture are
associated with each other. In a Big Data architecture, many different data sources are involved;
hence extraction, transformation and integration are among the most essential layers of the architecture.

Most of the data is stored in relational as well as non-relational data marts and data warehousing
solutions. As per the business need, the data is processed and converted into proper reports
and visualizations for end users. Just like the software, the hardware is almost the most important part
of a Big Data architecture. In a Big Data architecture the hardware infrastructure is extremely
important, and failover instances as well as redundant physical infrastructure are usually
implemented.
NoSQL in Data Management
NoSQL is a very famous buzzword and it really means Not Relational SQL or Not Only SQL. This is
because in a Big Data architecture the data can be in any format. It can be unstructured, relational or in
any other format or from any other data source. To bring all the data together, relational technology
is not enough; hence new tools, architectures and algorithms were invented that take care of
all kinds of data. These are collectively called NoSQL.
NoSQL stands for Not Relational SQL or Not Only SQL. Lots of people think that NoSQL means
there is no SQL, which is not true; the two sound the same but the meaning is totally
different. NoSQL does use SQL, but it uses more than SQL to achieve its goal. As per
Wikipedia's NoSQL database definition, a NoSQL database provides a mechanism for storage and
retrieval of data that uses looser consistency models than traditional relational databases.
Why use NoSQL?
A traditional relational database usually deals with predictable, structured data. As the world
has moved forward with unstructured data, we often see the limitations of the traditional relational
database in dealing with it. For example, nowadays we have data in the format of SMS, wave files,
photos and videos. It is a bit difficult to manage them using a traditional relational
database. People often use a BLOB field to store such data. A BLOB can store the data, but when
we have to retrieve or process it, the same BLOB is extremely slow at handling
unstructured data. A NoSQL database is the type of database that can handle the unstructured,
unorganized and unpredictable data that our business needs.
Along with the support for unstructured data, the other advantages of a NoSQL database are high
performance and high availability.
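As a minimal sketch of the flexible-schema idea (plain Python, no particular NoSQL product assumed), documents with different fields can live side by side in the same collection, something a fixed relational schema cannot do without altering tables:

import json

# A "collection" of documents: each record carries its own structure,
# so an SMS, a photo record and a video record can coexist.
collection = [
    {"type": "sms",   "from": "+1555000111", "text": "Order shipped"},
    {"type": "photo", "user": "alice", "tags": ["holiday", "beach"], "size_kb": 2048},
    {"type": "video", "user": "bob", "duration_s": 315, "codec": "h264"},
]

# An ad-hoc query: everything belonging to a given user, regardless of shape.
def by_user(docs, user):
    return [d for d in docs if d.get("user") == user]

print(json.dumps(by_user(collection, "alice"), indent=2))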


Hadoop
What is Hadoop?
Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful
distributed platform to store and manage Big Data. It is licensed under the Apache V2 license. It runs
applications on large clusters of commodity hardware and can process thousands of terabytes of
data on thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File
System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and
high availability.
What are the core components of Hadoop?
There are two major components of the Hadoop framework, and each of them performs one of its two
important tasks.

Hadoop MapReduce is the method used to split a larger data problem into smaller chunks and
distribute them to many different commodity servers. Each server has its own set of
resources and processes its chunk locally. Once a commodity server has processed
its data, it sends the result back to the main server. This is effectively a process by which we
handle large data sets effectively and efficiently (a minimal streaming sketch appears after this
list, and MapReduce is discussed in detail in a later section).
Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference
between HDFS and other file systems. When we move a file onto HDFS, it is
automatically split into many small pieces. These small chunks of the file are replicated and
stored on other servers (usually 3) for fault tolerance and high availability (discussed in a
later section).

Besides the above two core components, the Hadoop project also contains the following modules.

Hadoop Common: Common utilities for the other Hadoop modules

Hadoop YARN: A framework for job scheduling and cluster resource management
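To make the MapReduce component concrete, here is a minimal word-count sketch written in the Hadoop Streaming style, where plain scripts act as the mapper and reducer. This is an illustration of the programming model only; it assumes the Streaming utility and a Python interpreter are available on the cluster nodes, and the file names are made up.

# mapper.py: reads raw text from standard input and emits one "word<TAB>1" pair per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py: Hadoop delivers the mapper output sorted by key,
# so identical words arrive on consecutive lines and can be summed.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The two scripts can be tested locally with an ordinary shell pipeline (cat input.txt | python mapper.py | sort | python reducer.py) before they are submitted to a cluster.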

A Multi-node Hadoop Cluster Architecture


A small Hadoop cluster includes a single master node and multiple worker or slave nodes. As
discussed earlier, the entire cluster contains two layers: one is the MapReduce layer and
the other is the HDFS layer. Each of these layers has its own relevant components. The master node
consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node consists of
a DataNode and a TaskTracker. It is also possible for a slave or worker node to be only a data node or
only a compute node; as a matter of fact, that is a key feature of Hadoop.
Why Use Hadoop?
There are many advantages of using Hadoop. Let me quickly list them over here:

Robust and Scalable: We can add new nodes as needed and also modify them.
Affordable and Cost Effective: We do not need any special hardware to run Hadoop.
We can just use commodity servers.
Adaptive and Flexible: Hadoop is built keeping in mind that it will handle structured as well as
unstructured data.
Highly Available and Fault Tolerant: When a node fails, the Hadoop framework
automatically fails over to another node.


MapReduce
What is MapReduce?
MapReduce was designed by Google as a programming model for processing large data sets with a
parallel, distributed algorithm on a cluster. Though MapReduce was originally a Google proprietary
technology, it has become quite a generalized term in recent times.
MapReduce comprises a Map() and a Reduce() procedure. The Map() procedure performs filtering
and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on the
data. This model is based on modified concepts of the map and reduce functions commonly
available in functional programming. Libraries providing the Map() and Reduce() procedures
have been written in many different languages. The most popular free implementation of
MapReduce is Apache Hadoop, discussed above.
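As a small illustration of those functional-programming roots (plain Python on a single machine, not Hadoop itself), a word count can be expressed with the built-in map function and functools.reduce:

from functools import reduce

lines = ["big data needs big tools", "big clusters process data"]

# Map phase: turn each line into a list of (word, 1) pairs.
mapped = map(lambda line: [(w, 1) for w in line.split()], lines)

# Reduce phase: merge all the pairs into a single word -> count dictionary.
def merge(counts, pairs):
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

print(reduce(merge, mapped, {}))  # {'big': 3, 'data': 2, 'needs': 1, ...}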

Advantages of MapReduce Procedures


The MapReduce framework usually consists of distributed servers and runs various tasks in
parallel. Various components manage the communication between the different nodes holding
the data and provide high availability and fault tolerance. Programs written in
the MapReduce functional style are automatically parallelized and executed on commodity machines.
The MapReduce framework takes care of the details of partitioning the data and executing the
processes on distributed servers at run time. If a disaster occurs during this process, the
framework provides high availability and the remaining available nodes take over the responsibility of
the failed node. As you can clearly see, the MapReduce framework provides much more
than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well.
A typical implementation of the MapReduce framework processes many petabytes of data on
thousands of processing machines.


How Does the MapReduce Framework Work?


A typical MapReduce deployment contains petabytes of data and thousands of nodes. Here is
a basic explanation of the MapReduce procedures that use this massive pool of commodity
servers.
Map() Procedure
There is always a master node in this infrastructure which takes an input. Right after taking the input,
the master node divides it into smaller sub-inputs, or sub-problems. These sub-problems are
distributed to worker nodes. A worker node then processes them and performs the necessary analysis.
Once the worker node completes its work on a sub-problem, it returns the result to the master
node.
Reduce() Procedure
All the worker nodes return the answers to the sub-problems assigned to them to the master node. The
master node collects the answers and aggregates them into the answer to the
original big problem that was assigned to it.
The framework runs the Map() and Reduce() procedures in parallel and independently of
each other. All the Map() procedures can run in parallel, and once the worker nodes have
completed their tasks they send the results back to the master node, which compiles them into a single
answer. This procedure can be very effective when it is applied to a very large amount of data
(Big Data).
This framework has five different steps:

Preparing Map() Input
Executing User-Provided Map() Code
Shuffling Map Output to the Reduce Processor
Executing User-Provided Reduce() Code
Producing the Final Output

Here is the dataflow of the MapReduce framework (a worked sketch follows the list):

Input Reader
Map Function
Partition Function
Compare Function
Reduce Function
Output Writer
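The following is a compact, single-process sketch of that dataflow on a toy data set of year,temperature records; the function and variable names are illustrative and not part of any Hadoop API. It walks through the input reader, map, partition, compare (sort) and reduce steps to find the maximum temperature per year.

from collections import defaultdict

# Input reader: raw lines, as they might arrive from a file split.
raw_records = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]

# Map function: parse each line and emit a (key, value) pair.
def map_fn(line):
    year, temp = line.split(",")
    return year, int(temp)

mapped = [map_fn(line) for line in raw_records]

# Partition function: decide which reducer handles each key.
NUM_REDUCERS = 2
partitions = defaultdict(list)
for key, value in mapped:
    partitions[hash(key) % NUM_REDUCERS].append((key, value))

# Reduce function: summarize all the values that share a key.
def reduce_fn(key, values):
    return key, max(values)

# Compare (sort), group and reduce within each partition, then write the output.
output = []
for part in partitions.values():
    part.sort()  # the compare step brings equal keys together
    grouped = defaultdict(list)
    for key, value in part:
        grouped[key].append(value)
    for key, values in grouped.items():
        output.append(reduce_fn(key, values))

print(sorted(output))  # output writer: [('1949', 111), ('1950', 22)]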


HDFS (Hadoop Distributed File System)


What is HDFS ?
HDFS stands for Hadoop Distributed File System and it is the primary storage system used by
Hadoop. It provides high-performance access to data across Hadoop clusters. It is usually deployed
on low-cost commodity hardware, and in commodity hardware deployments server failures are very
common. For that reason HDFS is built for high fault tolerance. The data transfer
rate between nodes in HDFS is very high, which reduces the impact of such failures.
HDFS breaks big data into smaller pieces and distributes them across different nodes. It also copies each
smaller piece multiple times onto different nodes. Hence, when any node holding the data crashes, the
system can automatically use the data from a different node and continue the process. This is
the key feature of the HDFS system.
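Here is a minimal sketch of that split-and-replicate idea in plain Python; the block size, node names and replication factor are illustrative choices, not HDFS defaults or APIs.

import itertools

BLOCK_SIZE = 16          # bytes, kept tiny for illustration (real HDFS blocks are much larger)
REPLICATION = 3
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the file content into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    # Assign each block to `replication` distinct nodes, round-robin style.
    placement = {}
    ring = itertools.cycle(range(len(nodes)))
    for idx, _ in enumerate(blocks):
        start = next(ring)
        placement[idx] = [nodes[(start + r) % len(nodes)] for r in range(replication)]
    return placement

file_bytes = b"a big file that will not fit on a single commodity disk"
blocks = split_into_blocks(file_bytes)
print(place_blocks(blocks, DATA_NODES))
# e.g. {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'], ...}

If any one of the three nodes holding a block fails, two replicas remain, so the file is still fully readable while a new copy is created elsewhere.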
Architecture of HDFS
HDFS has a master/slave architecture. An HDFS cluster always consists of a
single NameNode. This single NameNode is the master server; it manages the file system and
regulates access to the various files. In addition to the NameNode there are multiple DataNodes,
usually one DataNode per data server. In HDFS a big file is split into one or more blocks and
those blocks are stored in a set of DataNodes.
The primary tasks of the NameNode are to open, close or rename files and directories and to regulate
access to the file system, whereas the primary task of a DataNode is to read from and write to the file
system. A DataNode is also responsible for the creation, deletion or replication of data blocks based on
instructions from the NameNode.
In reality, the NameNode and DataNode are pieces of software, written in Java, designed to run on
commodity machines.


Let us understand how HDFS works with the help of the diagram. A client application (HDFS Client)
connects to the NameNode as well as to the DataNodes. The client application's access to the DataNodes is
regulated by the NameNode, which grants access by allowing the connection to the
DataNode directly. A big data file is divided into multiple data blocks
(let us assume those data blocks are A, B, C and D). The client application later writes data blocks
directly to a DataNode. It does not have to write to all the nodes; it just has to
write to one of them, and the NameNode decides on which other DataNodes the data has to be
replicated. In our example the client application writes directly to Data Node 1 and Data Node 3.
The data blocks are then automatically replicated to the other nodes, and all the information about
which data block is placed on which DataNode is written back to the NameNode.
High Availability During Disaster
Now, since multiple DataNodes hold the same data blocks, if any DataNode faces a
disaster the entire process continues, because another DataNode assumes the role of serving the
specific data blocks that were on the failed node. This system provides very high tolerance to
disaster and provides high availability.
Notice that there is only a single NameNode in our architecture. If that node fails, our entire Hadoop
application stops working, as it is the single node where all the metadata is stored. Because this node
is so critical, it is usually replicated onto another cluster as well as onto another data rack. Though
that replicated node is not operational in the architecture, it has all the data necessary to take over
the tasks of the NameNode in case the NameNode fails.
The entire Hadoop architecture is built to function smoothly even when there are node failures or
hardware malfunctions. It is built on the simple premise that the data is so big that it is impossible
to come up with a single piece of hardware which can manage it properly. We need lots of
commodity (cheap) hardware to manage our big data, and hardware failure is part of running
commodity servers. To reduce the impact of hardware failure, the Hadoop architecture is built to
overcome the limitation of non-functioning hardware.


Importance of Relational Database in Big Data World


NoSQL Movement
The reason for the NoSQL movement in recent times has been two important advantages of NoSQL
databases:
1. Performance
2. Flexible schema
Situations in Which Relational Databases Outperform
Ad-hoc reporting is one of the most common scenarios for which NoSQL does not have an optimal
solution. For example, reporting queries often need to aggregate on columns which are
not indexed and which are chosen only while the report is being built; in this kind of scenario NoSQL databases
(document stores, distributed key-value stores) often do not perform well. In
the case of ad-hoc reporting I have often found it much easier to work with relational
databases.
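As a small illustration of the kind of ad-hoc aggregation a relational engine handles naturally, here is a sketch using Python's built-in sqlite3 module and a made-up orders table:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("east", "widget", 120.0), ("east", "gadget", 80.0),
     ("west", "widget", 200.0), ("west", "widget", 45.0)],
)

# An ad-hoc question decided while the report is being built:
# total widget sales per region, with no pre-built index required.
for row in conn.execute(
    "SELECT region, SUM(amount) FROM orders "
    "WHERE product = 'widget' GROUP BY region"
):
    print(row)  # ('east', 120.0) then ('west', 245.0)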
SQL is one of the most popular computer languages of all time. I have been using it for over 10
years and I feel that I will be using it for a long time to come. There are plenty of tools,
connectors and general awareness of the SQL language in the industry. Pretty much every programming
language has drivers written for SQL, and most developers learned the
language during their school or college days. In many cases, writing a query in SQL is much
easier than writing it in the query languages NoSQL systems support. I believe this is the current situation, but
in the future the situation may reverse once NoSQL query languages become equally popular.
ACID (Atomicity, Consistency, Isolation, Durability): Not all NoSQL solutions offer ACID
compliance. There are always situations (for example banking transactions,
e-commerce shopping carts, etc.) where, without ACID, operations can become invalid and
database integrity can be put at risk. Even when the data volume genuinely qualifies as Big Data, there are
always operations in the application which absolutely need an ACID-compliant, mature language.
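A brief sketch of atomicity using Python's built-in sqlite3 module (the account names and balances are invented for illustration): either both sides of the transfer commit, or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 25.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    # Move money atomically: any failure rolls back both updates.
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except ValueError:
        pass  # the transaction was rolled back; balances are unchanged

transfer(conn, "alice", "bob", 500.0)  # rejected: alice would go negative
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100.0), ('bob', 25.0)]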


NewSQL
NewSQL stands for the new generation of scalable, high-performance SQL database vendors. The products sold
by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database; it is a label for
vendors who offer emerging data products with relational database properties (such as ACID and
transactions) along with high performance. Products from NewSQL vendors usually rely on in-memory
data for speedy access and offer immediate scalability.
NewSQL is our shorthand for the various new scalable/high performance SQL database vendors.
We have previously referred to these products as ScalableSQL to differentiate them from the
incumbent relational database products. Since this implies horizontal scalability, which is not
necessarily a feature of all the products, we adopted the term NewSQL in the new report. And to
clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about
the NewSQL vendors is the vendor, not the SQL.
In other words, NewSQL incorporates the concepts and principles of Structured Query Language
(SQL) and NoSQL. It combines the reliability of SQL with the speed and performance of
NoSQL.
Categories of NewSQL
There are three major categories of NewSQL:
New Architecture: In this framework each node owns a subset of the data, and queries are split into
smaller queries that are sent to the nodes to process the data. E.g. NuoDB, Clustrix, VoltDB
MySQL Engines: Highly optimized storage engines for SQL with the interface of MySQL are the
examples of this category. E.g. InnoDB, Akiban
Transparent Sharding: These systems automatically split the database across multiple nodes; a minimal sketch of the idea follows below.
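The following sketch shows the transparent-sharding idea in plain Python; the hash-based routing and shard names are illustrative, and real sharding products add replication, rebalancing and query routing on top.

import zlib

NODES = ["shard0", "shard1", "shard2"]

def shard_for(key, nodes=NODES):
    # Route a row to a shard based on a stable hash of its key.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

# The application issues ordinary writes; the routing layer decides placement.
rows = [("user:1001", {"name": "alice"}),
        ("user:1002", {"name": "bob"}),
        ("user:1003", {"name": "carol"})]

placement = {}
for key, value in rows:
    placement.setdefault(shard_for(key), []).append((key, value))

print(placement)  # each shard holds only its own subset of the data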

