Literature Review On Big Data

Tanvi Ahlawat and Dr. Radha Krishna Rambola


School Of Computing Science and Engineering
Galgotias University, Greater Noida

Abstract:

Big Data is a concept that is currently reshaping computing and business. This paper discusses the
fundamentals of Big Data and the tools and techniques associated with it. We categorize the elements of
Big Data into a model and derive the Big Data Ecosystem from it. The V model for Big Data is defined as
3V, 4V or 5V, depending on the organization that uses it and the business scenario. In line with these
models, we classify data into its various forms and explain each form to give a better understanding of
it.

Keywords: Big Data, Eco System, Hadoop, Map Reduce.

Introduction:

The Web, and the Internet that carries it, have been overshadowed in recent years by a data explosion
driven by the interactions between people and systems at multiple touch points. The data generated at all
of these touch points, taken as a whole, is known as Big Data. A few decades ago, kilobytes and megabytes
were enough to describe nearly all the data in existence. Continuous interaction between people and
systems has since led to exponential growth in data, and terms such as gigabytes, terabytes, petabytes,
exabytes and zettabytes have entered the computing vocabulary. Theorists and researchers have argued
that, just as Moore's law describes the growth in the number of transistors in a circuit, data on the
Internet will come to exceed the combined brain capacity of all living species. The main drivers of this
growth are the continuing advances in digital sensors, computation, communications and storage, which
together have created enormous collections of data.

As explained above, data is generated by many sources and is used by organizations to understand and run
their business. When analyzed with suitable methods, this data helps organizations study customer
behavior, interpret market trends and take strategic and financial decisions.

In defining the term, it is easy to overlook that Big Data consists of datasets that cannot be managed
efficiently by common database management systems, typically Relational Database Management Systems
(RDBMS), and that these datasets often range from petabytes to exabytes to zettabytes.

Tanvi, Dr. Radha Krishna Page 21


This data is created through many interactions, across cellular phones, credit cards, social networking
platforms and RFID (Radio Frequency Identification) devices. Not all of it is used, and much of it
resides on unknown servers in unstructured and unutilized form for many years. With the evolution of Big
Data, however, the same data can be accessed and analyzed to generate useful insights.

According to the large corporations at the forefront of Information Technology, we create quintillions
of bytes of data every day, and 90% of the data in the world today has been created in the last two
years. There is no single source of this data. It comes from many sources, including cell phones and
their GPS signals, digital pictures and videos, transactional records of purchases and sales, social
media activity, and sensors used to gather climate information. Data is now everywhere and in every
possible format: numbers, images, videos and text. It has grown at an exponential pace, but this
enormous collection brings critical challenges, which can be grouped as transfer speed, data diversity,
security and rapid data growth.

What Is Big Data:

Big Data is often treated as jargon or a catch phrase for an exponentially growing volume of structured
and unstructured data, made up of datasets so large that they cannot be processed by traditional database
management techniques and software. Because of the sheer amount of data it encompasses, Big Data has the
potential to help companies make better, more intelligent, data-driven decisions and improve their
operations. In most organizational scenarios it is easy to recognize: the data exceeds the current
storage and processing capacity, its volume is too large, or it moves too fast. To yield insights that
bring competitive advantage, higher revenues and better customer retention, the data has to be captured,
cleaned, formatted, manipulated, stored and analyzed.

Big Data is a concept, and a concept can be interpreted in various ways, so the literature offers
multiple definitions:

Big Data is the amount of data beyond the ability of technology to store, manage and process efficiently
(Manyika et al., 2011).

Big Data is a term that describes high-speed, high-volume, complex and multivariate data, together with
the advanced technology needed to capture, store, distribute, manage and analyze the information
(TechAmerica Foundation, 2014).

Big data is high volume, high velocity, and/or high variety information assets that require new forms of
processing to enable enhanced decision making, insight discovery and process optimization (Gartner,
2014; Gürsakal, 2014).

Big Data technologies are new-generation technologies and architectures designed to extract value
efficiently from multivariate, high-volume data sets by providing high-speed capture, discovery and
analysis (Gantz and Reinsel, 2011).

Hashem et al. define Big Data by combining various definitions in the literature as follows: the cluster
of methods and technologies in which new forms are integrated to unfold hidden values in diverse, complex
and high-volume data sets (Hashem et al., 2015).

According to these definitions, Big Data is complex and growing in variety as well as in size. Size alone
is enough to show that conventional methods are not suitable for analyzing big data sets and that new
methods and technologies are needed. These points should be kept in mind when undertaking any analysis
of Big Data.

Data Forms:

Structured:
Structured data is data whose structure is enforced by the system that stores it. When a data warehouse
is placed in a relational database management system, the structure of the RDBMS is imposed on the data,
and that structure carries the meaning of the data. We therefore know which columns are stored where,
what they are associated with, and how columns relate to each other across tables and table spaces.
Values may be textual or numerical, but each attribute (a person's age, for example) has one well-defined
meaning and format.

The data is organized in terms of:

Entities (semantic chunks).
Relations or classes (similar entities are grouped together).
Attributes (entities in the same group share the same descriptions).
Schema (all entities in a group follow an associated description):
o All attributes are present and follow the same order.
o All attributes have the same defined format and length.
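
The properties listed above can be illustrated with a minimal sketch. It assumes a hypothetical
"person" table with invented columns and rows, and uses Python's built-in sqlite3 module purely as a
convenient stand-in for an RDBMS; it is not taken from any system discussed in this paper.

```python
import sqlite3

# In-memory relational table: the schema fixes the name, order and type of every attribute.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " person_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " age INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO person (person_id, name, age) VALUES (?, ?, ?)",
    [(1, "Asha", 34), (2, "Ravi", 41)],  # hypothetical rows
)

# Because the structure is known in advance, a declarative query can address columns by name.
rows = conn.execute("SELECT name, age FROM person WHERE age > 35").fetchall()
print(rows)

# A record that violates the schema (here, a missing required age) is rejected.
try:
    conn.execute("INSERT INTO person (person_id, name) VALUES (3, 'Meera')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The point of the sketch is that the schema, not the application, guarantees that every entity in the
group has the same attributes in the same format.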

Semi Structured:
The boundary between structured and semi-structured data is often blurred. Semi-structured data does not
conform to an explicit, fixed schema. It does, however, carry tags, and when those tags reflect an
organizational structure the data becomes easier to organize and analyze. The concept predates XML but
not HTML.

Data is available electronically in many formats:
o Database systems
o File systems, e.g. bibliographic data, Web data
o Data exchange formats, e.g. EDI, scientific data

Semi-structured data is not completely structured, but partially so:
o Similar entities are grouped together and semantically organized.
o Entities in the same group may not have the same attributes.
o The order of attributes is not important, and not all attributes are required.
o Within a group, the size and type of the same attribute may differ.
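
These properties can be seen in a small sketch. It assumes a hypothetical group of "book" records
encoded as JSON (the titles, authors and field names are invented for illustration) and uses only
Python's standard json module.

```python
import json

# Hypothetical records in one group: attributes, their order and their types all differ.
raw = """
[
  {"title": "Data Basics", "author": "A. Rao", "year": 2014},
  {"year": "2015", "title": "Streams"},
  {"title": "Graphs", "author": ["B. Sen", "C. Das"], "pages": 210}
]
"""
books = json.loads(raw)

# The tags (keys) make the data self-describing, so the attributes in use can be discovered.
all_keys = sorted({key for book in books for key in book})
print(all_keys)

# Missing attributes must be handled explicitly because no schema guarantees them.
authors = [book.get("author", "unknown") for book in books]
print(authors)
```

Unlike the structured case, nothing rejects the second record for omitting an author or for storing the
year as text. The tags make analysis possible, but the consumer has to cope with the variation.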

Unstructured:
Having discussed structured and semi-structured data, we now turn to unstructured data, which consists
of formats that cannot be easily indexed.

Indexing here refers to mapping data into relational tables for the purpose of querying or analysis.
Unstructured data includes audio, video and image files.

Data can be of any type.
No format; no sequence.
No rules in the data.
Unpredictability is spread across the data.
Examples: audio, images, video.

Importance of Big Data:

The importance of Big Data is measured by how effectively it helps organizations improve their most
important KPIs: not by the quantity of data an organization holds, but by the insights that data helps
to generate. Data is taken from multiple sources and integrated across various environments. When
analyzed, it can help achieve the following:
1. Time & Cost reductions.
2. Customized & Optimized Market Offerings & New Product Development.
3. Strategy Development & Smart Decision Making.

In a business environment, many decisions are taken on the basis of data and the associated analytics.
When Big Data is combined with high-powered analytics, many business-related tasks can be accomplished,
such as:

1. Conducting root cause analysis of defects, failures and issues in real time.
2. Generating coupons at the point of sale (POS) based on consumer behavior.
3. Calculating or recalculating entire risk portfolios in minutes.
4. Detecting fraud through fraud analytics before it affects the organization.

Big Data Characteristics:

As mentioned before, Big Data is a concept that can be defined through a model, in our case the 3V
model. Its definition was cast by Laney: "high-volume, high-velocity and high-variety information assets
that demand cost-effective, innovative forms of information processing for enhanced insight and decision
making."

In 2012, Gartner restated the definition of Big Data as follows: "Big data is high volume, high
velocity, and/or high variety information assets that require new forms of processing to enable enhanced
decision making, insight discovery and process optimization."

Both definitions incorporate three main features: Volume, Variety and Velocity. Depending on the
organization, its business model and its big data practitioners, the 3V model can be extended to 4V
(adding Value) or 5V (adding Veracity).

In summary, these models provide a straightforward and widely accepted definition of what a Big Data
application, solution, problem or framework involves.

Volume: This refers to the very large amount of data arriving from multiple sources. It can include any
kind of data, including data created by connected devices, IoT and mobile data, and the data that
results from their communication.

Data is now generated in sizes that were once unheard of: petabytes, exabytes and zettabytes. It will
eventually reach sizes for which names have yet to be coined. Data is being generated at a rate that is
hard to comprehend, and organizations are still trying to keep up with that pace.

As a consequence, it has become normal for companies to store enormous and varied amounts of data:
financial, biochemical, electronic, computer records, genetic, social network and healthcare data.

Having this much data at their disposal is a challenge in itself for companies. The benefit is that
valuable information about people and companies can be obtained from it.

Velocity: When we transfer a movie, we rarely worry about velocity, since a movie of approximately
1 gigabyte takes about a minute to transfer. Data of the size of petabytes or exabytes, however, takes a
very long time to transfer, so velocity becomes an important factor that also affects performance.
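
The scale of the difference can be worked out with simple arithmetic. The sketch below assumes the rate
implied by the movie example (about 1 gigabyte per minute), decimal units, and a constant transfer rate
with no overhead; the function name is our own and the figures are illustrative only.

```python
# Assumption taken from the text: about 1 gigabyte transfers in about one minute.
GB = 10**9  # decimal gigabyte, in bytes
rate_bytes_per_second = GB / 60


def transfer_days(size_bytes: float, rate: float = rate_bytes_per_second) -> float:
    """Return the transfer time in days at a constant rate."""
    return size_bytes / rate / 86_400  # 86,400 seconds per day


petabyte_days = transfer_days(10**15)
exabyte_years = transfer_days(10**18) / 365

print(f"1 PB: about {petabyte_days:.0f} days")
print(f"1 EB: about {exabyte_years:.0f} years")
```

At this rate a single petabyte would take roughly 694 days to move, which is why velocity cannot be
treated as an afterthought at Big Data scale.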

The contents of the data are constantly changing, through the introduction of previous or legacy
collections, the absorption of complementary data collections, and data streamed from multiple sources.
Velocity covers not only the speed at which data is transferred, but also data streams, the creation of
structured records, and access to and delivery of data. The issue lies not only with the velocity of
incoming data but also with streaming outgoing data for batch processing.

Variety: This refers to the varied types of data, which can be accumulated from sources such as social
networks, smartphones and sensors, in the form of videos, images, audio, logs and so on. This data can
be highly structured (data fetched from traditional database systems), semi-structured (social feeds,
RSS feeds, raw feeds, web logs) or unstructured (clicks, audio, images, videos).

Value: This refers to the critical and valuable information extracted from big datasets. The process of
extracting it is called Big Data Analytics. In the 4V model, Value is the most critical factor for any
application based on Big Data, because it is what allows useful business information to be generated.

Until recently, large volumes of data were recorded to satisfy regulations but were never analyzed or
exploited. Value is therefore highly subjective in nature. Big Data brings with it technologies that
enable people and organizations to exploit data in ways that were not possible before.

Veracity: This refers to the accuracy and correctness of the data on which the analysis is conducted.
Uncertainty can arise for the simplest of reasons, such as data inconsistency, ambiguity, duplication,
incompleteness, deception, fraud, approximated models, spam and latency. Analysis on top of big data
will not necessarily give a perfectly conclusive result. However, every result can be assigned a
probability.

Big data - Pipeline Analysis:

Processing Pipeline Phases:

Data Acquisition and Recording: Big Data does not appear out of thin air. It is recorded from multiple
sources. Because so much data is accumulated, part of it will be of no use, and filtering and
compression techniques can be used to reduce it. These filters and compressors must be defined carefully
so that they do not discard important information.
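
A minimal sketch of this step is given below. It assumes a hypothetical list of sensor readings in which
None marks a failed measurement, and uses zlib from the Python standard library as one example of
lossless compression; real acquisition pipelines will use filters and codecs suited to their own data.

```python
import zlib

# Hypothetical sensor readings; None marks a failed measurement.
readings = [21.5, 21.5, None, 21.6, 21.6, 21.6, None, 35.0, 21.7]

# Filtering: drop unusable records, but keep the unusual 35.0 reading,
# since an outlier may be exactly the information the analysis needs.
usable = [value for value in readings if value is not None]

# Lossless compression: the original bytes can be recovered exactly.
payload = ",".join(str(value) for value in usable).encode("utf-8")
compressed = zlib.compress(payload)
restored = zlib.decompress(compressed)

print(len(usable), restored == payload)
```

The design choice that matters here is the filter rule: removing only records that are known to be
unusable, and compressing without loss, avoids silently discarding information.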

The other challenge is generating the right metadata to describe how the data was recorded and measured.
Recording information about the data only at its origin is of limited use on its own, since the data
keeps changing and is interpreted in different ways as it is carried through the stages of analysis.

Information Extraction and Cleaning: Simply collecting data does not produce analysis or insight,
because data in its collected form is usually not in a format that can be analyzed. An information
extraction process is needed that pulls out the required information and presents it in a structured
form from which insights can be generated.
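
As an illustration, the sketch below extracts structured records from raw text. The log layout, field
names and sample lines are hypothetical, and a regular expression is only one of many possible
extraction methods.

```python
import re

# Hypothetical raw log lines; the third line does not match the expected layout.
raw_lines = [
    "2015-03-01 10:15:02 user=asha action=login",
    "2015-03-01 10:17:45 user=ravi action=purchase",
    "corrupted entry ###",
]

pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)"
)

records, rejected = [], []
for line in raw_lines:
    match = pattern.fullmatch(line)
    if match:
        records.append(match.groupdict())  # structured record with named fields
    else:
        rejected.append(line)  # set aside for the cleaning step

print(records)
print(rejected)
```

Lines that do not fit the expected layout are kept in a separate list rather than dropped, so that the
cleaning step can inspect them.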

Carrying out this process repeatedly and continuously, with correctness as the top priority, is an
ongoing technical challenge.

Data cleaning is one of the primary focus areas. It assumes either well-documented and well-recognized
constraints on valid data, or error models that are well understood for the data in question. In most
areas where Big Data is newly emerging, such models do not exist.

Data Integration, Aggregation, and Representation: This phase concerns data analysis, which is not as
simple as the term suggests. It involves more than locating, identifying, understanding and citing data.
Analysis on a large scale has to happen in a fully automated fashion.

At present, database design is an art rather than a science. If it is to be treated as a science, it has
to be developed within a context, whether enterprise or cloud.

Database design is currently carried out by highly paid professionals. Other professionals, such as
domain scientists, also need to be enabled to develop effective database designs. This can be achieved in
several ways: by devising tools that assist the design process, by revamping the design process itself,
and by developing new techniques, all aimed at producing intelligible and effective database designs.

Query Processing, Data Modeling, and Analysis: Big Data differs fundamentally from the data held in
traditional database management systems, so the methods for querying and mining Big Data differ from the
traditional statistical techniques applied to small samples.

Big Data is often dynamic, inter-related, untrustworthy and noisy. Mining, by contrast, requires clean,
trustworthy, integrated and efficiently accessible data, reached through mining interfaces that use
declarative queries, together with scalable mining algorithms and big data computing environments.

At the same time, data mining itself can be used to improve the trustworthiness and quality of the data,
to understand its semantics, and to provide insightful and intelligible querying functions. One major
problem is the lack of coordination between the database systems that host big datasets: some systems
host the data, some handle SQL querying, and others perform the data mining and statistical analysis.

Interpretation: Analysis is of limited value if the user cannot understand it. Even after the analysis is
done and all the reports and graphs are generated, someone still has to interpret them. This
interpretation cannot happen in a vacuum: the person reading the reports and graphs has to take into
account the assumptions that were made in producing the analysis and retrace its steps.

Big Data Ecosystem:

Big Data is not only a database or Hadoop problem. At its core it involves technologies and components
for large-scale data processing and data analytics.

The Big Data Ecosystem (BDE) is the complete set of components that store, process, visualize and deliver
results to target applications, with Big Data acting as the fuel for every data-related process and its
sources, targets and outcomes. The BDE comprises the associations and intertwined relationships between
these components, and includes the data, the supporting infrastructure and the models used during the
entire Big Data lifecycle.

Techniques and Technologies:

This paper does not give an in-depth treatment of the tools and techniques associated with Big Data, but
it does give an overview of them. This will help the reader become familiar with the tools used for Big
Data analytics.

Techniques:

Many techniques can be used when starting a Big Data project. Some of the most frequently used
techniques are summarized here.

Association rule learning: A set of techniques for discovering interesting relationships, i.e., association
rules, among variables in large databases.
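
A minimal sketch of the idea follows. It assumes a handful of hypothetical market-basket transactions and
scores candidate rules with the two standard measures, support and confidence. The thresholds (0.4 and
0.7) are arbitrary choices for illustration, and the brute-force enumeration over item pairs is used for
clarity rather than efficiency.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]


def support(itemset: frozenset) -> float:
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Enumerate rules lhs -> rhs over item pairs and keep those above both thresholds.
items = sorted(set().union(*transactions))
rules = {}
for a, b in combinations(items, 2):
    for lhs, rhs in ((a, b), (b, a)):
        pair_support = support(frozenset({lhs, rhs}))
        confidence = pair_support / support(frozenset({lhs}))
        if pair_support >= 0.4 and confidence >= 0.7:
            rules[(lhs, rhs)] = (pair_support, confidence)

for (lhs, rhs), (s, c) in rules.items():
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```

In this toy data every basket containing butter also contains bread, so the rule "butter -> bread" has a
confidence of 1.0, while rules between butter and milk fall below the confidence threshold and are
discarded.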

Data mining: One of the most important terms in data-driven decision making. It describes searching or
digging into data for information that helps to better understand a particular phenomenon.

Cluster analysis: Cluster analysis is a type of data mining that divides a large group into smaller groups of
similar objects whose characteristics of similarity are not known in advance.
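
The paper does not name a specific clustering algorithm, so the sketch below uses k-means as one
well-known example. The points are hypothetical, and the starting centroids are supplied by hand to keep
the run deterministic; practical implementations choose them more carefully.

```python
# Hypothetical 2-D points forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]


def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def mean(cluster):
    return (
        sum(p[0] for p in cluster) / len(cluster),
        sum(p[1] for p in cluster) / len(cluster),
    )


def k_means(data, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in data:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids


clusters, centroids = k_means(points, centroids=[points[0], points[3]])
print(clusters)
print(centroids)
```

No group labels are supplied: the two groups emerge from the distances between the points alone, which
is the sense in which the similarity characteristics are not known in advance.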

Crowd sourcing: Crowd sourcing collects data from a large group of people through an open call, usually
via a Web 2.0 tool. This technique is used more for collecting data than for analyzing it.

Machine learning: Traditionally computers only know what we tell them, but in machine learning, a
subspecialty of computer science, we try to craft algorithms that allow computers to evolve based on
empirical data.

Text analytics: A large portion of generated data is in text form. Text Analytics is the process of
converting unstructured text data into meaningful data.

Technology:

As with the analytical techniques, there are several software products and available technologies to
facilitate big data analytics. Some of the most common will be discussed here.

EDWs: Enterprise data warehouses are databases used in data analysis.

Visualization products: One of the difficulties with big data analytics is finding ways to represent
results visually. Many new visualization products aim to fill this need, devising methods for
representing data points numbering in the millions. Beyond simple representation, visualization can also
help in the search for information.

MapReduce: MapReduce is a processing technique and a programming model for distributed computing. Its
widely used Hadoop implementation is written in Java. The MapReduce algorithm contains two important
tasks, namely Map and Reduce. The Map task takes a set of data and converts it into another set of data,
in which individual elements are broken down into tuples (key/value pairs). The Reduce task takes the
output of a Map task as its input and combines those data tuples into a smaller set of tuples.
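
The two tasks can be sketched in a few lines using the classic word-count example. This is a
single-process illustration in plain Python, not the Hadoop API: the function names are our own, and the
grouping step that a real framework performs across the cluster is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: break the input into (key, value) tuples, here (word, 1)."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into a smaller set of tuples."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data moves fast"]  # hypothetical input splits
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each call to map_phase depends only on its own document, and each key is reduced independently,
both tasks can in principle be spread across many machines, which is what makes the model suitable for
distributed computing.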

Hadoop: It is an open-source framework that allows big data to be stored and processed in a distributed
environment across clusters of computers using simple programming models. Hadoop is an Apache-managed
software framework inspired by Google's MapReduce and Google File System designs.

NoSQL databases: NoSQL, also read as Not Only SQL, is an approach to data management and database design
that is useful for very large sets of distributed data. NoSQL is especially useful when an enterprise
needs to access and analyze massive amounts of unstructured data or data that is stored remotely on
multiple virtual servers in the cloud. One popular NoSQL database is Apache Cassandra. Other NoSQL
implementations include SimpleDB, Google BigTable, MemcacheDB and Voldemort.

Usage Areas of Big Data:

Big data is used efficiently in numerous fields. Some of them are listed below:
1) Automotive industry
2) High technology and industry
3) Oil and gas
4) Telecommunication sector
5) Medical field
6) Retail industry
7) Packaged consumer products
8) Media and show business
9) Travel and transport sector
10) Financial services
11) Social media and online services
12) Public services
13) Education and research
14) Health services
15) Law enforcement and defense industry

Conclusion:

Writing this paper has given us a better understanding of the concept of Big Data. We have defined
models and an ecosystem, and categorized the elements of Big Data on that basis. We have also identified
the tools and techniques most frequently associated with Big Data Analytics and, on the basis of this
analysis, the areas in which Big Data Analytics is most used.

Reference:

1) Sreedhar C, Dr. D. Kavitha, K. Asha Rani, "Big Data and Hadoop", International Journal of
Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3, Issue 5, May 2014.
2) Puneet Singh Duggal, Sanchita Paul, Department of Computer Science & Engineering, Birla
Institute of Technology, Mesra, Ranchi, India, "Big Data Analysis: Challenges and Solutions",
International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV.
3) Subramaniyaswamy V, Vijayakumar V, Logesh R and Indragandhi V, "Unstructured Data
Analysis on Big Data using Map Reduce", 2nd International Symposium on Big Data and Cloud
Computing (ISBCC'15).
4) Ms. Vibhavari Chavan, Prof. Rajesh. N. Phursule, JSPM's Imperial College of Engineering and
Research, Pune, (IJCSIT) International Journal of Computer Science and Information
Technologies, Vol. 5 (6), 2014, 7932-7939.
5) Hakan Özköse, Emin Sertac Ari, Cevriye Gencer, Gazi University, Maltepe, Ankara, Turkey,
"Yesterday, Today and Tomorrow of Big Data", World Conference on Technology, Innovation and
Entrepreneurship.
6) Ishwarappa, Anuradha J, "A Brief Introduction on Big Data 5Vs Characteristics and Hadoop
Technology", International Conference on Intelligent Computing, Communication &
Convergence (ICCC-2015).
7) Edd Dumbill, Big Data basics from O'Reilly:
http://strata.oreilly.com/2012/01/what-is-big-data.html
8) "Big Data: Survey, Technologies, Opportunities, and Challenges",
http://www.hindawi.com/journals/tswj/2014/712826/
9) "Challenges and Opportunities with Big Data", a community white paper developed by leading
researchers across the United States.
10) Sameera Siddiqui, Deepa Gupta, "Big Data Process Analytics: A Survey", International Journal
of Emerging Research in Management & Technology, ISSN: 2278-9359, Volume 3, Issue 7, July
2014.
