
BIG DATA

Dinesh Kumar Tiwari, Assistant Professor, CSE, FGIET, Raebareli, India (dkt10dec@gmail.com)
Renuka Pandey, CSE, FGIET, Raebareli, India (renukapandey96@gmail.com)
Uma Shree, CSE, FGIET, Raebareli, India (vidushi774@gmail.com)

Abstract – In this paper we introduce BIG DATA, a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Challenges include capture, storage, analysis, data curation, search, sharing, transfer, visualization, querying, updating and information privacy.
Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on". Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data sets in areas including Internet search, finance, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.
Relational database management systems and desktop statistics and visualization packages often have difficulty handling big data. The work may require "massively parallel software running on tens, hundreds, or even thousands of servers". What counts as "big data" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. "For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration."

1. Introduction :
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. The term has been in use since the 1990s, with some giving credit to John Mashey for coining it, or at least popularizing it. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data. Big data requires a set of techniques and technologies with new forms of integration to reveal insights from data sets that are diverse, complex, and of a massive scale.
In a 2001 research report and related lectures, META Group (now Gartner) defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Gartner's definition of the 3Vs is still widely used, and is in agreement with a consensual definition stating that "Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value". Additionally, some organizations add a new V, "Veracity", to describe it, an extension challenged by some industry authorities. The 3Vs have been expanded to other complementary characteristics of big data:

• Volume: big data doesn't sample; it just observes and tracks what happens
• Velocity: big data is often available in real-time
• Variety: big data draws from text, images, audio, video; plus it completes missing pieces through data fusion

2. TECHNOLOGY FOR BIG DATA :


Below are some of the top emerging technologies that can help users cope with and handle Big Data in a cost-effective manner.
Column-oriented databases
Traditional, row-oriented databases are excellent for online transaction processing
with high update speeds, but they fall short on query performance as the data
volumes grow and as data becomes more unstructured. Column-oriented databases
store data with a focus on columns, instead of rows, allowing for huge data
compression and very fast query times. The downside to these databases is that they generally allow only batch updates and have much slower update times than traditional models.
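To make the row-versus-column trade-off concrete, the minimal sketch below (plain Python; the field names and data are made up for illustration, not taken from any particular database engine) contrasts a row-oriented layout with a column-oriented one: an aggregate over a single column only has to touch that column's array in the columnar layout.

```python
# Minimal sketch: row store vs. column store (illustrative only).

# Row-oriented layout: each record is stored together.
row_store = [
    {"id": 1, "region": "north", "sales": 120.0},
    {"id": 2, "region": "south", "sales": 300.0},
    {"id": 3, "region": "north", "sales": 180.0},
]

# Column-oriented layout: each column is stored (and compressed) together.
column_store = {
    "id":     [1, 2, 3],
    "region": ["north", "south", "north"],
    "sales":  [120.0, 300.0, 180.0],
}

# Analytical query "total sales": the row store touches every full record,
# while the column store reads only the one column it needs.
total_from_rows = sum(rec["sales"] for rec in row_store)
total_from_cols = sum(column_store["sales"])
assert total_from_rows == total_from_cols

# Updating a single record, by contrast, means writing into every column
# list, which is why columnar systems usually prefer batch updates.
```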

Schema-less databases, or NoSQL databases


There are several database types that fit into this category, such as key-value stores
and document stores, which focus on the storage and retrieval of large volumes of
unstructured, semi-structured, or even structured data. They achieve performance
gains by doing away with some (or all) of the restrictions traditionally associated with
conventional databases, such as read-write consistency, in exchange for scalability
and distributed processing.
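As a toy illustration of the schema-less idea (not the API of any real NoSQL product), the sketch below stores documents with differing fields under simple keys and trades rich constraints for straightforward put/get access.

```python
# Toy key-value/document store (illustrative only, not a real NoSQL API).
# Documents need not share a schema, and there are no integrity constraints:
# the store just maps keys to whatever structure the application hands it.

class TinyDocumentStore:
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        """Store or overwrite a document under a key (last write wins)."""
        self._docs[key] = document

    def get(self, key, default=None):
        return self._docs.get(key, default)

    def find(self, field, value):
        """Naive full scan; real stores rely on indexes and sharding instead."""
        return [d for d in self._docs.values() if d.get(field) == value]


store = TinyDocumentStore()
store.put("user:1", {"name": "Asha", "city": "Raebareli"})
store.put("user:2", {"name": "Ravi", "interests": ["cricket", "ml"]})  # different fields
print(store.find("city", "Raebareli"))
```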

MapReduce
This is a programming paradigm that allows for massive job execution scalability
against thousands of servers or clusters of servers. Any MapReduce implementation
consists of two tasks:

 The "Map" task, where an input dataset is converted into a different set of
key/value pairs, or tuples;
 The "Reduce" task, where several of the outputs of the "Map" task are combined
to form a reduced set of tuples (hence the name).
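A minimal, single-process sketch of this paradigm is the classic word count: the map task emits (word, 1) pairs and the reduce task sums the values for each key. This is only an illustration of the two tasks in plain Python, not Hadoop's actual Java API.

```python
from collections import defaultdict

# Map task: turn each input record (a line of text) into key/value pairs.
def map_task(line):
    for word in line.split():
        yield word.lower(), 1

# Shuffle: group all values by key (done by the framework in a real cluster).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce task: combine the values for each key into a single result.
def reduce_task(key, values):
    return key, sum(values)

lines = ["big data needs big tools", "data about data"]
pairs = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'needs': 1, 'tools': 1, 'about': 1}
```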

Hadoop
Hadoop is by far the most popular implementation of MapReduce, being an entirely
open source platform for handling Big Data. It is flexible enough to be able to work
with multiple data sources, either aggregating multiple sources of data in order to do
large scale processing, or even reading data from a database in order to run
processor-intensive machine learning jobs. It has several different applications, but
one of the top use cases is for large volumes of constantly changing data, such as
location-based data from weather or traffic sensors, web-based or social media data,
or machine-to-machine transactional data.

Hive
Hive is a "SQL-like" bridge that allows conventional BI applications to run queries
against a Hadoop cluster. It was developed originally by Facebook, but has been
made open source for some time now, and it's a higher-level abstraction of the
Hadoop framework that allows anyone to make queries against data stored in a
Hadoop cluster just as if they were manipulating a conventional data store. It
amplifies the reach of Hadoop, making it more familiar for BI users.
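As an illustration of how a BI-style query can reach a Hadoop cluster through Hive, the hedged sketch below uses the third-party PyHive client; the host, port, username, table and column names are placeholders invented for the example, not values from this paper.

```python
# Hedged sketch: querying Hive from Python with the third-party PyHive client.
# Host, port, username, database, table and column names are placeholders.
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000,
                    username="analyst", database="default")
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; behind the scenes Hive translates it into
# jobs that run against data stored in the Hadoop cluster.
cursor.execute("""
    SELECT region, SUM(sales) AS total_sales
    FROM sales_events
    GROUP BY region
""")

for region, total_sales in cursor.fetchall():
    print(region, total_sales)

cursor.close()
conn.close()
```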

PIG
PIG is another bridge that tries to bring Hadoop closer to the realities of developers
and business users, similar to Hive. Unlike Hive, however, PIG consists of a "Perl-
like" language that allows for query execution over data stored on a Hadoop cluster,
instead of a "SQL-like" language. PIG was developed by Yahoo!, and, just like Hive,
has also been made fully open source.

WibiData
WibiData is a combination of web analytics with Hadoop, being built on top of
HBase, which is itself a database layer on top of Hadoop. It allows web sites to
better explore and work with their user data, enabling real-time responses to user
behavior, such as serving personalized content, recommendations and decisions.

PLATFORA
Perhaps the greatest limitation of Hadoop is that it is a very low-level implementation
of MapReduce, requiring extensive developer knowledge to operate. Between
preparing, testing and running jobs, a full cycle can take hours, eliminating the
interactivity that users enjoyed with conventional databases. PLATFORA is a
platform that turns users' queries into Hadoop jobs automatically, thus creating an
abstraction layer that anyone can exploit to simplify and organize datasets stored in
Hadoop.

Storage Technologies
As the data volumes grow, so does the need for efficient and effective storage
techniques. The main evolutions in this space are related to data compression and
storage virtualization.
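As a tiny illustration of the compression side of this, the sketch below uses Python's standard-library zlib on made-up, repetitive sensor-style records to show how much redundant machine-generated data can shrink; the record format is invented for the example.

```python
# Minimal sketch: lossless compression of repetitive, machine-generated data
# using Python's standard-library zlib. The sample records are made up.
import zlib

# Sensor-style records are highly repetitive, so they compress well.
records = b"2016-11-03T10:00:00,sensor-17,temp=21.5\n" * 10_000

compressed = zlib.compress(records, 9)  # level 9: best compression
print(len(records), "bytes raw ->", len(compressed), "bytes compressed")

# Round-trip check: decompression restores the original bytes exactly.
assert zlib.decompress(compressed) == records
```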

SkyTree
SkyTree is a high-performance machine learning and data analytics platform focused
specifically on handling Big Data. Machine learning, in turn, is an essential part of
Big Data, since the massive data volumes make manual exploration, or even conventional automated exploration methods, unfeasible or too expensive.
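SkyTree's own API is proprietary and not shown here; as a stand-in, the sketch below uses scikit-learn to illustrate the kind of automated exploration this paragraph describes, clustering a synthetic data set instead of inspecting it by hand.

```python
# Stand-in sketch (not SkyTree): automated exploration of a data set by
# clustering with scikit-learn. The data here are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 100,000 synthetic "events" with 5 numeric features each.
X = rng.normal(size=(100_000, 5))

# Let the algorithm group similar events instead of eyeballing them.
model = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# Cluster sizes give a quick, automated summary of the data's structure.
print(np.bincount(model.labels_))
```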

3. HOW BIG DATA WORKS ?

Even companies that are fully committed to big data, that have defined the business
case and are ready to mature beyond the “science project” phase, face a daunting
question: how do we make big data work?

The massive hype, and the perplexing range of big data technology options and vendors, make finding the right answer harder than it needs to be. The goal must be to design and build an underlying big data environment that is low in cost and complexity, stable, highly integrated, and scalable enough to move the entire organization toward true data-and-analytics centricity.

Data-and-analytics centricity is a state in which the power of big data and big data analytics is available to all the parts of the organization that need it, together with the underlying infrastructure, data streams and user toolsets required to discover valuable insights, make better decisions and solve actual business problems. That's how big data should work.

Attributes of Highly Effective Big Data Environments

• Seamlessly Use Data Sets: Much of the payoff comes through the mixing, combining and contrasting of data sets – so there's no analytics-enabled innovation without integration
• Flexible, Low Cost: The target here is low complexity and low cost, with sufficient flexibility to scale for future needs, which will be both larger-scale and more targeted at specific user groups
• Stable: Stability is critical because the data volumes are massive and users need to easily access and interact with data. In this sense, infrastructure performance holds a key to boosting business performance through big data.
4. PRIVACY AND SECURITY ISSUES AND CHALLENGES WITH BIG
DATA :
• Secure Computations in Distributed Programming Frameworks
Distributed programming frameworks use parallel computing and data storage for massive amounts of data; an example is the MapReduce framework. As mentioned earlier, MapReduce divides an input file into many chunks; a mapper for each chunk reads the data, performs computations and emits output in the form of key/value pairs, and a reducer then combines the values belonging to each unique key and outputs the results. The main concerns here are securing the mappers and securing the data from a malicious mapper. Mappers returning incorrect results are difficult to detect and eventually produce incorrect aggregate outputs. With very large data sets, malicious mappers are also very hard to detect and can eventually damage essential data. Mappers leaking private records, intentionally or unintentionally, are a concern as well. MapReduce computations are often subjected to replay attacks, man-in-the-middle attacks and denial-of-service attacks. Rogue data nodes can be added to a cluster and in turn receive replicated data or deliver altered MapReduce code. Creating snapshots of legitimate nodes and reintroducing altered copies is an easy attack in cloud and virtual environments and is difficult to detect.
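The problem of mappers silently returning wrong results is usually countered with some form of redundancy. The sketch below shows one simple, assumed approach (it is not prescribed by this paper): run each chunk through two independent mapper instances and flag chunks whose outputs disagree.

```python
# Hedged sketch of one possible mitigation (an assumption, not from this paper):
# run every chunk through two independent mapper processes and flag any chunk
# where the outputs disagree, so a faulty or malicious mapper cannot silently
# corrupt the aggregate result.

def trusted_mapper(chunk):
    return sorted((word.lower(), 1) for word in chunk.split())

def verify_chunk(chunk, mapper_a, mapper_b):
    """Return the mapped output, or raise if the two mappers disagree."""
    out_a, out_b = mapper_a(chunk), mapper_b(chunk)
    if out_a != out_b:
        raise RuntimeError("mapper outputs disagree; chunk needs re-execution")
    return out_a

# A compromised mapper that drops records it does not like.
def malicious_mapper(chunk):
    return sorted((w.lower(), 1) for w in chunk.split() if w != "fraud")

chunk = "possible fraud detected"
print(verify_chunk(chunk, trusted_mapper, trusted_mapper))   # outputs agree, accepted
try:
    verify_chunk(chunk, trusted_mapper, malicious_mapper)    # mismatch is detected
except RuntimeError as err:
    print(err)
```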
• Security Best Practices for Non-Relational Data Stores
Non-relational databases used to store big data, mainly NoSQL databases, handle many of the challenges of big data analytics without much concern for security issues. NoSQL databases rely on security being embedded in the middleware; no explicit security enforcement is provided by the databases themselves. Transactional integrity maintenance is very lax in NoSQL databases, and complex integrity constraints cannot be imposed on them, as doing so would hamper the performance and scalability they are designed to provide. NoSQL databases have weak authentication techniques and weak password storage mechanisms. They use HTTP Basic or Digest based authentication and are therefore subject to man-in-the-middle attacks. REST (Representational State Transfer) over HTTP is prone to cross-site scripting, cross-site request forgery and injection attacks such as JSON injection, array injection, view injection, REST injection, GQL (Generalized Query Language) injection, schema injection and others. NoSQL also lacks support for blocking with the help of third-party tools. Authorization techniques in NoSQL provide authorization at higher layers only: authorization is granted at the per-database level rather than at the level where the data are collected. Because of these lenient security mechanisms, NoSQL databases are subject to insider attacks as well, and such attacks may go unnoticed due to poor logging and log analysis methods along with other weak fundamental security mechanisms.
• Secure Data Storage and Transaction Logs
Data and transaction logs used to be kept in multi-tiered storage media. As data sizes grew, scalability and accessibility became an issue, and auto-tiering for big data storage came to the fore. Auto-tiering does not keep track of where the data are stored, unlike previous multi-tiered storage where IT managers knew which data resided where and when; this gives rise to many new challenges for secure data storage. Untrustworthy storage service providers often search for clues that help them correlate user activities and data sets and learn certain properties, which can prove valuable to them. They are, however, unable to break into the data itself as long as it is enciphered. The data owner stores the ciphertext in an auto-tiered storage system and distributes the private key to individual users, granting each user the right to access certain portions of the data, while the service provider itself remains unauthorized to access the data. The provider may, however, conspire with users by exchanging keys and data, and can thereby obtain data it is not authorized to access. In a multi-user environment the service provider can also mount a rollback attack, serving outdated versions of the data even though updated versions have already been uploaded to the database. Data tampering and data loss caused by malicious users often result in disputes with the data storage provider or amongst users.
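The encipherment this paragraph relies on is typically applied by the data owner before anything reaches the untrusted provider. The sketch below, using the third-party "cryptography" package (the record contents and storage key are made up), shows the owner encrypting a record, handing only ciphertext to storage, and sharing the key out of band with an authorized user.

```python
# Hedged sketch: the data owner encrypts records before handing them to an
# untrusted storage provider, and shares the key only with authorized users.
# Uses the third-party "cryptography" package; record contents are made up.
from cryptography.fernet import Fernet

# Owner side: generate a key and encrypt the record.
key = Fernet.generate_key()          # distributed out of band to authorized users
owner_cipher = Fernet(key)
ciphertext = owner_cipher.encrypt(b"patient-42,diagnosis=...,2016-11-03")

# Only the ciphertext is uploaded; the provider can still correlate sizes and
# access patterns, but cannot read the plaintext without the key.
untrusted_storage = {"records/42": ciphertext}

# Authorized user side: fetch the ciphertext and decrypt with the shared key.
user_cipher = Fernet(key)
plaintext = user_cipher.decrypt(untrusted_storage["records/42"])
print(plaintext.decode())
```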
• End Point Input Validation/Filtering
Organizations collect data from a variety of sources, including hardware devices, software applications and endpoint devices. When collecting these data, validating both the data and their source is a challenge. Malicious users often tamper with the device from which the data are collected, or with the data collection application installed on the device, so that malicious data are fed into the central data collection system. Malicious users may also create fake IDs and use them to provide malicious data as input to the central data collection system. ID cloning attacks such as Sybil attacks are predominant in Bring Your Own Device (BYOD) scenarios, where a malicious user brings his own device, faked as a trusted device, and provides malicious input from it to the central data collection system. Input sources of sensory data can be manipulated as well, for example by artificially changing the reading of a temperature sensor and injecting the malicious value into the temperature collection process; GPS signals can be manipulated in much the same way. A malicious user may also change data while they are in transit from a genuine source to the central data collection system, which is in a sense a man-in-the-middle attack.
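A central collector can reject much of this tampering by authenticating the reporting device and sanity-checking values before they enter the pipeline. The sketch below is an assumed illustration using Python's standard hmac module; the device keys and the plausible temperature range are invented for the example.

```python
# Hedged sketch: validate sensor input at the collection endpoint.
# Device keys and the plausible temperature range are invented for
# illustration; they are not part of any real deployment described here.
import hmac
import hashlib

DEVICE_KEYS = {"sensor-17": b"shared-secret-for-sensor-17"}   # provisioned devices only
PLAUSIBLE_TEMP_RANGE = (-50.0, 60.0)                          # degrees Celsius

def accept_reading(device_id, payload, signature):
    """Accept a reading only from a known device, with a valid HMAC and a plausible value."""
    key = DEVICE_KEYS.get(device_id)
    if key is None:
        return False                                           # unknown / cloned ID
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False                                           # tampered in transit
    temperature = float(payload.split("=", 1)[1])
    return PLAUSIBLE_TEMP_RANGE[0] <= temperature <= PLAUSIBLE_TEMP_RANGE[1]

payload = "temp=21.5"
good_sig = hmac.new(DEVICE_KEYS["sensor-17"], payload.encode(), hashlib.sha256).hexdigest()
print(accept_reading("sensor-17", payload, good_sig))      # True: known device, valid value
print(accept_reading("sensor-17", "temp=400", good_sig))   # False: signature no longer matches
```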
• Real-Time Security Monitoring
Real-time security monitoring has been an ongoing challenge in the big data analysis scenario, mainly due to the number of alerts generated by security devices. These alerts, whether correlated or not, lead to many false positives, and because humans cannot successfully deal with such a huge number of them at such speed, the alerts end up being clicked away or ignored [9]. Security monitoring requires that the Big Data infrastructure or platform be inherently secure. Threats to a Big Data infrastructure include rogue admin access to applications or nodes, (web) application threats, and eavesdropping on the line. Since the infrastructure is mostly an ecosystem of different components, the security of each component and the secure integration of the components must be considered. In the case of a Hadoop cluster run in a public cloud, the security of the public cloud, itself an ecosystem of computing, storage and network components, needs to be considered, along with the security of the Hadoop cluster, the security of the nodes, the interconnections among the nodes and the security of the data stored in a node. The security of the monitoring application, including any correlation rules, which should follow secure coding principles, must be considered as well, and the security of the input source from which the data come must also be taken into account.
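One way to keep the alert volume humanly manageable is to correlate raw alerts before presenting them: group them by source and rule within a time window and escalate only when a threshold is crossed. The sketch below is a simplified, assumed illustration in plain Python; the window length and threshold are arbitrary values chosen for the example.

```python
# Hedged sketch: correlate raw security alerts so analysts see fewer,
# higher-confidence notifications. Window length and threshold are
# arbitrary values chosen for illustration.
from collections import defaultdict

WINDOW_SECONDS = 60       # correlate alerts that arrive within the same minute
ESCALATE_AFTER = 5        # only escalate once a (source, rule) pair repeats enough

def correlate(alerts):
    """alerts: iterable of (timestamp_seconds, source, rule). Returns escalations."""
    buckets = defaultdict(list)
    for ts, source, rule in alerts:
        buckets[(source, rule, ts // WINDOW_SECONDS)].append(ts)

    escalations = []
    for (source, rule, _window), hits in buckets.items():
        if len(hits) >= ESCALATE_AFTER:
            escalations.append((source, rule, len(hits)))
    return escalations

raw_alerts = [(t, "10.0.0.7", "failed-login") for t in range(0, 30, 2)]  # 15 hits in 30 s
raw_alerts.append((12, "10.0.0.9", "port-scan"))                          # single hit, likely noise
print(correlate(raw_alerts))   # [('10.0.0.7', 'failed-login', 15)]
```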

5. FUTURE SCOPE AND DEVELOPMENT:


As far as the future of big data is concerned, it is certain that data volumes will continue to grow, and the prime reason will be the drastic increase in the number of hand-held and internet-connected devices, which is expected to grow exponentially. SQL will remain the standard for data analysis, and Spark is emerging as the complementary tool for data analysis. Tools for analysis without the presence of an analyst are set to take over, with Microsoft and Salesforce both recently announcing features that let non-coders create apps for viewing business data. According to IDC, by 2020 half of all business analytics software will include intelligence where it is needed; in other words, prescriptive analytics will be built into business software. Programs like Kafka and Spark will enable users to make decisions in real time. Machine learning will play a far bigger role in data preparation and predictive analysis in businesses in the coming days.

Privacy and security challenges related to big data will grow, and by 2018, 50% of business ethics violations will be related to data. Chief Data Officers will be a common sight in companies in the near future, though it is thought that the role won't last long. Autonomous agents and things such as robots, autonomous vehicles, virtual personal assistants and smart devices will be a huge trend in the future. The big data talent crunch seen these days will ease in the coming days; the International Institute for Analytics predicts that companies will use recruiting and internal training to develop budding data scientists to solve their own problems. Businesses will soon be able to buy algorithms rather than program them themselves, and add their own data to them; existing services like Algorithmia, DataXu, and Kaggle will grow on a large scale, that is, algorithm markets will emerge. More companies will try to derive revenue from their data. The gap between insight and action in big data is going to shrink, and more energy will be devoted to obtaining insights and executing on them rather than merely collecting big data. Fast and actionable data will replace big data. Companies are expected to ask the right questions and make better use of the data they have; much of the big data they hold today goes unused.

6. CONCLUSION :
To handle big data, work with it, and obtain benefits from it, a branch of science called Data Science has emerged and is evolving. Data Science is the branch of science that deals with discovering knowledge from huge sets of data, mostly unstructured and semi-structured, by means of data inference and exploration. It is a revolution that is changing the world and finds application across industries such as finance, retail, healthcare, manufacturing, sports and communication. Search engine and digital marketing companies like Google, Yahoo and Bing, social networking companies like Facebook and Twitter, and finance and e-commerce companies like Amazon and eBay require, and will continue to require, a lot of data scientists. As far as security is concerned, existing technologies promise to evolve as newer vulnerabilities to big data arise and the need for securing it increases.

7. REFERENCES :
1. TechRepublic, Big Data blog, www.techrepublic.com/blog/big-data.
2. Coursera, Introduction to Big Data, www.coursera.org.
3. Wikipedia, Big data, https://en.wikipedia.org/wiki/Big_data.
4. A.R. Guess, The Most Common Big Data Management Issues (And Their Solutions), July 15, 2014, http://www.dataversity.net/common-big-data-management-issues-solutions/.
