
BigData: Issues, Challenges, Technologies and Methods

Khalid Adam, Mohammed Adam Ibrahim Fakharaldien, Jasni Mohamed Zain, Mazlina Abdul Majid and Ahmad Noraziah

Abstract With the dramatic increase of emerging applications such as social media, sensors, videos and digital pictures, data is now being generated everywhere. Owing to the characteristics of big data, it is very difficult to manage, analyze, store, transport, and process such data using existing traditional techniques. This paper introduces several Big Data issues, technologies and methods. First, we introduce Big Data and its problems, the importance of Big Data, and technologies and methods of Big Data management, including the Distributed File System, NoSQL databases, and Hadoop/MapReduce. Next, we present the Big Data model and discuss the Big Data security challenges. Finally, we present the future work and the conclusion.

Keywords Big data · Hadoop · Hadoop distributed file system · MapReduce

1 Introduction

Nowadays, with the sharp increase of data, the volume generated every day is expanding in a drastic manner. Big data is a popular term used to describe data measured in zettabytes [1], and such large amounts of data (exabytes and beyond) are difficult to handle. The main difficulty is that the volume is increasing rapidly in comparison to the available computing resources.

K. Adam · M. A. I. Fakharaldien · J. M. Zain · M. A. Majid · A. Noraziah (B)
Faculty of Computer Systems and Software Engineering, University Malaysia Pahang, Gambang, 26300 Kuantan, Pahang, Malaysia
e-mail: noraziah@ump.edu.my
K. Adam
e-mail: khalidwsn15@gmail.com
M. A. I. Fakharaldien
e-mail: adamibrahim@ump.edu.my
J. M. Zain
e-mail: jasni@ump.edu
M. A. Majid
e-mail: mazlina@ump.edu.my

Moreover, Big Data does not have a single definition, because data comes from ubiquitous sources such as social media, videos, digital pictures, sensors used to collect climate data, purchase transaction records, and cell phones (Fig. 1; Table 1).

1.1 What Is the Big Data Problem?

Big Data has emerged because we live in a society that makes increasing use of data-intensive technologies. One defining feature of big data is the difficulty of working with it using relational databases and desktop statistics/visualization packages; it requires instead massively parallel software running on tens, hundreds, or even thousands of servers [2]. The various challenges faced in big data management include analytics, unstructured data, fault tolerance, and so on. Big data can be defined by the following properties associated with it (Fig. 2):
Variety: refers to the types of data, such as structured and unstructured data, text, images, audio, video, log files, emails, simulations, 3D models, etc.
Volume: refers to the size of data, which can amount to hundreds of terabytes or even petabytes of information generated from everywhere.
Velocity: refers to the speed at which new data is generated and the speed at which data moves around.

Fig. 1 Big data sources

Fig. 2 Big data characteristics

Table 1 Distributed database open source tools

|             | MongoDB               | CouchDB    | Riak                                | Redis                       | Voldemort                        | Cassandra        | HBase       |
| Language    | C++                   | Erlang     | Erlang                              | C++                         | Java                             | Java             | Java        |
| License     | AGPL                  | Apache     | Apache                              | BSD                         | Apache                           | Apache           | Apache      |
| Model       | Document              | Document   | Key/value                           | Key/value                   | Key/value                        | Wide column      | Wide column |
| Protocol    | BSON                  | HTTP/REST  | HTTP/REST or TCP/Protobufs          | TCP                         | TCP                              | TCP/Thrift       | HTTP/REST   |
| Storage     | Memory-mapped B-trees | COW B-tree | Pluggable: InnoDB, LevelDB, Bitcask | In memory, snapshot to disk | Pluggable: BDB, MySQL, in-memory | Memtable/SSTable | HDFS        |
| Inspiration |                       |            | Dynamo                              |                             | Dynamo                           | BigTable, Dynamo | BigTable    |
| Search      | Yes                   | No         | Yes                                 | No                          | No                               | Yes              | Yes         |
| MapReduce   | Yes                   | No         | Yes                                 | No                          | No                               | Yes              | Yes         |

1.2 The Importance of Big Data

In 2012, the Obama administration announced a big data initiative committing more than $200 million in research and development investments, including for the National Science Foundation [3]. Big data is different from the data stored in traditional warehouses: warehouse data first needs to be cleansed, documented and even trusted, and it must fit the basic structure of that warehouse to be stored. This is not the case with Big Data, which covers not only the data stored in traditional warehouses but also the data not suitable for them. This gives access to mountains of data and supports better business strategies and decisions, since analysis of more data is always better.
According to Gartner Inc., big data technology will grow radically over the next few years. Gartner research predicts that by 2015, 4.4 million IT jobs will have been created to staff big data initiatives. Even now, a Gartner survey has found that 40% of almost 500 IT executives have invested or will invest in big data technology.

2 How Fast Data Is Increasing

Companies like Google, Facebook, Twitter and Skype generate data every 60 s; from this we can understand how much data is being generated in a second, a minute, a day or a year, and how exponentially it is growing. As per an analysis by TechNewsDaily, we might generate more than 8 zettabytes of data by 2015 (Fig. 3).

3 Technologies and Methods of Big Data Management

Many researchers have suggested that commercial DBMSs are not suitable for processing extremely large-scale data. The classic architecture's potential bottleneck is the database server when faced with peak workloads, and a single database server has restrictions on scalability and cost, which are two important goals of big data processing [4]. Table 1 shows the open-source distributed databases.
Fig. 3 Big data generated by companies

3.1 Distributed File System

A distributed file system is a client/server-based application that enables clients to access and process data stored on the server as if the data were local to the client. When a user accesses a file on the server, the server generally sends the client a copy of the file, which is cached on the client while the data is being processed and is then returned to the server. There are many distributed file systems, such as the Google File System (GFS) from Google [5], the Andrew File System (AFS) [6] and the Network File System (NFS).
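
To make the client/server interaction concrete, the following is a minimal sketch (not taken from any of the surveyed systems' documentation) of reading a file from HDFS through the Hadoop FileSystem API in Java; the NameNode address and file path are hypothetical placeholders.

```java
// A minimal sketch of reading a file from a distributed file system
// (HDFS) via the Hadoop FileSystem API. The cluster address and path
// are hypothetical; a real deployment takes them from core-site.xml.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address.
    conf.set("fs.defaultFS", "hdfs://namenode:9000");
    FileSystem fs = FileSystem.get(conf);

    // Open the remote file and read it as if it were local; the client
    // streams block data from the DataNodes that hold the replicas.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```

Note that the client contacts the central metadata server only to locate the data; the file contents themselves travel directly between the client and the storage nodes.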

3.2 NoSQL Database

Until recently, relational database management systems (RDBMS) were the mainstay for managing all types of data, irrespective of how naturally they fit the relational data model. The emergence of Big Data and mobile computing necessitated new database functionality to support applications such as real-time log file analysis, cross-selling in eCommerce, location-based services, and micro-blogging. Many of these applications exhibit a preponderance of insert and retrieve operations on a very large scale. Relational databases were found to be inadequate in handling the scale and varied structure of this data. These requirements ushered in an array of choices for Big Data management under the umbrella term NoSQL [7]. NoSQL is a new way to describe a class of databases built for data at high scale, although a NoSQL database does not structure data as a traditional relational database does.
Fig. 4 Scalability of a NoSQL database versus a traditional relational database

NoSQL systems provide data partitioning and replication as built-in features. They typically run on clusters of commodity hardware and provide horizontal scalability. Developing applications with NoSQL systems is quite different from the process used with an RDBMS: NoSQL databases require a developer-centric approach from application inception to completion. For example, data modeling is done by the application architect or developer, whereas in RDBMS-based applications, data architects and data modelers complete the data modeling tasks. They begin by constructing conceptual and logical data models; database transactions and queries come into play only in the design of the physical database. In contrast, NoSQL database approaches begin by identifying the application queries and structure the data model to efficiently support those queries. In other words, there is a strong coupling between the data model and the application queries, and any changes to the queries will necessitate changes to the data model. This approach is in stark contrast to the time-tested RDBMS principles of logical and physical data independence. Traditionally, data is viewed as a strategic and shared corporate resource, with well-established policies and procedures for data governance and quality control. In contrast, NoSQL systems promote data silos, each silo geared towards meeting the performance and scalability requirements of just one or a few applications. This runs against the ethos of enterprise data integration, redundancy control, and integrity constraints [8] (Fig. 4).
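
As an illustration of this query-first, developer-centric modeling style, here is a hedged sketch using the MongoDB Java driver; the database, collection, and field names are invented for the example, and a real schema would be derived from the application's actual query set.

```java
// A sketch of NoSQL "query-first" modeling: the document is shaped
// around one known query (orders by customer), so related data is
// embedded rather than joined. All names here are illustrative.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;
import static com.mongodb.client.model.Filters.eq;

public class QueryFirstModel {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
      MongoCollection<Document> orders =
          client.getDatabase("shop").getCollection("orders");

      // Embed line items inside the order document so the target query
      // is served by a single retrieval, with no join.
      orders.insertOne(new Document("customerId", "c42")
          .append("items", Arrays.asList(
              new Document("sku", "A1").append("qty", 2),
              new Document("sku", "B7").append("qty", 1))));

      // The one query this data model was designed to support.
      for (Document order : orders.find(eq("customerId", "c42"))) {
        System.out.println(order.toJson());
      }
    }
  }
}
```

If a new query were added later (say, "all orders containing SKU A1"), the document structure or indexing would have to be revisited, which is exactly the coupling between queries and data model described above.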

3.3 Open Source Cloud Platform

The main idea behind cloud computing is to offer convenient, on-demand network access to a shared pool of configurable computing resources (CPU, storage, bandwidth). Cloud services with the ability to ingest, store and analyse data have been available for some time, and they enable organizations to overcome the challenges associated with Big Data.

3.4 Hadoop/MapReduce

Most enterprises are facing lots of new data, which arrives in many different forms. Big data has the potential to provide insights that can transform every business, and it has generated a whole new industry of supporting architectures such as MapReduce and Hadoop. Map/Reduce is a programming paradigm made popular by Google, in which a task is divided into small portions and distributed to a large number of nodes for processing (map), and the results are then summarized into the final answer (reduce). Hadoop uses Map/Reduce for data processing [9].
MapReduce. MapReduce is the programming model that allows Hadoop to efficiently process large amounts of data [10]. MapReduce breaks large data processing problems into multiple steps, namely a set of Maps and Reduces that can each be worked on at the same time (in parallel) on multiple computers. MapReduce is designed to work with HDFS: Apache Hadoop automatically optimizes the execution of MapReduce programs so that a given Map or Reduce step is run on the HDFS node that locally holds the blocks of data required to complete the step.
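
The classic word-count program illustrates this model; the sketch below follows the widely known Hadoop example (a minimal version, not code from this paper). The Map step tokenizes input blocks in parallel and emits (word, 1) pairs, and the Reduce step sums the counts per word.

```java
// Minimal Hadoop MapReduce word count: map emits (word, 1) pairs,
// reduce sums the counts for each word across all input splits.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);  // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));  // emit (word, total)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each Map task is scheduled on (or near) the node holding its input block, the heavy tokenization work runs where the data already lives, and only the much smaller (word, count) pairs cross the network.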
Apache Hadoop. Apache Hadoop is a platform that offers an efficient and effective method for storing and processing massive amounts of data. Unlike traditional offerings, Hadoop was designed and built from the ground up to address the requirements and challenges of big data. Hadoop is powerful in its ability to allow businesses to stop worrying about building big-data-capable infrastructure and to focus on what really matters: extracting business value from the data. Apache Hadoop use cases are many and show up in many industries, including risk, fraud and portfolio analysis in financial services; behavior analysis and personalization in retail; social network, relationship and sentiment analysis for marketing; and drug interaction modelling and genome data processing in healthcare and life sciences, to name a few. Hadoop comes with its default distributed file system, the Hadoop Distributed File System (HDFS). It stores files in blocks of 64 MB and can store files of varying sizes, from hundreds of megabytes up to gigabytes or terabytes. The Hadoop architecture contains the Name Node, Data Nodes, Secondary Name Node, Task Trackers and a Job Tracker [11]. The Name Node maintains the metadata about the blocks stored in HDFS; files are stored as blocks in a distributed manner. The Secondary Name Node does the work of checking the validity of the Name Node and updating the Name Node information from time to time. Data Nodes actually store the data. The Job Tracker receives a job from the user, splits it into parts, and assigns these split jobs to Task Trackers. Task Trackers run on the Data Nodes; they fetch data from the Data Node, execute the tasks, and talk continuously to the Job Tracker, which coordinates the job submitted by the user. Each Task Tracker has a fixed number of slots for running tasks, and the Job Tracker selects a Task Tracker that has free slots. It is preferable to choose a Task Tracker on the same rack where the data is stored; this is known as rack awareness, and with it inter-rack bandwidth can be saved. Figure 5 shows the arrangement of the different components of Hadoop on a single node. In this arrangement all the components (Name Node, Secondary Name Node, Data Node, Job Tracker, and Task Tracker) are on the same system. The user submits a job in the form of a MapReduce task. The Data Node and the Task Tracker are on the same system so that the best speed for reads and writes can be achieved (Fig. 5).

Fig. 5 HDFS architecture
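
As a small illustration of the metadata the Name Node serves, the sketch below (with an assumed file path and default cluster configuration) asks for the block locations of an HDFS file; these per-block host lists are what the Job Tracker consults when placing a Task Tracker near the data.

```java
// Query the NameNode for the block locations of an HDFS file. Each
// BlockLocation names the DataNodes holding a replica of one block
// (e.g. one 64 MB chunk), which is the basis for rack-aware scheduling.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Assumed file path, purely for illustration.
    FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));

    // One entry per block of the file.
    for (BlockLocation block :
        fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```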
The Future of Hadoop. Hadoop has "crossed the chasm" from a framework for early adopters, developers and technology enthusiasts to a strategic data platform embraced by innovative CTOs and CIOs across mainstream enterprises. These people, who want to improve the performance of their companies and unlock new business opportunities, realize that including Apache Hadoop as a deeply integrated supplement to their current data architecture offers the fastest path to reaching their goals while maximizing their existing investments. Going forward, Hortonworks and the Apache Hadoop community will continue to focus on increasing the ease with which enterprises deploy and use Hadoop, and on increasing the platform's interoperability with the broader data ecosystem. This includes making certain it is reliable, stable and, more importantly, ready for any and all enterprise workloads.

4 Big Data Modeling and Security

Although distributed data analysis platforms may offer a solution for dealing with data at great scale, data-based modelling in the big data environment still remains a challenge. There are two possible solutions to the big data modelling problem: (a) design a deep learning modelling algorithm, making use of the strong ability of machine learning to process massive high-dimensional data; or (b) divide the entire dataset into subsets, build sub-models on those subsets, and then obtain the entire model by integrating all the sub-models according to specific strategies [12].
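
As a toy illustration of strategy (b), the sketch below partitions a dataset, fits a trivial sub-model (the subset mean) on each partition in parallel, and integrates the sub-models with a size-weighted average; real deployments would train far richer models per subset, so this only shows the divide-and-integrate structure.

```java
// Toy divide-and-integrate modelling: split the data into subsets,
// fit a trivial sub-model (mean) per subset, then combine sub-models
// with a size-weighted average. Illustrative only.
import java.util.Arrays;
import java.util.stream.IntStream;

public class SubModelEnsemble {
  public static void main(String[] args) {
    double[] data = IntStream.range(0, 1_000_000)
        .mapToDouble(i -> Math.sin(i) + i % 10)  // synthetic data
        .toArray();

    int parts = 4;
    int chunk = (data.length + parts - 1) / parts;

    // Each subset could be handled by a different node (for example,
    // one MapReduce task per partition); here it runs locally.
    double[] subModels = IntStream.range(0, parts).parallel()
        .mapToDouble(p -> {
          int from = p * chunk;
          int to = Math.min(from + chunk, data.length);
          return Arrays.stream(data, from, to).average().orElse(0);
        })
        .toArray();

    // Integration strategy: weight each sub-model by its subset size.
    double weightedSum = 0;
    for (int p = 0; p < parts; p++) {
      int from = p * chunk;
      int to = Math.min(from + chunk, data.length);
      weightedSum += subModels[p] * (to - from);
    }
    System.out.println("ensemble estimate = " + weightedSum / data.length);
  }
}
```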
With the increasing use and expanding scope of big data, big data security has come to be considered crucial. There are many security issues around big data; among them, data protection and access control are recognized as the most important. This is similar to the current information security situation. However, data management and classification for security purposes are more difficult than in current information security practice because of the sheer volume of data [13]. For this reason, the management cost per GB has decreased, but security investment for big data has increased. Similarly, access control is more difficult due to the huge data scale. As mentioned earlier in the introduction, value is the key deliverable of big data; the data itself is not the subject of protection, and securing the entire dataset is very inefficient considering the volume of big data.

5 Future Work

Using big data tools like Hadoop is one way to manage large quantities of data. Hadoop is an open-source programming framework that supports massive data sets and data transfers, and its streaming technology can capture and store information as fast as it flows in. Design patterns are another way to help reduce some of the complexity associated with big data: they provide template solutions to recurring problems in big data management. Using a variety of purpose-driven design patterns, developers can mash up semi-structured data, find event sequence signals, respond to signal patterns in real time and match up cloud-based data services.

6 Conclusion

This paper presented an overview of the concept of Big Data, the importance of Big Data, its technologies and methods, the Big Data problem, and Big Data modeling. The discussion shows that, even with the data, tools and techniques available in the literature, there are many points still to be considered, discussed, improved, developed and analyzed. Besides, we also discussed the Big Data security challenges and the future work. Although this paper clearly has not resolved the entire subject of this substantial topic, hopefully it has provided some useful discussion and a framework for researchers.

Acknowledgements Appreciation conveyed to the Ministry of Higher Education Malaysia for project financing under Exploratory Research Grant Scheme RDU120608, and to University Malaysia Pahang under UMP Short Term Grant RDU1403163.

References

1. Ji, C., Li, Y., Qiu, W., Awada, U., Li, K.: Big data processing in cloud computing environments. In: International Symposium on Pervasive Systems, Algorithms and Networks (2012)
2. Patel, A.B., Birla, M., Nair, U.: Addressing big data problem using Hadoop and MapReduce. In: NIRMA University International Conference on Engineering (2012)
3. Kaisler, S., Armour, F., Espinosa, J.A., Money, W.: Big data: issues and challenges moving forward. In: Hawaii International Conference on System Sciences. https://doi.org/10.1109/hicss.2013.645 (2013)
4. Ji, C., Li, Y., Qiu, W., Awada, U., Li, K.: Big data processing in cloud computing environments. In: International Symposium on Pervasive Systems, Algorithms and Networks (2012)
5. UzZaman, N.: Survey on Google file system. Survey paper for CSC 456 (Operating Systems), University of Rochester (2007)
6. Ghemawat, S., Gobioff, H., Leung, S.-T.: The Google file system. ACM 1-58113-757-5/03/0010, Bolton Landing, New York, USA (2003)
7. Gudivada, V.N., Rao, D., Raghavan, V.V.: NoSQL systems for big data management. In: IEEE 10th World Congress on Services, 978-1-4799-5069-0/14 (2014)
8. Zhang, Q., Chen, Z., Lv, A., Zhao, L., Liu, F., Zou, J.: A universal storage architecture for big data in cloud environment. In: IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing, 978-0-7695-5046-6/13 (2013)
9. Nandimath, J., Patil, A., Banerjee, E., Kakade, P., Vaidya, S.: Big data analysis using Apache Hadoop. IEEE, San Francisco, California, USA, 978-1-4799-1050-2/13 (2013)
10. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Google, Inc. (2004)
11. Pal, A., Agrawal, P., Jain, K., Agrawal, S.: A performance analysis of MapReduce task with large number of files dataset in big data using Hadoop. In: Fourth International Conference on Communication Systems and Network Technologies, 978-1-4799-3070-8/14 (2014)
12. Kong, W.C., Wu, Q., Li, L., Qiao, F.: Intelligent data analysis and its challenges in big data environment. In: IEEE International Conference on System Science and Engineering (ICSSE), 978-1-4799-4367-8/14 (2014)
13. Kim, S.-H., Kim, N.-U., Chung, T.-M.: Attribute relationship evaluation methodology for big data security. IEEE, 978-1-4799-2845-3/13 (2013)
