Big Data Management and Analysis
Khalid Adam, Mohammed Adam Ibrahim Fakharaldien, Jasni Mohamed Zain, Mazlina Abdul Majid
Faculty of Computer Systems and Software Engineering,
Universiti Malaysia Pahang,
Kuantan, Malaysia
Khalidwsn15@gmail.com, adamibrahim@ump.edu.my, jasni@ump.edu.my, mazlina@ump.edu.my

Abstract: Big Data is a term for huge data sets of high Volume, high Velocity, and high Variety and with complex structure, which are difficult to manage, analyze, store, and process. Because of these characteristics, it becomes very difficult to manage, analyze, store, transport, and process the data using existing traditional techniques. This paper introduces Big Data management and analysis: first an introduction to Big Data, then the impact of Big Data on storage infrastructure, then Big Data analysis and management (including Big Data over cloud computing and Hadoop HDFS and MapReduce), and finally the conclusion and future work.

Keywords: Big Data, Storage, Hadoop/MapReduce.

1. Introduction

Nowadays, we live in an ever more interconnected world that generates a great volume of information every day, from the log files of users of social networks, search engines, and e-mail clients to machine-generated data such as the real-time monitoring of sensor networks on dams and bridges and on vehicles such as airplanes, cars, and ships. According to an infographic by Intel, 90% of the data existing today was created in the last two years, and the growth continues. It is estimated that all the data generated worldwide from the beginning of time until 2003 amounted to about 5 exabytes (1 exabyte equals 1 million gigabytes), that the amount generated up to 2012 was 2.7 zettabytes (1 zettabyte equals 1000 exabytes), and that this figure is expected to grow threefold by 2015 [1]. For example, the number of RFID tags sold globally is projected to rise from 12 million in 2012 to 209 billion in 2021. All of this represents a great amount of data, and it raises challenges in acquiring, organizing, and analyzing it. Big Data is an umbrella term describing all of the types of information mentioned above. As the name suggests, Big Data refers to a great volume of data, but volume alone is not enough to describe the concept. The data presents great variety: it is usually unsuitable for treatment in typical relational databases, being raw, semi-structured, or unstructured. Also, the data will be processed in different ways, depending on the analysis that needs to be done or on the information that must be found in the initial data. Usually, this large amount of data is produced with great velocity and must be captured and processed quickly.

2. What is Big Data?

Big Data is a term that describes the exponential growth of data, both structured and unstructured. Data now comes from everywhere (social media, videos, digital pictures, sensors, etc.), which makes it difficult for software tools to capture, analyze, manage, and process it within a tolerable elapsed time [2]. Big Data has three characteristics, Volume, Velocity, and Variety, as shown in Figure 1.

Figure 1: Big data characteristics

Volume: refers to the amount of data. Many factors contribute to the increase in data volume, which can reach hundreds of terabytes or even petabytes of information generated everywhere [3].
Velocity: describes the speed at which new data is generated and the speed at which data moves around, which makes the data difficult to deal with [3, 4].
Variety: refers to the types of data, which come in many formats such as text, images, audio, video, log files, emails, financial transactions, simulations, 3D models, etc. [3].
3. Related Work

Big Data can be handled using Hadoop and MapReduce. Multi-node clusters can be set up with Hadoop and MapReduce, and files of different sizes can be stored in such a cluster [5]. A shared-disk analysis of Hadoop has been carried out, in which files of terabyte size are generated, stored in the Hadoop cluster, and analyzed. Big Data can also be provided as a service in cloud computing (Data-as-a-Service), which makes Big Data analysis important in a cloud computing environment. For the management of the data there are traditional database management systems, appliances, Hadoop, in-memory systems, and Solid State Disks (SSDs); testing shows that Hadoop provides the best service in terms of cost, scalability, and unstructured data [6].

4. Big Data Impact on Storage Infrastructure

Today, we collect and store data from a myriad of sources: biological science and research, sensors and mobile phones and their applications, social media activity, mobile devices, and automated sensors, to name a few. Software always paves the path for new and improved hardware, and Big Data, with all its computing and storage needs, is driving the development of storage hardware, network infrastructure, and new ways of handling ever-increasing computing needs. The most important infrastructure aspect of Big Data analytics is storage.

A. Big Data Capacity of Storage

Data over the size of a petabyte is considered Big Data. The amount of data increases rapidly, so the storage must be highly scalable as well as flexible, so that the entire system does not need to be brought down to increase storage. Big Data also translates into an enormous amount of metadata, which a traditional file system cannot support. To achieve this scalability, object-oriented file systems should be leveraged.

B. Big Data Latency

Big Data analytics involves social media tracking and transactions, which are leveraged for tactical decision-making in real time. Thus, Big Data storage cannot be latent, or it risks becoming stale data. Some applications might require real-time data for real-time decision-making. Storage systems must be able to scale out without sacrificing performance, which can be achieved by implementing a flash-based storage system.

C. Big Data Access

Since Big Data analytics is used across multiple platforms and host systems, there is a greater need to cross-reference data and tie it all together in order to give the big picture. Storage must be able to handle data from the various source systems at the same time.

D. Cost

Big Data also translates into big prices, and the most expensive component of Big Data analytics is storage. Certain techniques, such as data de-duplication, using tape for backup, data redundancy, and building custom hardware instead of using market-available storage appliances, can significantly bring down costs.

E. Flexibility

Big Data typically feeds a Business Intelligence application, which requires data integration and migration. However, given the scale of Big Data, the storage system needs to grow without requiring data migration, while simultaneously being flexible enough to accommodate different types and sources of data, again without sacrificing performance or latency.

F. Security

Because data is cross-referenced at a new level to yield a bigger picture, data-level security considerations beyond existing IT scenarios might be required. Storage should be able to handle these kinds of data-level security requirements without sacrificing scalability or latency.

5. Big Data Analysis and Management

Big Data analytics differs from traditional analytics because of the large increase in the volume of data. Many researchers have suggested commercial DBMSs, but these are not suited to data of this size: such data is impossible to handle using traditional relational database management systems. New, innovative technologies were needed, and Google found a solution with a processing model called MapReduce. There are other solutions for handling Big Data, but the most widely used one is Hadoop, an open-source project based on Google's MapReduce and Google File System and hosted by the Apache Software Foundation. The main contributors to the project are Yahoo, Facebook, Citrix, Google, Microsoft, IBM, HP, Cloudera, and many others. Hadoop is a distributed batch processing infrastructure which consists of the Hadoop kernel, the Hadoop Distributed File System (HDFS), MapReduce, and several related projects.

5.1 Data Mining Analysis

Data mining is commonly defined as the technique of extracting useful knowledge from databases [7]. It is almost impossible to derive that value directly from the raw data; for this reason, data mining needs pre-processing and analytic methods to find the value. Indeed, data mining is closely related to artificial intelligence, machine learning, and so on. The scale of the data managed in data mining and in Big Data differs significantly, but the basic method of extracting value is very similar. In data mining, the process of extracting knowledge involves data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Big Data emerged from addressing the requirements and challenges of data mining: handling different types of data, efficiency and scalability of the data mining algorithms, and mining information from different sources of data.
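As a minimal, self-contained illustration of the pre-processing steps listed above (cleaning, selection, and transformation), the following Java sketch filters malformed records and normalizes fields before any mining would take place. The record layout and field names here are hypothetical, invented for this example only.

import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class PreprocessSketch {
    public static void main(String[] args) {
        // Hypothetical raw records: "userId,country,purchaseAmount"
        List<String> raw = List.of(
            "17,MY,120.50",
            "  18 , my , 75.00",
            "19,,n/a",          // malformed: missing country, non-numeric amount
            "20,SG,310.25");

        // Cleaning: drop records that do not have three well-formed fields.
        // Transformation: trim whitespace, normalize country codes to upper case.
        List<String[]> cleaned = raw.stream()
            .map(line -> line.split(","))
            .filter(f -> f.length == 3)
            .map(f -> new String[] {
                f[0].trim(),
                f[1].trim().toUpperCase(Locale.ROOT),
                f[2].trim() })
            .filter(f -> !f[1].isEmpty() && f[2].matches("\\d+(\\.\\d+)?"))
            .collect(Collectors.toList());

        cleaned.forEach(f -> System.out.println(String.join("|", f)));
    }
}

Only after such cleaning and transformation would the mining and pattern-evaluation stages operate on the data; at Big Data scale the same logic is distributed across a cluster rather than run on a single machine.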
5.2 Big Data over Cloud Computing

Cloud computing is usually defined as a type of computing that relies on a shared pool of computing resources rather than on local servers or personal devices to handle applications [8]. Current technologies such as cloud computing platforms and grids are all intended to give access to huge amounts of computing resources (software, hardware, applications) offered in a single system view. Among these technologies, cloud computing is becoming a powerful architecture for performing large-scale and complex computing, and it has revolutionized the way computing infrastructure is abstracted and used. Moreover, a main goal of cloud computing is to deliver computing as a solution for tackling Big Data, such as high-dimensional data sets, large sizes, and multimedia [9]. Several leading information technology solution providers offer these services to customers. Now that the concept of Big Data has come up, the cloud computing service model is gradually turning into Big Data service models: AaaS (Analysis as a Service), BDaaS (Big Data as a Service), and DaaS (Database as a Service).

5.3 Hadoop HDFS and MapReduce

Hadoop comes with its own distributed file system, the Hadoop Distributed File System (HDFS) [10]. HDFS stores files in blocks of 64 MB and can store files of varying sizes, from around 100 MB up to gigabytes and terabytes. The Hadoop architecture contains a Name Node, Data Nodes, a Secondary Name Node, Task Trackers, and a Job Tracker. The Name Node maintains the metadata about the blocks stored in HDFS; the files themselves are stored as blocks in a distributed manner. The Secondary Name Node maintains the validity of the Name Node, updating the Name Node's information from time to time. The Data Nodes actually store the data. The Job Tracker receives a job from the user, splits it into parts, and assigns the split jobs to the Task Trackers. Task Trackers run on the Data Nodes; they fetch the data from the Data Node, execute the task, and continuously talk to the Job Tracker, which coordinates the jobs submitted by the user. Each Task Tracker has a fixed number of slots for running tasks, and the Job Tracker selects a Task Tracker that has free slots. It is useful to choose a Task Tracker on the same rack where the data is stored; this is known as rack awareness, and it saves inter-rack bandwidth. Figure 2 shows the arrangement of the different Hadoop components on a single node. In this arrangement, the Name Node, Secondary Name Node, Data Node, Job Tracker, and Task Tracker are all on the same system. The user submits a job in the form of a MapReduce task. Because the Data Node and the Task Tracker are on the same system, the best read and write speeds can be achieved.

Figure 2: Hadoop architecture
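To make the Name Node/Data Node division of labor concrete, the short sketch below uses the standard Hadoop client API (org.apache.hadoop.fs) to copy a local file into HDFS and report how it was stored. The NameNode address hdfs://namenode:9000 and the file paths are placeholder values, not configuration from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/input.txt");        // placeholder local file
        Path remote = new Path("/user/demo/input.txt"); // placeholder HDFS path

        // The client asks the Name Node where to write; the bytes themselves
        // flow to Data Nodes, split into fixed-size blocks.
        fs.copyFromLocalFile(local, remote);

        FileStatus status = fs.getFileStatus(remote);
        System.out.println("Stored in blocks of " + status.getBlockSize()
            + " bytes, replication factor " + status.getReplication());
        fs.close();
    }
}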
MapReduce was introduced by Google in order to process and store large datasets on commodity hardware. It provides a programming paradigm that allows a usable and manageable distribution of many computationally intensive tasks. As a result, many programming languages now have MapReduce implementations, which extends its uptake. Hadoop is a highly popular free MapReduce implementation by the Apache Foundation [11]. With the popularity of Hadoop applications, many complementary applications have been developed by the open-source community and packaged under the Apache Foundation [12]. MapReduce involves two main parts, as illustrated in the sketch below:

• Map operation: a simple function is used to emit key/value pairs in parallel, similar to using primary keys in the relational database world, so that the data to be processed is mapped into key/value groups.

• Reduce operation: applies the core processing logic to those key/value groups to produce results in a timely manner [13].

The simple concept of MapReduce removes many of the traditional challenges in HPC of achieving fault tolerance and availability. It therefore paves the way for the development of highly parallel, highly reliable, distributed applications on large datasets.
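To make the two operations concrete, the following is the classic word-count job written against Hadoop's Java MapReduce API; it is the canonical introductory example of the paradigm, not code from the systems evaluated in this paper. The map step emits a (word, 1) pair for every token, and the reduce step sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map operation: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce operation: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When such a job is submitted (for example, with hadoop jar wordcount.jar WordCount <input> <output>), the framework splits the input, schedules map tasks close to the data blocks (rack awareness), and re-executes failed tasks, which is where the fault tolerance described above comes from.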
6. Conclusion and Future Work

Big Data provides enterprises with more choices thanks to its many related technologies and tools, which will continue to be developed and to become innovation hotspots in the future, such as Hadoop distributions, the next generation of data warehouses, advanced data visualization, etc. In recent years, academia has paid increasing attention to cloud computing. Big Data focuses on the "data": data services, data acquisition, analysis, and data mining, with particular attention to the ability to store data. Cloud computing focuses on the computing architecture and its practices. Big Data and cloud computing are two sides of the same issue: using cloud computing makes it possible to analyze and forecast Big Data more accurately and to release more of the hidden value of the data, and meeting the service demands of Big Data leads to even better practical applications of cloud computing. Nowadays, more and more enterprises hope to transfer their own applications and infrastructures to a cloud platform. Cloud computing brings great changes to Big Data. First, cloud computing provides quite cheap storage for Big Data and enables small and medium-sized enterprises to carry out Big Data analysis. Second, cloud computing has huge IT resources and is widely distributed, and it becomes an effective way for enterprises with heterogeneous systems to process data accurately.

7. References

[1] "Big Data Infographic: Solve Your Big Data Problems", Intel, http://www.intel.in/content/www/in/en/big-data/solving-big-data-problems-infographic.html

[2] Elena Geanina Ularu, Florina Camelia Puican, Anca Apostu and Manole Velicanu, "Perspectives on Big Data and Big Data Analytics", Database Systems Journal, vol. III, 2012.

[3] Avita Katal, Mohammad Wazid and R. H. Goudar, "Big Data: Issues, Challenges, Tools and Good Practices", IEEE, 978-1-4799-0192-0/13, 2013.

[4] Zaiying Liu, Ping Yang and Lixiao Zhang, "A Sketch of Big Data Technologies", International Conference on Internet Computing for Engineering and Science, IEEE, 978-0-7695-5118-0/13, 2013.

[5] Firat Tekiner and John A. Keane, "Big Data Framework", International Conference on Systems, Man, and Cybernetics, IEEE, 978-1-4799-0652-9, 2013.

[6] Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada and Keqiu Li, "Big Data Processing in Cloud Computing Environments", International Symposium on Pervasive Systems, Algorithms and Networks, IEEE, 1087-4089/12, 2012.

[7] M. Chen, J. Han and P. S. Yu, "Data Mining: An Overview from a Database Perspective", IEEE Transactions on Knowledge and Data Engineering, 1996.

[8] Divyakant Agrawal, Sudipto Das and Amr El Abbadi, "Big Data and Cloud Computing: Current State and Future Opportunities", EDBT, Uppsala, Sweden, 978-1-4503-0528-0/11/0003, 2011.

[9] Changqing Ji et al., "Big Data Processing: Big Challenges and Opportunities", Journal of Interconnection Networks, vol. 13, nos. 3 & 4, 2012.

[10] Amrit Pal et al., "A Performance Analysis of MapReduce Task with Large Number of Files Dataset in Big Data Using Hadoop", International Conference on Communication Systems and Network Technologies, IEEE, 978-1-4799-3070-8/14, 2014.

[11] White T., "Hadoop: The Definitive Guide", Third Edition, O'Reilly, 978-1-449-31152-0, May 2012.

[12] Saecker M. and Markl V., "Big Data Analytics on Modern Hardware Architectures: A Technology Survey", Springer Lecture Notes in Business Information Processing, vol. 138, pp. 125-149, 2013.

[13] McCreadie R., Macdonald C. and Ounis I., "MapReduce Indexing Strategies: Studying Scalability and Efficiency", Information Processing and Management, 48(5), pp. 873-888, ISSN 0306-4573, 2012.
