
UNIVERSITY OF WESTMINSTER

INVESTIGATING DATA
MANAGEMENT PRINCIPLES
IN CLOUD COMPUTING
DATA MANAGEMENT AND REPOSITORIES

IBRAHIM IDDRIS FAREED
12/20/2013





INVESTIGATING DATA MANAGEMENT IN CLOUD COMPUTING
INTRODUCTION
Cloud computing has been an emerging technology in the ICT industry over the last decade; however, there is no single, formally agreed definition of cloud computing, as many different definitions exist.[1] NIST defines cloud computing as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.[2] The term "cloud" is historically a metaphor for the internet, depicting the network of connected computers.
Cloud computing encompasses a general shift of computer processing, storage and software delivery away from the desktop and local servers, across the network, and into next-generation data centers hosted by large infrastructure companies such as Amazon, Google, Yahoo and Microsoft.[3] The usefulness of this technology is that it provides IT services as a utility, analogous to electricity being provided as a utility. The advantage for businesses is that it frees corporations from large IT capital investments and enables them to connect to extremely powerful computing resources over the network.
Cloud computing has enjoyed considerable patronage over the past years despite security concerns over data privacy and management. The continual growth of this technology has brought with it massive data storage. These data stores need to be managed effectively and securely, and kept readily available to clients on demand. This article investigates the principles underlying the data management of cloud service providers, as well as the data management systems currently in use or being deployed. The focus is on the Google, Amazon, Microsoft and Yahoo cloud services.
DATA COLLECTION AND STORAGE
Cloud service providers typically collect large volumes of data from clients (individuals, corporations, institutions, etc.). These volumes of data are stored in what are known as cluster file systems, although there are different implementations of this form of storage.
The Google File System (GFS) is used to store very large data sets. It stores data in unit blocks of storage called chunks, each 64 MB in size; by allocating such large blocks, GFS is optimized for large data storage.[4] Bigtable, another of Google's storage methods, is implemented slightly differently: here, large volumes of data are stored in a tabular format similar to an RDBMS. Bigtable borrows the tabular concept of relational databases but adds functionality such as timestamps and column families. Bigtable is able to store petabytes of data across thousands of servers.[4]
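To make the Bigtable data model concrete, the toy Python sketch below models a row as a map from "family:qualifier" column keys to timestamped versions of a value. This is an illustration under stated assumptions, not Google's implementation; all class and key names are hypothetical.

    # A toy sketch of Bigtable's data model: rows map column keys
    # ("family:qualifier") to value versions ordered by timestamp.
    # Illustrative only, not Google's implementation.
    import time
    from collections import defaultdict

    class ToyBigtable:
        def __init__(self):
            # table[row_key]["family:qualifier"] -> list of (timestamp, value)
            self.table = defaultdict(lambda: defaultdict(list))

        def put(self, row_key, family, qualifier, value):
            cell = self.table[row_key][f"{family}:{qualifier}"]
            cell.append((time.time(), value))

        def get(self, row_key, family, qualifier):
            cell = self.table[row_key][f"{family}:{qualifier}"]
            # return the newest version, if any
            return max(cell, key=lambda tv: tv[0])[1] if cell else None

    t = ToyBigtable()
    t.put("com.example.www", "contents", "html", "<html>...</html>")
    print(t.get("com.example.www", "contents", "html"))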
Microsoft uses its Azure Services Platform, which is based on Azure's storage framework. It involves the storage of binary large objects (blobs), communication queues that provide access to the data via Azure applications, and a query language that can provide table-like structures. An Azure account holder can have one or more containers, where each container can hold one or more blobs. Each blob has a maximum size of 50 GB and can be subdivided into smaller blocks. To work with the blobs of data, entity and property hierarchies are provided through tables. These tables are not SQL-like relational tables and are not accessed using SQL; instead, access is provided via the Microsoft Language Integrated Query (LINQ) syntax.[4]
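The account, container and blob hierarchy described above can be illustrated with a small sketch. This is a hypothetical model of the hierarchy only, not the real Azure SDK; the block size chosen here is an assumption made purely for illustration.

    # A hypothetical sketch of the account -> container -> blob hierarchy
    # described above; not the real Azure SDK.
    MAX_BLOB_SIZE = 50 * 1024**3   # 50 GB limit per blob, as described
    BLOCK_SIZE = 4 * 1024**2       # assumed block size for illustration

    class Blob:
        def __init__(self, name, data):
            if len(data) > MAX_BLOB_SIZE:
                raise ValueError("blob exceeds the 50 GB limit")
            # a blob can be subdivided into smaller blocks
            self.name = name
            self.blocks = [data[i:i + BLOCK_SIZE]
                           for i in range(0, len(data), BLOCK_SIZE)]

    class Container:
        def __init__(self, name):
            self.name, self.blobs = name, {}

        def upload(self, blob_name, data):
            self.blobs[blob_name] = Blob(blob_name, data)

    # An account holder can own one or more containers.
    account = {"photos": Container("photos")}
    account["photos"].upload("cat.jpg", b"\xff\xd8...")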
Amazon has developed the Amazon Simple Storage Service (S3). S3 organizes its data, in this instance called objects, into units called buckets. Each object can hold up to 5 TB of data and is accompanied by up to 2 KB of metadata. Each bucket is owned by a client, and each object within it is identified by a unique, user-assigned key. Amazon S3 provides large quantities of reliable storage that is highly protected but to which the client has relatively low-bandwidth access.[5]
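As an illustration, a minimal sketch of storing and retrieving an S3 object is shown below using the boto3 SDK, which the article does not prescribe; the bucket name, key and metadata are hypothetical, and valid AWS credentials are assumed to be configured.

    # A minimal sketch of storing an object in an S3 bucket with boto3.
    # Bucket name, key and metadata below are hypothetical examples.
    import boto3

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="example-client-bucket",      # bucket owned by the client
        Key="reports/2013/q4.csv",           # unique, user-assigned key
        Body=b"col1,col2\n1,2\n",
        Metadata={"department": "finance"},  # up to 2 KB of metadata
    )
    obj = s3.get_object(Bucket="example-client-bucket",
                        Key="reports/2013/q4.csv")
    print(obj["Body"].read())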
Yahoo's PNUTS is a massively parallel and geographically distributed database system for Yahoo!'s web applications. PNUTS provides data storage organized as hashed or ordered relational tables. Data tables are horizontally partitioned into groups of records called tablets, which are scattered across many servers; each server might hold hundreds or thousands of tablets, but each tablet is stored on a single server within a region.[6][7]
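The idea of hashing records into tablets and placing each tablet on a single server can be sketched as follows. This is a simplified illustration in the spirit of PNUTS, not its actual partitioning scheme; the tablet count, server names and hash function are assumptions.

    # A simplified sketch of splitting a hashed table into tablets and
    # assigning each tablet to one server; all names are illustrative.
    import hashlib

    NUM_TABLETS = 8
    SERVERS = ["server-a", "server-b", "server-c"]

    def tablet_for(key: str) -> int:
        # horizontal partitioning: hash the record key into a tablet
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % NUM_TABLETS

    def server_for(tablet: int) -> str:
        # each tablet is stored on a single server within a region
        return SERVERS[tablet % len(SERVERS)]

    key = "user:alice"
    t = tablet_for(key)
    print(f"record {key!r} -> tablet {t} on {server_for(t)}")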
DATA PRIVACY CONCERNS
Data privacy is a key concern with cloud storage. Clouds typically store client data in several locations, often outside the geographical boundaries of the clients' own countries. Storing data in several locations has the advantage of backup retrieval should primary storage at one location fail. However, data located elsewhere is subject to the local data protection laws of the hosting country. For example, in the USA the PATRIOT Act allows the government to demand access to the data stored on any computer; if the data is being hosted by a third party, it must be handed over without the knowledge or permission of the company or person using the hosting service.[1] Unfortunately, most cloud providers give their clients little or no choice about where their data will be stored; Amazon S3, for instance, only permits its clients to choose between USA and EU data storage options. This leaves most clients of cloud storage exposed, unless the data is encrypted using a key not located at the host.
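One way to retain control is to encrypt data on the client side before upload, keeping the key away from the host. The minimal sketch below uses the third-party Python 'cryptography' package, which is an assumption of convenience rather than something the providers above mandate.

    # A minimal sketch of client-side encryption before upload, so the
    # key never resides with the host. Uses the third-party
    # 'cryptography' package (an assumed choice, not provider-mandated).
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # kept by the client, never uploaded
    f = Fernet(key)

    ciphertext = f.encrypt(b"confidential business records")
    # ...upload 'ciphertext' to the cloud provider...
    assert f.decrypt(ciphertext) == b"confidential business records"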


DATA AVAILABILITY AND FAULT TOLERANCE
A key management issue for cloud providers is the ability to make stored data available on demand whenever clients request it. These demands can be overwhelming, so the systems put in place to handle them must be robust and, most importantly, fault tolerant. As with the storage mechanisms, there are few differences among the service providers studied above: almost all use data replication to address availability and fault tolerance.
In Google's GFS, data chunks are replicated on several servers across many different locations with a replication factor of 3,[4] i.e. each chunk is stored on three different servers. This ensures that when one server fails, the other servers can still serve the request. Microsoft's Azure works in a similar way, replicating data three times in order to provide fault tolerance.[4]
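The effect of a replication factor of 3 can be sketched as follows; the server names are hypothetical and the placement policy is deliberately simplified compared with GFS's real rack-aware placement.

    # A toy sketch of replica placement with a replication factor of 3,
    # in the spirit of GFS chunk replication; server names hypothetical.
    import random

    REPLICATION_FACTOR = 3
    SERVERS = ["srv-1", "srv-2", "srv-3", "srv-4", "srv-5"]

    def place_replicas(chunk_id: str) -> list:
        # choose 3 distinct servers so one failure leaves 2 live copies
        return random.sample(SERVERS, REPLICATION_FACTOR)

    def read_chunk(replicas: list, failed: set) -> str:
        # any surviving replica can serve the request
        for server in replicas:
            if server not in failed:
                return server
        raise RuntimeError("all replicas lost")

    replicas = place_replicas("chunk-0042")
    print("serving from", read_chunk(replicas, failed={replicas[0]}))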
Amazon uses several platforms to address availability and fault tolerance besides replicating objects across many geographic regions. These platforms serve as options clients can use in addition to replication. An example is the Auto Scaling platform, which allows clients to automatically scale their Amazon Elastic Compute Cloud (EC2) capacity up or down: users define rules that determine when they need fewer or more server instances. Other platforms include Amazon Machine Images (AMI), Elastic Block Store (EBS), Elastic IP addresses, Elastic Load Balancing, Reserved Instances, etc.[8]
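The kind of rule such a service evaluates can be illustrated with a small sketch. This is a conceptual illustration of threshold-based scaling, not the AWS Auto Scaling API; the thresholds and instance limits are arbitrary assumptions.

    # A hypothetical sketch of a threshold-based scaling rule: scale out
    # when load is high, scale in when it is low. Conceptual only,
    # not the AWS Auto Scaling API; thresholds are assumptions.
    def desired_instances(current: int, cpu_utilisation: float,
                          lo: int = 2, hi: int = 20) -> int:
        if cpu_utilisation > 0.75:        # unexpected spike: add capacity
            return min(current * 2, hi)
        if cpu_utilisation < 0.25:        # idle: release capacity
            return max(current // 2, lo)
        return current

    print(desired_instances(current=4, cpu_utilisation=0.9))   # -> 8
    print(desired_instances(current=4, cpu_utilisation=0.1))   # -> 2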
Yahoo's PNUTS employs redundancy at multiple levels (data, metadata, serving components, etc.) and leverages its consistency model to support highly available reads and writes even after a failure or partition.[7]
DATA PROCESSING
Google's cloud services allow for data processing through the MapReduce model, which is built on the GFS platform.[4] MapReduce is a framework for processing parallelizable problems across huge data sets using a large number of computers (nodes). Computational processing can occur on data stored either in a file system (unstructured) or in a database (structured). MapReduce can take advantage of the locality of data, processing it on or near the storage assets to decrease the transmission of data. MapReduce enables software developers to write a program containing two simple functions, map and reduce.[4]
In the map step, the master node takes the input, divides it into smaller sub-problems and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.[9][10]

In the reduce step, the master node collects the answers to all the sub-problems and combines them in some way to form the output, which is the answer to the problem it was originally trying to solve.[9][10]

DATA SCALABILITY AND ELASTICITY
All the cloud providers investigated here claim their clouds are scalable and elastic. Scalability is a desirable property of a system, indicating its ability either to handle growing amounts of work or to improve throughput when additional resources are added.[11] Scaling is achieved in two main ways: the first uses a key-value store directly, while the other takes a conventional DBMS architecture and leverages key-value store concepts to make it highly scalable. In the key-value store abstraction, data is viewed as key-value pairs and atomic access is supported only at the granularity of single keys. This single-key atomic access semantics naturally allows efficient horizontal data partitioning and provides the basis for scalability and availability in these systems.[11]
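The single-key atomic access semantics can be sketched with a toy key-value store; here a single global lock stands in for the per-key concurrency control a real system would use.

    # A toy key-value store with atomic access at the granularity of a
    # single key. A single global lock stands in for the per-key
    # concurrency control a real distributed store would use.
    import threading

    class KeyValueStore:
        def __init__(self):
            self._data = {}
            self._lock = threading.Lock()

        def put(self, key, value):
            with self._lock:              # atomic single-key write
                self._data[key] = value

        def get(self, key):
            with self._lock:              # atomic single-key read
                return self._data.get(key)

    kv = KeyValueStore()
    kv.put("user:42", {"name": "Ada"})
    print(kv.get("user:42"))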
Compute power is elastic, but only when the workload is parallelizable. One of the advantages of cloud computing is its elasticity in the face of changing conditions; for example, unexpected spikes in demand are catered for by adding computational resources. Amazon Elastic Compute Cloud (EC2) works in this way.[1]
CONCLUSION
This article has investigated some areas of data management employed by four prominent cloud providers. These providers follow the same general principles but differ in their implementation approaches. Research into more robust data management techniques and algorithms is ongoing, with the aim of making cloud storage the preferred destination for IT data storage and processing. From this investigation it appears that current data management practices are meeting the demands of cloud users.
Cloud computing is set to keep growing in the coming years, and this requires service providers to continually improve existing data management systems, as well as to develop new ones, to meet the demands of the volumes of data referred to today as big data.




REFERENCES
[1] D. J. Abadi. Data Management in the Cloud: Limitations and Opportunities. http://www.cs.yale.edu/homes/dna/papers/abadi-cloud-ieee09.pdf
[2] http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf
[3] http://news.bbc.co.uk/1/hi/technology/7421099.stm
[4] http://www.nsa.gov/research/_files/publications/cloud_computing_overview.pdf
[5] B. Sosinsky. Cloud Computing Bible, Chapter 9: Using Amazon Web Services, 2011.
[6] http://libra.msra.cn/Publication/4439960/pnuts-yahoo-s-hosted-data-serving-platform
[7] http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf
[8] http://d36cz9buwru1tt.cloudfront.net/AWS_Building_Fault_Tolerant_Applications.pdf
[9] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
[10] http://en.wikipedia.org/wiki/MapReduce
[11] D. Agrawal, A. El Abbadi, S. Das and A. J. Elmore. Database Scalability, Elasticity, and Autonomy in the Cloud. http://cs.ucsb.edu/~aelmore/papers/dasfaa.pdf


LIST OF ABBREVIATIONS USED
1. DBMS: Database Management System
2. EU: European Union
3. GFS: Google File System
4. ICT: Information and Communication Technology
5. IT: Information Technology
6. NIST: National Institute of Standards and Technology
7. RDBMS: Relational Database Management System
8. SQL: Structured Query Language
9. USA: United States of America

