SECURE AUTHORIZED DE-DUPLICATION

CHAPTER-1
1. PRE-REQUISITES

1.1 What is cloud?


Cloud computing provides a means by which we can access applications as utilities over the Internet. It allows us to create, configure, and customize business applications online.

1.2 What is cloud computing?


Cloud computing refers to manipulating, configuring, and accessing applications online. It offers online data storage, infrastructure, and applications.
We need not install any software on our local PC; this is how cloud computing overcomes platform dependency issues. Hence, cloud computing makes our business applications mobile and collaborative.

FIG 1: ARCHITECTURE OF CLOUD COMPUTING


1.3 Basic Concepts:


There are certain services and models working behind the scenes that make cloud computing feasible and accessible to end users. The following are the working models for cloud computing:
Deployment Models
Service Models
1.3.1 Deployment Models:
Deployment models define the type of access to the cloud, i.e., how the cloud is located. A cloud can have any of four types of access: public, private, hybrid, and community.

FIG 2: DEPLOYMENT MODELS

(i) PUBLIC CLOUD:


The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services. In public clouds, resources are offered as a service, usually over an internet connection, for a pay-per-usage fee. Users can scale their use on demand and do not need to purchase hardware to use the service. Public cloud providers manage the infrastructure and pool resources into the capacity required by their users. A public cloud is hosted on the internet and designed to be used by any user with an internet connection, offering a similar range of capabilities and services to all of them. Public cloud users are typically residential users
and connect to the public internet through an internet service provider's network. Google, Amazon, and Microsoft are examples of public cloud vendors who offer their services to the general public. Data created and submitted by consumers is usually stored on the servers of the third-party vendor.
The advantages of public cloud include:
Data availability and continuous uptime
24/7 technical expertise
On demand scalability
Easy and inexpensive setup
No wasted resources

Drawbacks of public cloud:


Data Security
Privacy

Examples of public cloud include:

Amazon AWS

Google Apps

Salesforce.com

Microsoft BPOS

Microsoft Office 365

Public cloud computing represents a significant paradigm shift from the conventional norms
of an organizational data centre to a de-perimeterized infrastructure open to use by potential
adversaries. As with any emerging information technology area, cloud computing should be
approached carefully with due consideration to the sensitivity of data. Planning helps to ensure
that the computing environment is as secure as possible and in compliance with all relevant
organizational policies and that privacy is maintained. It also helps to ensure that the agency
derives full benefit from information technology spending.
Public cloud providers' default offerings generally do not reflect a specific organization's security and privacy needs. From a risk perspective, determining the suitability of cloud services requires an understanding of the context in which the organization operates and the
consequences from the plausible threats it faces. Adjustments to the cloud computing environment may be warranted to meet an organization's requirements. Organizations should
require that any selected public cloud computing solution is configured, deployed, and
managed to meet their security, privacy, and other requirements.

While one of the biggest obstacles facing public cloud computing is security, the cloud
computing paradigm provides opportunities for innovation in provisioning security services
that hold the prospect of improving the overall security of some organizations. The biggest
beneficiaries are likely to be smaller organizations that have limited numbers of information
technology administrators and security personnel, and can gain the economies of scale
available to larger organizations with sizeable data centers, by transitioning to a public cloud.

Non-negotiable service agreements in which the terms of service are prescribed completely by
the cloud provider are generally the norm in public cloud computing. Negotiated service
agreements are also possible. Similar to traditional information technology outsourcing
contracts used by agencies, negotiated agreements can address an organization's concerns
about security and privacy details, such as the vetting of employees, data ownership and exit
rights, breach notification, isolation of tenant applications, data encryption and segregation,
tracking and reporting service effectiveness, compliance with laws and regulations, and the use
of validated products meeting federal or national standards (e.g., Federal Information
Processing Standard 140). A negotiated agreement can also document the assurances the cloud
provider must furnish to corroborate that organizational requirements are being met.
Critical data and applications may require an agency to undertake a negotiated service
agreement in order to use a public cloud. Points of negotiation can negatively affect the
economies of scale that a non-negotiable service agreement brings to public cloud computing,
however, making a negotiated agreement less cost effective. As an alternative, the organization
may be able to employ compensating controls to work around identified shortcomings in the
public cloud service.
With the growing number of cloud providers and range of services from which to choose,
organizations must exercise due diligence when selecting and moving functions to the cloud.
Decision making about services and service arrangements entails striking a balance between
benefits in cost and productivity versus drawbacks in risk and liability. While the sensitivity of
data handled by government organizations and the current state of the art make the likelihood
of outsourcing all information technology services to a public cloud low, it should be possible
for most government organizations to deploy some of their information technology services to
a public cloud, provided that all requisite risk mitigations are taken.

Another issue with public cloud is that you may not know where your data is stored or how it is backed up, and whether unauthorized users can get access to it.
Reliability is another concern for public cloud networks. A recent two-day Amazon cloud outage, for example, left dozens of major e-commerce websites disabled or completely unavailable.
Public clouds are owned and operated by third-party service providers. Customers benefit
from economies of scale because infrastructure costs are spread across all users, thus
allowing each individual client to operate on a low-cost, pay-as-you-go model. Another
advantage of public cloud infrastructures is that they are typically larger in scale than an
in-house enterprise cloud, which provides clients with seamless, on-demand scalability.

It is also important to note that all customers on public clouds share the same infrastructure pool with limited configurations, security protections, and availability variances, as these factors are wholly managed and supported by the service provider.
The public cloud allows systems and services to be easily accessible to the general public. A public cloud may be less secure because of its openness, e.g., e-mail.

(ii) PRIVATE CLOUD:
The Private Cloud allows systems and services to be accessible within an
organization. It offers increased security because of its private nature.
Private clouds are those that are built exclusively for an individual enterprise. They allow
the firm to host applications in the cloud, while addressing concerns regarding data security
and control, which is often lacking in a public cloud environment. There are two variations
of private clouds:

On-Premise Private Cloud: This format, also known as an internal cloud, is hosted within an organization's own data center. It provides a more standardized process and protection, but is often limited in size and scalability. Also, a firm's IT department would incur the capital and operational costs for the physical resources with this model. On-premise private clouds are best used for applications that require complete control
and configurability.

Externally-Hosted Private Cloud: This private cloud model is hosted by an external cloud computing provider. The service provider facilitates an exclusive cloud
environment with full guarantee of privacy. This format is recommended for
organizations that prefer not to use a public cloud infrastructure due to the risks
associated with the sharing of physical resources.

The following graphic shows the difference between customer private clouds and provider
private clouds.

FIG 3: DIFFERENCE BETWEEN CUSTOMER PRIVATE AND PROVIDER


PRIVATE CLOUDS

The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise. The cloud infrastructure is accessed only by the members of the organization and/or by granted third parties. The purpose is not to offer cloud services to the general public, but to use them within the organization, for example an enterprise that wants to make consumer data available to its different stores. A private cloud is hosted in the data centre of a company and provides its services only to users inside that company or its partners. A private cloud provides more security than a public cloud, and cost savings in case it utilizes otherwise unused capacities in an already existing data centre. Making such unused capacities available through cloud interfaces allows the organization to utilize the same tools as when working with public clouds and to benefit from the capabilities inherent in cloud management software, such as a self-service interface, automated management of computing resources, and the ability to sell existing over-capacities to partner companies. The Aberdeen Group published a report which concludes that organizations operating private clouds typically have about a 12% cost advantage over organizations using public clouds. A private cloud thus has the potential to give the organization greater control over its data and infrastructure.

The major drawback of private cloud is its higher cost. When compared with the public cloud, the cost of purchasing equipment and software and of staffing often results in higher costs for an organization running its own private cloud.

(iii) COMMUNITY CLOUD:


The Community Cloud allows systems and services to be accessible by a group of organizations that share common concerns.
A community cloud falls between public and private clouds with respect to the target set of
consumers. It is somewhat similar to a private cloud, but the infrastructure and
computational resources are exclusive to two or more organizations that have common
privacy, security, and regulatory considerations, rather than a single organization. The
community cloud aspires to combine distributed resource provision from grid computing,
distributed control from digital ecosystems and sustainability from green computing, with
the use cases of cloud computing, while making greater use of self-management advances
from autonomic computing. It aims to replace vendor clouds by shaping the under-utilized resources of user machines into a community cloud, with nodes potentially fulfilling all roles: consumer, producer, and, most importantly, coordinator.
The advantages of community cloud include:
Cost of setting up a communal cloud versus individual private cloud can be cheaper
due to the division of costs among all participants.
Management of the community cloud can be outsourced to a cloud provider. The
advantage here is that the provider would be an impartial third party that is bound
by contract and that has no preference to any of the clients involved other than what
is contractually mandated.
Tools residing in the community cloud can be used to leverage the information
stored to serve consumers and the supply chain, such as return tracking and just-in-
time production and distribution.
Drawbacks of community cloud:
Costs are higher than for a public cloud.
A fixed amount of bandwidth and data storage is shared among all community members.
The concept of community cloud is still in its infancy, but it is picking up rapidly among start-ups and small and medium-sized businesses.

(iv) HYBRID CLOUD:


The Hybrid Cloud is a mixture of public and private cloud. Critical activities are performed using the private cloud, while non-critical activities are performed using the public cloud.
Hybrid clouds are more complex than the other deployment models, since they involve a
composition of two or more clouds (private, community, or public). Each member remains
a unique entity, but is bound to others through standardized or proprietary technology that
enables application and data portability among them. A hybrid cloud is a composition of at
least one private cloud and at least one public cloud. A hybrid cloud is typically offered in
one of two ways: a vendor has a private cloud and forms a partnership with a public cloud
provider, or a public cloud provider forms a partnership with a vendor that provides private
cloud platforms. Hybrid cloud infrastructure is a composition of two or more clouds that
are unique entities, but at the same time are bound together by standardized or proprietary
technology that enables data and application portability. In hybrid cloud, an organization
provides and manages some resources in-house and has others provided externally. For example, organizations may keep their human resource (HR) and customer relationship management (CRM) data in a public cloud like Salesforce.com but keep confidential data in their own
private cloud. Ideally, the hybrid approach allows a business to take advantage of the
scalability and cost-effectiveness that a public cloud computing environment offers without
exposing mission-critical applications and data to third-party vulnerabilities. This type of
hybrid cloud is also referred to as hybrid IT. Hybrid clouds offer the cost and scale benefits
of public clouds, while also offering the security and control of private clouds.
The advantages of hybrid cloud include:
Reduces capital expenses, as part of the organization's infrastructure needs are outsourced to public cloud providers.
Improves resource allocation for temporary projects at a vastly reduced cost
because the use of public cloud removes the need for investments to carry out these
projects.
Helps optimize the infrastructure spending during different stages of the application
lifecycle. Public clouds can be tapped for development and testing while private
clouds can be used for production. More importantly, public clouds can be used to
retire applications, which may be no longer needed because of the move to SaaS, at
much lower costs than dedicated on premise infrastructure.
Offers both the controls available in a private cloud deployment along with the
ability to rapidly scale using public cloud.
Supplies support for cloud-bursting.
Provides drastic improvements in the overall organizational agility, because of the
ability to leverage public clouds, leading to increased opportunities.
Drawbacks of hybrid cloud are:
As a hybrid cloud extends the IT perimeter outside the organizational boundaries,
it opens up a larger surface area for attacks with a section of the hybrid cloud
infrastructure under the control of the service provider.
An easier approach to solving the identity needs of hybrid clouds is to extend the existing enterprise identity and access management to the public clouds. This opens up concerns about how this approach will affect the enterprise identity and about its impact on the organization's security.
When organizations manage complex hybrid cloud environments using a
management tool, either as a part of the cloud platform or as a third-party tool,
organizations should consider the security implications of using such a tool. For
example, the management tool should be able to handle the identity and enforce
security uniformly across hybrid cloud environments.
A hybrid cloud makes the data flow from a private environment to a public cloud
much easier. There are privacy and integrity concerns associated with such data
movement because the privacy controls in the public cloud environment vary
significantly from the private cloud.
There are risks associated with the security policies spanning the hybrid cloud
environment such as issues with how encryption keys are managed in a public cloud
compared to a pure private cloud environment.
Hybrid clouds offer a greater flexibility to businesses while offering choice in terms of keeping
control and security. Hybrid clouds are usually deployed by organizations willing to push part
of their workloads to public clouds for either cloud-bursting purposes or for projects requiring
faster implementation. Because hybrid clouds vary based on company needs and structure of
implementation, there is no one-size-fits-all solution. Since hybrid environments involve both
on-premise and public cloud providers, some additional infrastructure security considerations
come into the picture, which are normally associated with public clouds. Any businesses
planning to deploy hybrid clouds should understand the different security needs and follow the
industry best practices to mitigate any risks. Once secure, a hybrid cloud environment can help businesses transition more applications into public clouds, providing additional cost savings.

(v) DISTRIBUTED CLOUD:

A cloud computing platform can be assembled from a distributed set of machines in different
locations, connected to a single network or hub service. It is possible to distinguish between
two types of distributed clouds: public-resource computing and volunteer cloud.

Public-resource computing: This type of distributed cloud results from an expansive definition of cloud computing, because such platforms are more akin to distributed computing than to cloud computing. Nonetheless, it is considered a sub-class of cloud computing, and some examples include distributed computing platforms such as BOINC and Folding@Home.
Volunteer cloud: Volunteer cloud computing is characterized as the intersection of public-resource computing and cloud computing, where a cloud computing infrastructure is built using volunteered resources. Many challenges arise from this type of infrastructure, because of the volatility of the resources used to build it and the dynamic environment it operates in. It can also be called a peer-to-peer cloud, or an ad-hoc cloud. An interesting effort in this direction is Cloud@Home, which aims to implement a cloud computing infrastructure using volunteered resources, providing a business model to incentivize contributions through financial restitution.

(vi) INTER CLOUD:


The Inter cloud is an interconnected global "cloud of clouds" and an extension of the Internet
"network of networks" on which it is based. The focus is on direct interoperability between
public cloud service providers, more so than between providers and consumers (as is the case
for hybrid- and multi-cloud).

(vii) MULTI CLOUD:


Multi cloud is the use of multiple cloud computing services in a single heterogeneous
architecture to reduce reliance on single vendors, increase flexibility through choice, mitigate
against disasters, etc. It differs from hybrid cloud in that it refers to multiple cloud services,
rather than multiple deployment modes (public, private, legacy).

(viii) NESTED CLOUDS:


Several companies have become market leaders in the area of public cloud services. Cloud
services such as Amazon (AWS) or Google App Engine are the de-facto standards for cloud
hosting. These providers went well beyond simply building the largest public clouds in the
world; they also succeeded in defining the gigantic cloud eco-systems that have become
platforms for other enterprise clouds. Interestingly, many companies have built their own
clouds within major public clouds. These organizations decided that building their cloud within
a third-party cloud provides more benefits relative to building their own. One such company is
Acquia. Acquia is a leading provider of online products and services to help companies build
and manage their websites based on the popular Drupal open-source social publishing platform.
Acquia also offers a cloud hosting platform that helps companies host their websites. Acquia uses Amazon Web Services to host both its own infrastructure and its customers' clouds. In February of 2011, the company had approximately 350 servers running in the AWS cloud. Acquia CTO Dries Buytaert has described this decision: "Acquia chose AWS because it was the fastest way to get a new hosting service to market. It also saves us the cost of adding staff specialists on networking and infrastructure build-out. Customers love our ability to quickly scale their sites using the elastic scalability of AWS and to quickly create clone sites for load testing."
Beyond Acquia, public clouds have become launch-pad platforms for many small companies
and start-ups. The ability to quickly take products to market without major up-front
infrastructure investment provides substantial benefits. Given today's infrastructure and
human-resources costs, early-stage technology companies are not able to start and sustain their
business with traditional in-house IT infrastructure.
Even for cloud service providers like Acquia, the ability to pay as you grow, to rapidly scale
infrastructure, and to quickly take products to market outweighs potential long-term cost-
saving benefits that a company could realize with their own cloud infrastructure.

1.3.2 Service Models:


Service models are the reference models on which cloud computing is based. These can be categorized into three basic service models, as listed below:
1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS)
3. Software as a Service (SaaS)

There are many other service models, all of which take the form of XaaS, i.e., Anything as a Service: Network as a Service, Business as a Service, Identity as a Service, Database as a Service, or Strategy as a Service. Infrastructure as a Service (IaaS) is the most basic level of service. Each of the service models makes use of the underlying service model, i.e., each inherits the security and management mechanisms from the underlying model, as shown in the following diagram.
Cloud service models describe how cloud services are made available to clients. These service models may have synergies between each other and be interdependent; for example, PaaS is dependent on IaaS because application platforms require physical infrastructure.
The main difference between SaaS and PaaS is that PaaS normally represents a platform for
application development, while SaaS provides online applications that are already developed.

FIG.4- SERVICE MODELS


Cloud Clients:
Users access cloud computing using networked client devices, such as desktop computers, laptops, tablets, smartphones, and any Ethernet-enabled device such as home automation gadgets. Some of these devices (cloud clients) rely on cloud computing for all or a majority of their applications, so as to be essentially useless without it. Examples are thin clients and the
browser-based Chromebook. Many cloud applications do not require specific software on the
client and instead use a web browser to interact with the cloud application. With Ajax and
HTML5 these Web user interfaces can achieve a similar, or even better, look and feel to native
applications. Some cloud applications, however, support specific client software dedicated to
these applications (e.g., virtual desktop clients and most email clients). Some legacy
applications (line of business applications that until now have been prevalent in thin client
computing) are delivered via a screen-sharing technology.

(i) Infrastructure as a service (IaaS)

According to the Internet Engineering Task Force (IETF), the most basic cloud-service model is that of providers offering computing infrastructure (virtual machines and other resources) as a service to subscribers. Infrastructure as a service (IaaS) refers to online services that abstract the user from the details of infrastructure such as physical computing resources, location, data partitioning, scaling, security, backup, etc. A hypervisor, such as Oracle VirtualBox,
Oracle VM, KVM, VMware ESX, or Hyper-V, runs the virtual machines as guests. Pools of
hypervisors within the cloud operational system can support large numbers of virtual machines
and the ability to scale services up and down according to customers' varying requirements.
Linux containers run in isolated partitions of a single Linux kernel running directly on the
physical hardware. Linux cgroups and namespaces are the underlying Linux kernel
technologies used to isolate, secure and manage the containers. Containerisation offers higher
performance than virtualization, because there is no hypervisor overhead. Also, container
capacity auto-scales dynamically with computing load, which eliminates the problem of over-
provisioning and enables usage-based billing. IaaS clouds often offer additional resources such
as a virtual-machine disk-image library, raw block storage, file or object storage, firewalls, load
balancers, IP addresses, virtual local area networks (VLANs), and software bundles.

IaaS-cloud providers supply these resources on-demand from their large pools of equipment
installed in data-centers. For wide-area connectivity, customers can use either the Internet or
carrier clouds (dedicated virtual private networks). To deploy their applications, cloud users
install operating-system images and their application software on the cloud infrastructure. In
this model, the cloud user patches and maintains the operating systems and the application
software. Cloud providers typically bill IaaS services on a utility computing basis: cost reflects
the amount of resources allocated and consumed.

(ii) Platform as a service (PaaS)

PaaS vendors offer a development environment to application developers. The provider typically develops toolkits and standards for development and channels for distribution and
payment. In the PaaS models, cloud providers deliver a computing platform, typically including
operating system, programming-language execution environment, database, and web server.

Application developers can develop and run their software solutions on a cloud platform
without the cost and complexity of buying and managing the underlying hardware and software
layers. With some PaaS offers like Microsoft Azure and Google App Engine, the underlying
computer and storage resources scale automatically to match application demand so that the
cloud user does not have to allocate resources manually. The latter capability has also been proposed in an architecture aiming to facilitate real-time applications in cloud environments. Even more specific
application types can be provided via PaaS, such as media encoding as provided by services
like bitcodin.com or media.io.

Some integration and data management providers have also embraced specialized applications
of PaaS as delivery models for data solutions. Examples include iPaaS (Integration Platform
as a Service) and dPaaS (Data Platform as a Service). iPaaS enables customers to develop,
execute and govern integration flows. Under the iPaaS integration model, customers drive the
development and deployment of integrations without installing or managing any hardware or
middleware. dPaaS delivers integration and data-management products as a fully managed
service. Under the dPaaS model, the PaaS provider, not the customer, manages the
development and execution of data solutions by building tailored data applications for the
customer. dPaaS users retain transparency and control over data through data-visualization
tools. Platform as a Service (PaaS) consumers do not manage or control the underlying cloud
infrastructure including network, servers, operating systems, or storage, but have control over
the deployed applications and possibly configuration settings for the application-hosting
environment.

A recent specialized PaaS is Blockchain as a Service (BaaS), which some vendors such as Microsoft Azure have already included in their PaaS offerings.

(iii) Software as a service (SaaS)

In the software as a service (SaaS) model, users gain access to application software and
databases. Cloud providers manage the infrastructure and platforms that run the applications.
SaaS is sometimes referred to as "on-demand software" and is usually priced on a pay-per-use
basis or using a subscription fee. In the SaaS model, cloud providers install and operate
application software in the cloud and cloud users access the software from cloud clients. Cloud
users do not manage the cloud infrastructure and platform where the application runs. This
eliminates the need to install and run the application on the cloud user's own computers, which
simplifies maintenance and support. Cloud applications differ from other applications in their scalability, which can be achieved by cloning tasks onto multiple virtual machines at run-time
to meet changing work demand. Load balancers distribute the work over the set of virtual
machines. This process is transparent to the cloud user, who sees only a single access-point.
To accommodate a large number of cloud users, cloud applications can be multitenant, meaning
that any machine may serve more than one cloud-user organization.

The pricing model for SaaS applications is typically a monthly or yearly flat fee per user, so
prices become scalable and adjustable if users are added or removed at any point. Proponents
claim that SaaS gives a business the potential to reduce IT operational costs by outsourcing
hardware and software maintenance and support to the cloud provider. This enables the
business to reallocate IT operations costs away from hardware/software spending and from
personnel expenses, towards meeting other goals. In addition, with applications hosted
centrally, updates can be released without the need for users to install new software. One
drawback of SaaS comes with storing the users' data on the cloud provider's server. As a result,
there could be unauthorized access to the data. For this reason, users are increasingly adopting
intelligent third-party key-management systems to help secure their data.

1.3.3 Cloud engineering:

It is the application of engineering disciplines to cloud computing. It brings a systematic approach to the high-level concerns of commercialization, standardization, and governance in
conceiving, developing, operating and maintaining cloud computing systems. It is a
multidisciplinary method encompassing contributions from diverse areas such as systems,
software, web, performance, information, security, platform, risk and quality engineering.

Cloud computing is still as much a research topic, as it is a market offering. What is clear
through the evolution of cloud computing services is that the chief technical officer (CTO) is a
major driving force behind cloud adoption. The major cloud technology developers continue
to invest billions a year in cloud R&D; for example: in 2011 Microsoft committed 90% of its
US$9.6bn R&D budget to its cloud. Centaur Partners also predict that SaaS revenue will grow
from US$13.5B in 2011 to $32.8B in 2016. This expansion also includes Finance and
Accounting SaaS. Additionally, more industries are turning to cloud technology as an efficient
way to improve quality services due to its capabilities to reduce overhead costs, downtime, and
automate infrastructure deployment.

CHAPTER-2
2.INTRODUCTION
Long gone are the days when software installers came on 3.5-inch disks and CDs could be considered a corporate backup medium. Storage has been firmly in the realm of commodities for years now, and as a result the amount of data within businesses is somewhere between "staggering" and "are you serious?". There are many, many problems associated with having mass amounts of data to steward. Two of the most obvious and painful are:
1) How to make room for the continued growth of data
2) How to back up all of the data.

2.1 DE-DUPLICATION:
File systems often contain redundant copies of information: identical files or sub-file regions,
possibly stored on a single host, on a shared storage cluster, or backed-up to secondary storage.
De-duplicating storage systems take advantage of this redundancy to reduce the underlying
space needed to contain the file systems (or backup images thereof). Deduplication can work
at either the sub-file or whole-file level. More fine-grained deduplication creates more
opportunities for space savings, but necessarily reduces the sequential layout of some files,
which may have significant performance impacts when hard disks are used for storage (and in
some cases it is necessary to follow complicated techniques to improve performance).
Alternatively, whole file deduplication is simpler and eliminates file fragmentation concerns,
though at the cost of some otherwise reclaimable storage.
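
To make the whole-file versus block-level trade-off concrete, the following sketch (not part of the study; the 64 KB block size and function names are illustrative) estimates how many bytes each approach would need to store for a set of files.

    import hashlib
    import os

    BLOCK_SIZE = 64 * 1024  # illustrative fixed block size


    def file_hash(path):
        """Hash an entire file; identical files can share one stored copy."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(BLOCK_SIZE), b""):
                h.update(block)
        return h.digest()


    def block_hashes(path):
        """Hash each fixed-size block; identical blocks can share one stored copy."""
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(BLOCK_SIZE), b""):
                yield hashlib.sha256(block).digest(), len(block)


    def dedup_comparison(paths):
        """Return (total bytes, bytes kept by whole-file dedup, bytes kept by block dedup)."""
        total = 0
        unique_files = {}   # file hash -> file size (stored once)
        unique_blocks = {}  # block hash -> block size (stored once)
        for path in paths:
            size = os.path.getsize(path)
            total += size
            unique_files.setdefault(file_hash(path), size)
            for digest, length in block_hashes(path):
                unique_blocks.setdefault(digest, length)
        return total, sum(unique_files.values()), sum(unique_blocks.values())

Block-level deduplication never keeps more bytes than whole-file deduplication at the same block size, and the gap between the two numbers is exactly what the study quantifies.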

Because the disk technology trend is toward improved sequential bandwidth and reduced per-byte cost with little or no improvement in random access speed, it is not clear that trading away sequentiality for space savings makes sense, at least in primary storage.
In order to evaluate the tradeoff in space savings between whole-file and block-based
deduplication, we conducted a large-scale study of file system contents on desktop Windows
machines at Microsoft. Our study consists of 857 file systems spanning 162 terabytes of disk
over 4 weeks. It includes results from a broad cross-section of employees, including software
developers, testers, management, sales & marketing, technical support, documentation writers
and legal staff. We find that while block-based deduplication of our dataset can lower storage
consumption to as little as 32% of its original requirements, nearly three quarters of the
improvement observed could be captured through whole-file deduplication and sparseness. For
four weeks of full backups, whole file deduplication (where a new backup image contains a
reference to a duplicate file in an old backup) achieves 87% of the savings of block-based. We
also explore the parameter space for deduplication systems, and quantify the relative benefits
of sparse file support. Our study of file content is larger and more detailed than any previously
published effort, which promises to inform the design of space efficient storage systems.

In addition, we have conducted a study of metadata and data layout, as the last similar study
is now 4 years old. We find that the previously observed trend toward storage being consumed
by files of increasing size continues unabated; half of all bytes are in files larger than 30MB
(this figure was 2MB in 2000). Complicating matters, these files are in opaque unstructured
formats with complicated access patterns. At the same time there are increasingly many small
files in an increasingly complex file system tree. Contrary to previous work, we find that file-
level fragmentation is not widespread, presumably due to regularly scheduled background
defragmenting in Windows and the finding that a large portion of files are rarely modified. For
more than a decade, file system designers have been warned against measuring only fresh file
system installations, since aged systems can have a significantly different performance profile.
Our results show that this concern may no longer be relevant, at least to the extent that the
aging produces file-level fragmentation. Ninety-six percent of files observed are entirely linear
in the block address space. To our knowledge, this is the first large scale study of disk
fragmentation in the wild.

2.1.1 Duplication methodology:


Potential participants were selected randomly from Microsoft employees. Each was contacted with an offer to install a file system scanner on their work computer(s) in exchange for a chance to win a prize. The scanner ran autonomously during off hours once per week from September 18 to October 16, 2009. We contacted 10,500 people in this manner to reach the target study size of about 1,000 users. This represents a participation rate of roughly 10%, which is smaller than the rates of 22% in similar prior studies. Anecdotally, many potential participants declined explicitly because the scanning process was quite invasive.

2.1.2 File system Scanner


The scanner first took a consistent snapshot of fixed device (non-removable) file
systems with the Volume Shadow Copy Service (VSS). VSS snapshots are both file
system and application consistent. It then recorded metadata about the file system itself,
including age, capacity, and space utilization. The scanner next processed each file in
the snapshot, writing records to a log. It recorded Windows file metadata, including
path, file name and extension, time stamps, and the file attribute flags. It recorded any
retrieval and allocation pointers, which describe fragmentation and sparseness
respectively. It also recorded information about the whole system, including the
computer's hardware and software configuration and the time at which the
defragmentation tool was last run, which is available in the Windows registry. We took
care to exclude from study the page file, hibernation file, the scanner itself, and the VSS
snapshots it created.
During the scan, we recorded the contents of each file first by breaking the file into
chunks using each of two chunking algorithms (fixed block and Rabin fingerprinting)
with each of 4 chunk size settings (8K- 64K in powers of two) and then computed and
saved hashes of each chunk. We found whole file duplicates in post-processing by identifying files in which all chunks matched. (Application consistent means that VSS-aware applications have an opportunity to save their state cleanly before the snapshot is taken.) In addition to reading the ordinary contents of files, we also collected
a separate set of scans where the files were read using the Win32 Backup Read API,
which includes metadata about the file and would likely be the format used to store file
system backups.

We used salted MD5 as our hash algorithm, but truncated the result to 48 bits in order
to reduce the size of the data set. The Rabin-chunked data with an 8K target chunk size
had the largest number of unique hashes, somewhat more than 768M. We expect that
about two thousand of those (0.0003%) are false matches due to the truncated hash.
Another process copied the log files to our server at midnight on a random night of the
week to help smooth the considerable network traffic. Nevertheless, the copying
process resulted in the loss of some of the scans. Because the scanner placed the results
for each of the 32 parameter settings into separate files and the copying process worked
at the file level, for some file systems we have results for some, but not all of the
parameter settings. In particular, larger scan files tended to be partially copied more
frequently than smaller ones, which may result in a bias in our data where larger file
systems are more likely to be excluded. Similarly, scans with a smaller chunk size
parameter resulted in larger size scan files and so were lost at a higher rate.
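
A minimal sketch of the per-file recording step described above: fixed-size chunking at each of the four chunk sizes, with every chunk hashed using salted MD5 truncated to 48 bits. The salt value and the record format are assumptions for illustration; the study also used Rabin chunking, which is sketched in Section 2.2.1 below.

    import hashlib

    CHUNK_SIZES = [8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024]  # 8K to 64K in powers of two
    SALT = b"example-salt"  # illustrative; the actual salt used in the study is not given here


    def truncated_salted_md5(chunk):
        """Salted MD5 reduced to 48 bits (6 bytes) to shrink the recorded data set."""
        return hashlib.md5(SALT + chunk).digest()[:6]


    def scan_file(path):
        """Yield (chunk_size, chunk_index, 48-bit hash) records for one file."""
        for chunk_size in CHUNK_SIZES:
            with open(path, "rb") as f:
                index = 0
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk_size, index, truncated_salted_md5(chunk)
                    index += 1

Whole-file duplicates can then be found in post-processing by grouping files whose full sequence of chunk hashes matches, as described above.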

2.1.3 Post processing:


At the completion of the study the resulting data set was 4.12 terabytes compressed,
which would have required considerable machine time to import into a database. As an
optimization, we observed that the actual value of any unique hash (i.e., hashes of
content that was not duplicated) was not useful to our analyses.

To find these unique hashes quickly we used a novel two-pass algorithm. During the first pass we created a 2 GB Bloom filter [4] of each hash observed. During this pass, if we
tried to insert a value that was already in the Bloom filter, we inserted it into a second
Bloom filter of equal size. We then made a second pass through the logs, comparing
each hash to the second Bloom filter only. If it was not found in the second filter, we
were certain that the hash had been seen exactly once and could be omitted from the
database. If it was in the filter, we concluded that either the hash value had been seen
more than once, or that its entry in the filter was a collision. We recorded all of these
values to the database. Thus this algorithm was sound, in that it did not impact the
results by rejecting any duplicate hashes.
However, while very effective, it was not complete, in that some non-duplicate hashes may have been added to the database even though they were not useful in the analysis. The inclusion of these hashes did not affect our results, as the later processing ignored them.
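
A sketch of the two-pass idea under simplified assumptions (small in-memory Bloom filters rather than the 2 GB filters used in the study, and an illustrative choice of bit count and probe count): the first filter records every hash seen, the second records hashes possibly seen at least twice, and the second pass keeps only hashes that may be duplicates.

    import hashlib


    class BloomFilter:
        """Simple Bloom filter: probe positions are derived from one SHA-256 digest."""

        def __init__(self, num_bits=8_000_000, num_probes=5):
            self.num_bits = num_bits
            self.num_probes = num_probes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item):
            digest = hashlib.sha256(item).digest()
            for i in range(self.num_probes):
                # Use 4 bytes of the digest per probe position.
                yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

        def add(self, item):
            """Set the item's bits; return True if all of them were already set."""
            possibly_seen = True
            for pos in self._positions(item):
                byte, bit = divmod(pos, 8)
                if not self.bits[byte] & (1 << bit):
                    possibly_seen = False
                    self.bits[byte] |= 1 << bit
            return possibly_seen

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))


    def duplicate_candidates(hash_log):
        """Two passes over a list of hashes; keep only hashes that may repeat."""
        first, second = BloomFilter(), BloomFilter()
        for h in hash_log:          # pass 1: anything possibly seen before goes to filter 2
            if first.add(h):
                second.add(h)
        return [h for h in hash_log if h in second]   # pass 2: drop definitely-unique hashes

As in the study, this is sound (no true duplicate is ever dropped) but not complete (a few unique hashes may survive because of Bloom filter collisions).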

2.1.4 Biases and Sources of Error:


The use of Windows workstations in this study is beneficial in that the results can be
compared to those of similar studies. However, as in all data sets, this choice may introduce
biases towards certain types of activities or data. For example, corporate policies
surrounding the use of external software and libraries could have impacted our results.

As discussed above, the data retrieved from machines under observation was large and
expensive to generate and so resulted in network timeouts at our server or aborted scans on
the client side. While we took measures to limit these effects, nevertheless some amount of
data never made it to the server, and more had to be discarded as incomplete records. Our
use of VSS makes it possible for a user to selectively remove some portions of their file
system from our study. We discovered a rare concurrency bug in the scanning tool affecting
0.003% of files. While this likely did not affect results, we removed all files with this
artifact.
Our scanner was unable to read the contents of Windows system restore points, though it
could see the file metadata. We excluded these files from the deduplication analyses, but
included them in the metadata analyses.

2.2 REDUNDANCY IN FILE CONTENTS


Despite the significant declines in storage costs per GB, many organizations have seen
dramatic increases in total storage system costs. There is considerable interest in
reducing these costs, which has given rise to deduplication techniques, both in the
academic community and as commercial offerings. Initially, the interest in
deduplication has centered on its use in embarrassingly compressible scenarios, such
as regular full backups or virtual desktops. However, some have also suggested that
deduplication be used more widely on general purpose data sets. The rest of this section
seeks to provide a well-founded measure of duplication rates and compare the efficacy
of different parameters and methods of deduplication. Below, we first provide a brief summary of deduplication and discuss the performance challenges deduplication introduces, then share observed duplication rates across a set of workstations, and finally measure duplication in the more conventional backup scenario.

2.2.1 Background on Deduplication


De-duplication systems decrease storage consumption by identifying distinct chunks
of data with identical content. They then store a single copy of the chunk along with
metadata about how to reconstruct the original files from the chunks.

Chunks may be of a predefined size and alignment, but are more commonly of variable
size determined by the content itself. The canonical algorithm for variable sized
content-defined blocks is Rabin Fingerprints. By deciding chunk boundaries based on
content, files that contain identical content that is shifted (say because of insertions or
deletions) will still result in (some) identical chunks. Rabin-based algorithms are
typically configured with a minimum and maximum chunk size, as well as an expected
chunk size. In all our experiments, we set the minimum and maximum parameters to
4K and 128K, respectively, while we varied the expected chunk size from 8K to 64K by powers of two.
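
A simplified content-defined chunker in the spirit of Rabin-based chunking, using the same minimum and maximum bounds as the study. It uses a basic polynomial rolling hash over a fixed window rather than true Rabin fingerprints, so it is a sketch of the idea, not the study's implementation; the window width and hash constants are illustrative.

    MIN_CHUNK = 4 * 1024          # minimum chunk size used in the study
    MAX_CHUNK = 128 * 1024        # maximum chunk size used in the study
    EXPECTED_CHUNK = 8 * 1024     # the study varied this from 8K to 64K
    WINDOW = 48                   # rolling-hash window width in bytes (illustrative)
    BASE = 257
    MOD = (1 << 31) - 1
    TOP = pow(BASE, WINDOW - 1, MOD)  # coefficient of the byte leaving the window


    def content_defined_chunks(data):
        """Split data at positions chosen by the content itself, within size bounds."""
        chunks = []
        start = 0
        rolling = 0
        for i, byte in enumerate(data):
            if i - start >= WINDOW:
                # Slide the window: drop the oldest byte before adding the new one.
                rolling = (rolling - data[i - WINDOW] * TOP) % MOD
            rolling = (rolling * BASE + byte) % MOD
            length = i - start + 1
            at_boundary = rolling % EXPECTED_CHUNK == EXPECTED_CHUNK - 1
            if (length >= MIN_CHUNK and at_boundary) or length >= MAX_CHUNK:
                chunks.append(bytes(data[start:i + 1]))
                start, rolling = i + 1, 0
        if start < len(data):
            chunks.append(bytes(data[start:]))
        return chunks

Because boundaries depend on the content, inserting or deleting bytes near the start of a file shifts the data without moving most of the later chunk boundaries, so most chunks still hash to the same values.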

2.2.2 The Performance Impacts of De-Duplication:


Managing the overheads introduced by a deduplication system is challenging. Naively,
each chunk's fingerprint needs to be compared to that of all other chunks. While
techniques such as caches and Bloom filters can mitigate overheads, the performance
of deduplication systems remains a topic of research interest. The I/O system also poses
a performance challenge. In addition to the layer of indirection required by
deduplication, deduplication has the effect of de-linearizing data placement, which is
at odds with many data placement optimizations, particularly on hard-disk based
storage where the cost for non-sequential access can be orders of magnitude greater
than sequential.

FIG-5: PERFORMANCE IMPACTS OF DE-DUPLICATION

Other, more established techniques to reduce storage consumption are simpler and have a smaller performance impact. Sparse file support exists in many file systems, including NTFS and XFS, and is relatively simple to implement. In a sparse file, a chunk of zeros is not physically stored; its existence is instead recorded in the file metadata. Whole file deduplication systems, such as the Windows SIS facility, operate by finding entire files that are duplicates and replacing them with copy-on-write links. Although SIS does not reduce storage consumption as much as a modern deduplication system, it avoids file allocation concerns and is far less computationally expensive than more exhaustive deduplication.
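
A minimal sketch of whole-file deduplication in the spirit of SIS, under the simplifying assumption that duplicates can be replaced by ordinary hard links; SIS itself uses copy-on-write links, which plain hard links do not provide.

    import hashlib
    import os


    def whole_file_dedup(root):
        """Replace byte-identical regular files under root with hard links to one copy."""
        first_seen = {}  # content hash -> path of the retained copy
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if not os.path.isfile(path) or os.path.islink(path):
                    continue
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(1 << 20), b""):
                        h.update(block)
                digest = h.digest()
                if digest in first_seen:
                    os.remove(path)                    # drop the duplicate copy...
                    os.link(first_seen[digest], path)  # ...and link it to the original
                else:
                    first_seen[digest] = path

With hard links, a write through any one name changes all of them, which is exactly why SIS uses copy-on-write links instead; the sketch only illustrates how whole-file duplicates are found and collapsed.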

2.2.3 De-duplication in Primary Storage:


Our data set includes hashes of data in both variable
and fixed size chunks, and of varying sizes. We chose a single week (September 18,
2009) from this dataset and compared the size of all unique chunks to the total
consumption observed. We had two parameters that we could vary: the deduplication
algorithm/parameters and the set of file systems (called the deduplication domain)
within which we found duplicates; duplicates in separate domains were considered to
be unique contents. The set of file systems included corresponds to the size of the file
server(s) holding the machines' file systems. A value of 1 indicates deduplication
running independently on each desktop machine. Whole Set means that all 857 file
systems are stored together in a single deduplication domain. We considered all power
of-two domain sizes between 1 and 857. For domain sizes other than 1 or 857, we had
to choose which file systems to include together into particular domains and which to
exclude when the number of file systems didn't divide evenly by the size of the domain.
We did this by using a cryptographically secure random number generator. We
generated sets for each domain size ten times and report the mean of the ten runs. The
standard deviation of the results was less than 2% for each of the data points, so we
don't believe that we would have gained much more precision by running more trials.

Rather than presenting a three-dimensional graph varying both parameters, we show two slices through the surface. In both cases, the y-axis shows the de-duplicated file system size as a percentage of the original file system size. Figure 1 shows the effect of the chunk size parameter for the fixed and Rabin-chunked algorithms, and also for the whole file algorithm (which doesn't depend on chunk size, and so varies only slightly due to differences in the number of zeroes found and due to variations in which file systems' scans copied properly). This graph assumes that all file systems are in a single deduplication domain; the shape of the curve is similar for smaller domains, though the space savings are reduced.

Figure 2 shows the effect of changing the size of the deduplication domains. Space
reclaimed improves roughly linearly in the log of the number of file systems in a
domain. Comparing single file systems to the whole set, the effect of grouping file
systems together is larger than that from the choice of chunking algorithm or chunk
size, or even of switching from whole file chunking to block-based. The most
aggressive chunking algorithm (8K Rabin) reclaimed between 18% and 20% more of
the total file size than did whole file deduplication. This offers weak support for block-
level deduplication in primary storage. The 8K fixed block algorithm reclaimed
between 10% and 11% more space than whole file. This capacity savings represents a
small gain compared to the performance and complexity of introducing advanced
deduplication features, especially ones with dynamically variable block sizes like Rabin
fingerprinting. Table 1 shows the top 15 file extensions contributing to duplicate
content for whole file duplicates, the percentage of duplicate space attributed to files of
that type, and the mean file size for each type. It was calculated using all of the file
systems in a single deduplication domain. The extension marked is a particular globally
unique ID that's associated with a widely distributed software patch. This table shows
that the savings due to whole file duplicates are concentrated in files containing
program binaries: dll, lib, pdb, exe, cab, msp, and msi together make up 58% of the
saved space. Figure 3 shows the CDF of the bytes reclaimed by whole file deduplication
and the CDF of all bytes, both by containing file size. It shows that duplicate bytes tend
to be in smaller files than bytes in general. Another way of looking at this is that the
very large file types (virtual hard disks, database stores, etc.) tend not to have whole-
file copies. This is confirmed by Table 1. Table 2 shows the amount of duplicate content
not in files with whole-file duplicates by file extension as a fraction of the total file
system content. It considers the whole set of file systems as a single deduplication
domain, and presents results with an 8K block size using both fixed and Rabin
chunking. For both algorithms, by far the largest source of duplicate data is VHD
(virtual hard drive) files. Because these files are essentially disk images, it's not
surprising both that they contain duplicate data and also that they rarely have whole-
file duplicates. The next four file types are all compiler outputs. We speculate that they
generate block-aligned duplication because they have header fields that contain, for
example, timestamps, but that their contents are otherwise deterministic in the code being
compiled. Rabin chunking may find blocks of code (or symbols) that move somewhat
in the file due to code changes that affect the length of previous parts of the file.

2.2.4: De-duplication in Backup storage:


Much of the literature on deduplication to date has relied on workloads consisting of
daily full backups. Certainly these workloads represent the most attractive scenario for
deduplication, because the content of file systems does not change rapidly. Our data set
did not allow us to consider daily backups, so we considered only weekly ones.
With frequent and persistent backups, the size of historical data will quickly out-pace
that of the running system. Furthermore, performance in secondary storage is less
critical than in that of primary, so the reduced sequentiality of a block-level de-
duplicated store is of lesser concern. We considered the 483 file systems for which four
continuous weeks of complete scans were available, starting with September 18, 2009,
the week used for the rest of the analyses. Our backup analysis considers each file
system as a separate deduplication domain. We expect that combining multiple backups
into larger domains would have a similar effect to doing the same thing for primary
storage, but we did not run the analysis due to resource constraints.
In practice, some backup solutions are incremental (or differential), storing deltas
between files, while others use full backups. Often, highly reliable backup policies use
a mix of both, performing frequent incremental backups, with occasional full backups
to limit the potential for loss due to corruption. Thus, the meaning of whole-file
deduplication in a backup store is not immediately obvious. We ran the analysis as if
the backups were stored as simple copies of the original file systems, except that the
contents of the files were the output from the Win32 BackupRead call, which includes
some file metadata along with the data. For our purposes, imagine that the backup
format finds whole file duplicates and stores pointers to them in the backup file. This
would result in a garbage collection problem for the backup files when they're deleted,
but the details of that are beyond the scope of our study and are likely to be simpler
than a block-level de-duplicating store.
Using the Rabin chunking algorithm with an 8K expected chunk size, block-level
deduplication reclaimed 83% of the total space. Whole file deduplication, on the other
hand, yielded 72%. These numbers, of course, are highly sensitive to the number of weeks of scans used in the study; it's no accident that the results were around three quarters of the space being reclaimed when there were four weeks of backups. However, one should not
assume that because 72% of the space was reclaimed by whole file deduplication that
only 3% of the bytes were in files that changed. The amount of change was larger than
that, but the de-duplicator found redundancy within a week as well and the two effects
offset.
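
A back-of-the-envelope check of the "around three quarters" observation, using the 83% and 72% figures from the paragraph above:

    # Idealized ceiling: W weekly full backups of a file system that never changes.
    # Only one copy of each file must be kept, so (W - 1) / W of the backup bytes
    # are reclaimable by whole-file deduplication across backups.
    def max_whole_file_reclaim(weeks):
        return (weeks - 1) / weeks

    assert max_whole_file_reclaim(4) == 0.75  # 75% ceiling for four weekly backups

    # Observed: 72% for whole-file deduplication (close to, but under, the ceiling
    # because some files change from week to week), and 83% for 8K Rabin block-level
    # deduplication, which exceeds the ceiling because it also finds redundancy
    # within each week's backup.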

CHAPTER-3
3.LITERATURE SURVEY
This chapter gives a clear explanation of the background work studied for this project.

3.1 DESCRIPTION
In computing, data de-duplication is a specialized data compression technique for eliminating
duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent
(data) compression and single-instance (data) storage. This technique is used to improve
storage utilization and can also be applied to network data transfers to reduce the number of
bytes that must be sent. In the de-duplication process, unique chunks of data, or byte patterns,
are identified and stored during a process of analysis. As the analysis continues, other chunks
are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced
with a small reference that points to the stored chunk. Given that the same byte pattern may
occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the
chunk size), the amount of data that must be stored or transferred can be greatly reduced.
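
A minimal in-memory sketch of the process just described (the class and method names are made up for illustration and are not tied to any particular product): each chunk is hashed, each unique chunk is stored once, and a file becomes an ordered list of references to stored chunks.

    import hashlib


    class DedupStore:
        """Store each unique chunk once; files become lists of chunk references."""

        def __init__(self, chunk_size=8 * 1024):
            self.chunk_size = chunk_size
            self.chunks = {}  # chunk hash -> chunk bytes (the single stored instance)
            self.files = {}   # file name -> ordered list of chunk hashes

        def put(self, name, data):
            refs = []
            for start in range(0, len(data), self.chunk_size):
                chunk = data[start:start + self.chunk_size]
                digest = hashlib.sha256(chunk).hexdigest()
                self.chunks.setdefault(digest, chunk)  # a repeated chunk is stored only once
                refs.append(digest)
            self.files[name] = refs

        def get(self, name):
            """Reassemble a file from its chunk references."""
            return b"".join(self.chunks[d] for d in self.files[name])


    store = DedupStore()
    store.put("a.txt", b"hello world" * 4096)
    store.put("b.txt", b"hello world" * 4096)  # identical content adds no new chunks
    assert store.get("b.txt") == b"hello world" * 4096

The second file consumes only a small amount of reference metadata, which is the source of the space savings described above.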

3.2 HYBRID CLOUD APPROACH:


A hybrid cloud is a combined form of private and public clouds in which some critical data resides in the enterprise's private cloud while other data is stored in and accessible from a public cloud. Hybrid clouds seek to deliver the advantages of scalability, reliability, rapid deployment, and potential cost savings of public clouds together with the security and the increased control and management of private clouds. As cloud computing becomes more popular, an increasing amount of data is being stored in the cloud and used by users with specified privileges, which define the access rights of the stored data.

FIG-6: Architecture of Cloud


The critical challenge of cloud storage and cloud computing is the management of the
continuously increasing volume of data. Data de-duplication, or single instancing, essentially
refers to the elimination of redundant data. In the de-duplication process, duplicate data is
deleted, leaving only one copy (a single instance) of the data to be stored; however, an index of
all data is still retained should that data ever be required. In general, data de-duplication
eliminates the duplicate copies of repeating data.
Data is encrypted before being outsourced to the cloud or the network. This encryption requires
additional time and space to encode the data, and for large data stores the encryption
becomes even more complex and critical. By using data de-duplication inside a hybrid
cloud, the encryption becomes simpler. A network consists of an abundant amount of data,
which is shared by the users and nodes in the network, and many large-scale networks use a
data cloud to store and share their data. A node or user in the network has full rights to upload
or download data over the network. However, different users often upload the same data,
which creates duplicates inside the cloud. When a user retrieves or downloads such data, the
cloud has to handle two encrypted files of the same data and perform the same operations on
both copies. As a result, the data confidentiality and the security of the cloud are weakened,
and an unnecessary burden is placed on the operation of the cloud.

FIG 7: Architecture of Hybrid cloud


To avoid this duplication of data and to maintain confidentiality in the cloud, we use the
concept of a hybrid cloud, a combination of public and private clouds. Hybrid cloud storage
combines the advantages of scalability, reliability, rapid deployment and potential cost savings
of public cloud storage with the security and full control of private cloud storage.
Cloud computing provides seemingly unlimited virtualized resources to users as services across
the whole internet while hiding platform and implementation details, and cloud services offer
highly available storage and massively parallel computing resources at low cost.


As cloud computing spreads across the world, an ever larger amount of data is stored in clouds and shared
by users with specified rights, which define the access rights to the stored data. One of the
critical challenges of cloud storage services is the management of this growing volume of data, and
de-duplication is one of the best techniques for making data management in cloud computing
scalable; it has attracted more and more attention recently. De-duplication is used in data
storage to reduce the number of data copies; it is a data compression technique that
improves storage utilization and can also be applied to network data transfer to reduce the number
of bytes that must be sent. De-duplication eliminates redundant data by keeping only one
physical copy and referring other redundant data to that copy. De-duplication can take place at
either the file level or the block level. File-level de-duplication eliminates duplicate copies of the
same file, while block-level de-duplication eliminates duplicate blocks of data that occur in
non-identical files. The detailed system architecture is shown in the figure.
Although data de-duplication brings a lot of advantages, security and privacy concerns arise as
users' sensitive data are susceptible to both insider and outsider attacks. Traditional encryption,
while providing data confidentiality, is incompatible with data de-duplication: it requires
different users to encrypt their data with their own keys, so identical copies belonging to
different users lead to different cipher texts, making de-duplication impossible. Convergent
encryption has been proposed to enforce data confidentiality while making de-duplication
feasible. It encrypts and decrypts a data copy with a convergent key, which is obtained by
computing the cryptographic hash value of the content of the data copy. After key generation
and data encryption, users retain the keys and send the cipher text to the cloud. Since the
encryption operation is deterministic and the key is derived from the data content, identical
data copies generate the same convergent key and hence the same cipher text. A secure
proof-of-ownership protocol is also required to prove that the user indeed owns the file; this
prevents unauthorized access when a duplicate of the same file is found. After the proof is
submitted, the server provides a pointer to the user holding the subsequent copy, so that the
same file does not need to be uploaded again. The encrypted file can be downloaded by the
user and decrypted by the corresponding data owners with their convergent keys. Thus,
convergent encryption allows the cloud to perform de-duplication on the cipher texts, and the
proof of ownership prevents unauthorized users from accessing the file.
Previous de-duplication systems cannot support differential authorized duplicate checks. In an
authorized de-duplication system, each user is issued a set of privileges


during system initialization. Each file uploaded to the cloud is also bounded by a set of
privileges that specify which kind of users are allowed to perform the duplicate check and
access the file. Before submitting a duplicate-check request for a file, the user has to take the
file and his own privileges as inputs. The user finds a duplicate of the file only if a copy of the
file and a matched privilege are stored in the cloud.

Previous de-duplication systems cannot support differential authorization duplicate checks,
which are important in many applications. In such an authorized de-duplication system, each
user is issued a set of privileges during system initialization.
The overview of cloud de-duplication is as follows:

3.3 Post-Process De-duplication:


With post-process de-duplication, new data is first stored on the storage device and then a
process at a later time will analyze the data looking for duplication. The benefit is that there is
no need to wait for the hash calculations and lookup to be completed before storing the data
thereby ensuring that store performance is not degraded. Implementations offering policy-
based operation can give users the ability to defer optimization on "active" files, or to process
files based on type and location. One potential drawback is that duplicate data may be stored
unnecessarily for a short time, which is an issue if the storage system is near full capacity.
3.4 In-Line De-duplication:
This is the process where the de-duplication hash calculations are performed on the target device
as the data enters the device in real time. If the device spots a block that it has already stored on
the system, it does not store the new block but simply references the existing block. The benefit of
in-line de-duplication over post-process de-duplication is that it requires less storage, as data is
never duplicated. On the negative side, it is frequently argued that because the hash calculations and
lookups take time, data ingestion can be slower, thereby reducing the
backup throughput of the device. However, certain vendors with in-line de-duplication have
demonstrated equipment with similar performance to their post-process de-duplication
counterparts. Post-process and in-line de-duplication methods are often heavily debated.
3.5 Source Versus Target De-duplication:
Another way to think about data de-duplication is by where it occurs. When the de-duplication
occurs close to where data is created, it is often referred to as "source de-duplication." When it
occurs near where the data is stored, it is commonly called "target de-duplication." Source de-


duplication ensures that data on the data source is de-duplicated. This generally takes place
directly within a file system. The file system periodically scans new files, creating hashes,
and compares them to the hashes of existing files. When files with the same hash are found, the
file copy is removed and the new file points to the old file. Unlike hard links, however,
duplicated files are considered separate entities, and if one of the duplicated files is later
modified, a copy of that file or changed block is created using a mechanism called copy-on-write.
The de-duplication process is transparent to users and backup applications.
Backing up a de-duplicated file system will often cause duplication to occur, resulting in the
backups being bigger than the source data. Target de-duplication is the process of removing
duplicates of data in the secondary store. Generally this will be a backup store such as a data
repository or a virtual tape library.

One of the most common forms of data de-duplication implementations works by comparing
chunks of data to detect duplicates. For that to happen, each chunk of data is assigned an
identification, calculated by the software typically using cryptographic hash functions. In many
implementations, the assumption is made that if the identification is identical then the data is
identical, even though this cannot be true in all cases due to the pigeonhole principle; other
implementations do not assume that two blocks of data with the same identifier are identical,
but actually verify that data with the same identification is identical. If the software either
assumes that a given identification already exists in the de-duplication namespace or actually
verifies the identity of the two blocks of data, depending on the implementation, then it will
replace that duplicate chunk with a link. Once the data has been de-duplicated, upon read back
of the file, wherever a link is found, the system simply replaces that link with the referenced
data chunk. The de-duplication process is intended to be transparent to end users and
applications.
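
To make this mechanism concrete, the following is a minimal sketch of a hash-indexed chunk store in Java. It uses fixed-size chunks and SHA-256 identifiers purely for illustration; the class name, chunk size and helper methods are our own choices, and a production de-duplicator would typically use content-defined chunking (e.g. Rabin fingerprinting) and verify chunk contents rather than trusting the hash alone.

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkDedupStore
{
    private static final int CHUNK_SIZE = 8; // tiny for illustration; real systems use KB-sized chunks
    private final Map<String, byte[]> chunkStore = new HashMap<String, byte[]>(); // hash -> unique chunk

    // Split the data into fixed-size chunks and return the list of chunk references (hashes).
    public List<String> write(byte[] data) throws Exception
    {
        List<String> refs = new ArrayList<String>();
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (int off = 0; off < data.length; off += CHUNK_SIZE)
        {
            int len = Math.min(CHUNK_SIZE, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            String id = toHex(sha.digest(chunk)); // chunk identifier
            chunkStore.putIfAbsent(id, chunk);    // store only the first copy of each chunk
            refs.add(id);                         // duplicates become small references
        }
        return refs;
    }

    // Reassemble the original data by replacing each reference with the stored chunk.
    public byte[] read(List<String> refs) throws Exception
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String id : refs)
        {
            out.write(chunkStore.get(id));
        }
        return out.toByteArray();
    }

    private static String toHex(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
        {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception
    {
        ChunkDedupStore store = new ChunkDedupStore();
        byte[] data = "ABCDEFGHABCDEFGHABCDEFGH".getBytes("UTF-8");
        List<String> refs = store.write(data);
        System.out.println(refs.size() + " chunk references, " + store.chunkStore.size() + " unique chunks stored");
        System.out.println(new String(store.read(refs), "UTF-8"));
    }
}

Running the example shows that the 24-byte input produces three chunk references but only one stored chunk, since all three 8-byte chunks are identical.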
In archival storage systems there is a huge amount of duplicate or redundant data, which
occupies significant extra equipment and power, lowers resource utilization (of network
bandwidth and storage, for example) and imposes an extra burden on management as the
scale increases. Data de-duplication, the goal of which is to minimize duplicate data at the
inter-file level, has therefore been receiving broad attention in both academia and industry in
recent years. Semantic data de-duplication (SDD) has been proposed; it makes use of the
semantic information in the I/O path of archival files (such as file type, file format, application
hints and system metadata) to direct the division of a file into semantic chunks (SCs). While
the main goal of SDD is to maximally reduce inter-file-level


duplication, directly storing variable-sized SCs on disk results in a lot of fragments and
involves a high percentage of random disk accesses, which is very inefficient. An efficient
data storage scheme is therefore also designed and implemented: SCs are further packaged into
fixed-size objects, which are the actual storage units on the storage devices, so as to speed up
I/O performance as well as ease data management. Preliminary experiments have demonstrated
that SDD can further reduce the storage space compared with current methods. With the advent
of cloud computing, secure data de-duplication has recently attracted much attention from the
research community. Yuan et al. proposed a de-duplication system in cloud storage to
reduce the storage size of the tags used for integrity checking. To enhance the security of de-duplication
and protect data confidentiality, Bellare et al. showed how to protect data confidentiality
by transforming a predictable message into an unpredictable message. In their system, another
third party called a key server is introduced to generate the file tag for the duplicate check. Stanek
et al. presented a novel encryption scheme that provides the essential security for popular data
and unpopular data. For popular data that are not particularly sensitive, the traditional
conventional encryption is performed.
Another two-layered encryption scheme with stronger security, while still supporting de-duplication,
is proposed for unpopular data. In this way, they achieve a better tradeoff between the efficiency
and security of the outsourced data. Li et al. addressed the key management issue in block-
level de-duplication by distributing the convergent keys across multiple servers after encrypting the files.
One of the main problems in cloud computing is de-duplication with differential privileges, and the
main aim of this work is to solve it. For this we adopt a different type of architecture,
containing a public cloud and a private cloud, i.e., a hybrid cloud architecture. The private
cloud is the main part: it is involved as a proxy that allows data owners/users to securely
perform the de-duplication check with differential privileges. The data owners/users only
outsource their data storage to the public cloud, while data operations are managed in the
private cloud. A new de-duplication system supporting a differential duplicate check is
proposed under this hybrid cloud architecture. Only a user with the corresponding privileges
is allowed to perform de-duplication on the marked files. We enhance the security of our
system further: specifically, we present an advanced scheme that supports stronger security by
encrypting the file with differential privilege keys. Without the privilege keys the duplicate
check cannot be performed, and such unauthorized users cannot decrypt the data even if they
collude with the S-CSP. Security analysis demonstrates that our system is secure in terms of
the definitions specified in this model.


Table 3.1: Notations used in this paper

Acronym      Description
S-CSP        Storage cloud service provider
PoW          Proof of ownership
pkU, skU     User's public and secret key pair
kF           Convergent encryption key for file F
PU           Privilege set of a user U
PF           Specified privilege set of a file F

3.6 PRELIMINARIES:
In this section we define the notation used in this paper and review the secure primitives used
in our secure de-duplication.
Symmetric Encryption:
Symmetric encryption uses a common secret key k to encrypt and decrypt information. A symmetric
encryption scheme consists of three primitive functions:
a. KeyGenSE(1^λ) → k is the key generation algorithm that generates the key k using the security parameter 1^λ.
b. EncSE(k, M) → C is the symmetric encryption algorithm that takes the secret key k and the message
M and outputs the cipher text C.
c. DecSE(k, C) → M is the symmetric decryption algorithm that takes the secret key k and the cipher
text C and outputs the original message M.

3.7 Convergent Encryption:

Convergent encryption provides data confidentiality in de-duplication. A data owner derives a
convergent key from each original data copy and encrypts the data copy with the convergent
key. The owner also derives a tag for the data copy; the tag is used to detect duplicates. If two
data copies are the same, then their tags are the same. To check for duplicates, the user first
sends the tag to the server side, and the server replies indicating whether an identical copy has
already been stored. The convergent key and the tag are independently derived, and the tag
cannot be used to deduce the convergent key or compromise data confidentiality.


Both the tag and its encrypted data copy are stored on the server side. A convergent encryption
scheme can be defined with four primitive functions:

KeyGenCE(M) → K is the key generation algorithm that maps a data copy M to a
convergent key K;
EncCE(K, M) → C is the symmetric encryption algorithm that takes both the
convergent key K and the data copy M as inputs and outputs the cipher text C;
DecCE(K, C) → M is the decryption algorithm that takes both the cipher text C and the
convergent key K as inputs and outputs the original data copy M;
TagGen(M) → T(M) is the tag generation algorithm that maps the original data copy
M to a tag T(M).
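
A minimal sketch of these four primitives in Java is shown below, assuming (as in the rest of this report) that the convergent key is derived from a hash of the content. The use of SHA-1 truncated to an AES-128 key, AES/CBC with a fixed IV for deterministic encryption, and a tag computed as the hash of the cipher text are illustrative choices, not the exact construction of the scheme.

import java.security.MessageDigest;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class ConvergentEncryption
{
    // KeyGenCE(M) -> K : derive the key from the content itself (SHA-1, truncated to 16 bytes for AES-128).
    public static byte[] keyGen(byte[] data) throws Exception
    {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(data);
        return Arrays.copyOf(digest, 16);
    }

    // EncCE(K, M) -> C : deterministic AES so that identical plaintexts give identical cipher texts.
    public static byte[] encrypt(byte[] key, byte[] data) throws Exception
    {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(new byte[16]));
        return cipher.doFinal(data);
    }

    // DecCE(K, C) -> M
    public static byte[] decrypt(byte[] key, byte[] cipherText) throws Exception
    {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(new byte[16]));
        return cipher.doFinal(cipherText);
    }

    // TagGen(M) -> T(M) : here the tag is the hash of the cipher text, so it reveals
    // neither the plaintext nor the convergent key.
    public static byte[] tagGen(byte[] data) throws Exception
    {
        return MessageDigest.getInstance("SHA-1").digest(encrypt(keyGen(data), data));
    }

    public static void main(String[] args) throws Exception
    {
        byte[] copy1 = "same content".getBytes("UTF-8");
        byte[] copy2 = "same content".getBytes("UTF-8");
        System.out.println(Arrays.equals(tagGen(copy1), tagGen(copy2)));   // true: duplicate detected
        System.out.println(Arrays.equals(copy1, decrypt(keyGen(copy1), encrypt(keyGen(copy1), copy1)))); // true
    }
}

Two users holding the same file derive the same key, cipher text and tag, which is exactly what allows the S-CSP to de-duplicate encrypted data.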

3.8 Proof of ownership:

Proof of ownership (PoW) enables users to prove their ownership of data copies to the storage
server. PoW is implemented as an interactive algorithm run by a prover and a verifier. From a
data copy M, the verifier derives a short value φ(M). To prove ownership of the data copy M,
the prover needs to send a value φ' to the verifier such that φ' = φ(M). The formal security
definition for PoW roughly follows the threat model in a content distribution network, where
an attacker does not know the entire file, but has accomplices who have the file. The
accomplices follow the bounded retrieval model: they can help the attacker obtain the file,
subject to the constraint that they must send fewer bits than the initial min-entropy of the file
to the attacker.
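
The sketch below illustrates only the interface described above, with the storage server (verifier), which already holds the file, hashing it together with a fresh challenge and comparing the prover's response. The class and method names are our own, and real PoW protocols (for example Merkle-tree based ones) are interactive and considerably more involved.

import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Arrays;

public class NaiveProofOfOwnership
{
    // Short value phi(M) bound to a fresh challenge; only a party holding all of M can compute it.
    public static byte[] phi(byte[] data, byte[] challenge) throws Exception
    {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(challenge);
        return sha.digest(data);
    }

    public static void main(String[] args) throws Exception
    {
        byte[] file = "the outsourced file".getBytes("UTF-8");

        // Verifier (storage server already holding the file) issues a challenge and computes the expected value.
        byte[] challenge = new byte[16];
        new SecureRandom().nextBytes(challenge);
        byte[] expected = phi(file, challenge);

        // Prover answers the same challenge from its own copy of the file.
        byte[] response = phi(file, challenge);
        System.out.println("ownership accepted: " + Arrays.equals(expected, response));
    }
}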
Identification protocol:
An identification protocol can be described with two phases: Proof and Verify. In the Proof
stage, a user U demonstrates his identity to a verifier by performing some identification proof
related to his identity. The input of the prover/user is his private, sensitive information, such
as the private key of a public key in his certificate, a credit card number, etc., which cannot be
shared with others. The verifier performs the verification with the input of the public
information related to that private information. At the end of the protocol, the verifier outputs
either accept or reject to denote whether the proof has passed. There are different types of
identification protocols, such as certificate-based and identity-based identification.


CHAPTER-4

4. ALGORITHMS & IMPLEMENTATIONS


4.1 Alpha Numeric Unique ID Generation

CODE:
import java.util.UUID;
public class UniqueIDTest
{
public static void main(String[] args)
{
UUID uniqueKey = UUID.randomUUID();
System.out.println (uniqueKey);
}
}

OUTPUT:

Each time the program is executed, it prints a new unique ID.


4.3 SHA-1 ALGORITHM


SHA-1 Algorithm Description:
In the proposed system, the convergent key for each file is generated using the Secure Hash
Algorithm 1 (SHA-1). The steps of this algorithm are given below.

Step 1: Padding
Pad the message with a single '1' bit followed by '0' bits until the message length is congruent
to 448 modulo 512 bits.
Append the length of the original message, in bits, as an unsigned 64-bit integer.
Step 2: Initialize the five hash words (h0, h1, h2, h3, h4) to the specific constants defined in the
SHA-1 standard.
Step 3: Hash (for each 512-bit block)
Allocate an 80-word array for the message schedule.
Set the first 16 words to the 512-bit block split into 16 words.
Generate each remaining word as word[i-3] XOR word[i-8] XOR word[i-14] XOR word[i-16], rotated 1 bit to the left.
Step 4: Loop 80 times, doing the following:
Calculate the SHA round function f and the constant K (both depend on the current round
number).
temp = (a rotated left 5) + f + e + K + word[i]
e = d
d = c
c = b (rotated left 30)
b = a
a = temp
After processing the block, add a, b, c, d and e to the hash words h0..h4.
Step 5: Output the concatenation (h0, h1, h2, h3, h4), which is the message digest.
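
In practice these steps do not have to be coded by hand: the JDK's MessageDigest class already implements SHA-1, so a convergent key or file tag can be derived as sketched below (the class name, buffer size and hex formatting are illustrative choices).

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Sha1FileDigest
{
    public static String sha1Hex(String path) throws Exception
    {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(Paths.get(path)))
        {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1)
            {
                sha1.update(buffer, 0, n); // padding and block processing happen inside the provider
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest())
        {
            hex.append(String.format("%02x", b)); // 160-bit digest printed as 40 hex characters
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception
    {
        System.out.println(sha1Hex(args[0]));
    }
}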


CHAPTER-5
5. SIMULATION MECHANISM
Recently, cloud computing emerged as the leading technology for delivering reliable, secure,
fault-tolerant, sustainable, and scalable computational services, which are presented as
Software, Infrastructure, or Platform as services (SaaS, IaaS, PaaS). Moreover, these services
may be offered in private data centers (private clouds), may be commercially offered for
clients (public clouds), or yet it is possible that both public and private clouds are combined
in hybrid clouds.

This already wide ecosystem of cloud architectures, along with the increasing demand for
energy-efficient IT technologies, demands timely, repeatable, and controllable methodologies
for the evaluation of algorithms, applications, and policies before the actual development of
cloud products. Because utilization of real testbeds limits experiments to the scale of the
testbed and makes the reproduction of results an extremely difficult undertaking, alternative
approaches for testing and experimentation are needed to support the development of new cloud technologies.

5.1 Introduction to Cloud Sim:

Cloud Sim provides a generalized and extensible simulation framework that enables seamless
modelling and simulation of application performance. By using Cloud Sim, developers can focus on
specific system design issues that they want to investigate, without getting concerned about
details related to cloud-based infrastructures and services.

Advances in computing have opened up many possibilities. Hitherto, the main concern of
application developers was the deployment and hosting of applications, keeping in mind the
acquisition of resources with a fixed capacity to handle the expected traffic due to the demand
for the application, as well as the installation, configuration and maintenance of the whole
supporting stack. With the advent of the cloud, application deployment and hosting has become
flexible, easier and less costly because of the pay-per-use chargeback model offered by cloud
service providers.

Cloud computing is a best-fit for applications where users have heterogeneous, dynamic, and
competing quality of service (QoS) requirements. Different applications have different


performance levels, workloads and dynamic application scaling requirements, but these
characteristics, service models and deployment models create a vague situation when we use
the cloud to host applications. The cloud creates complex provisioning, deployment, and
configuration requirements.

5.2 Why simulation is important for the cloud environment?

In the public cloud, tenants have control over the OS, storage and deployed applications.
Resources are provisioned in different geographic regions. In the public cloud deployment
model, the performance of an application deployed in multiple regions is a matter of concern
for organisations. Proof of concepts in the public cloud environment give a better
understanding, but cost a lot in terms of capacity building and resource usage even in the pay-
per-use model.

Cloud Sim, a toolkit for the modelling and simulation of cloud computing
environments, comes to the rescue. It provides system and behavioural modelling of cloud
computing components. Simulation of cloud environments and applications to evaluate
performance can provide useful insights to explore such dynamic, massively distributed, and
scalable environments.

The principal advantages of simulation are:

Flexibility of defining configurations


Ease of use and customisation
Cost benefits: First designing, developing, testing, and then redesigning, rebuilding,
and retesting any application on the cloud can be expensive. Simulations take the
building and rebuilding phase out of the loop by using the model already created in the
design phase.
Cloud Sim is a toolkit for modelling and simulating cloud environments and for assessing
resource provisioning algorithms.

5.2.1 FEATURES:

1. Support for modeling and simulation of large-scale computing environments.


2. A self-contained platform for modeling clouds, service brokers, provisioning and
allocation policies.
3. Support for simulation of network connections among the simulated system elements.


4. Facility for simulation of a federated cloud environment that inter-networks resources
from both private and public domains.
5. Availability of a virtualization engine that aids in the creation and management of
multiple independent and co-hosted virtual services on a data center node.
6. Flexibility to switch between space-shared and time-shared allocation of processing
cores to virtualized services.

5.3 GETTING AWARE OF Cloud-sim:


Cloud-sim is a simulation tool that allows cloud developers to test the performance of their
provisioning policies in a repeatable and controllable environment, free of cost. It helps tune
the bottlenecks before real-world deployment. It is a simulator; hence, it doesn't run any actual
software. It can be defined as running a model of an environment in a model of hardware,
where technology-specific details are abstracted.

Cloud-sim is a library for the simulation of cloud scenarios. It provides essential classes for
describing data centres, computational resources, virtual machines, applications, users, and
policies for the management of various parts of the system such as scheduling and provisioning.
Using these components, it is easy to evaluate new strategies governing the use of clouds, while
considering policies, scheduling algorithms, load balancing policies, etc. It can also be used to
assess the competence of strategies from various perspectives such as cost, application
execution time, etc. It also supports the evaluation of Green IT policies. It can be used as a
building block for a simulated cloud environment and can add new policies for scheduling,
load balancing and new scenarios. It is flexible enough to be used as a library that allows you
to add a desired scenario by writing a Java program.

By using Cloud-sim, organisations, R&D centres and industry-based developers can test the
performance of a newly developed application in a controlled and easy to set-up environment.

The Cloud-sim layer provides support for modelling and simulation of cloud environments
including dedicated management interfaces for memory, storage, bandwidth and VMs. It also
provisions hosts to VMs, manages application execution and monitors the dynamic system
state. A cloud service provider can implement customised strategies at this layer to study
the efficiency of different policies in VM provisioning.

The user code layer exposes basic entities such as the number of machines, their specifications,
etc, as well as applications, VMs, number of users, application types and scheduling policies.


The main components of the Cloud-sim framework


Regions: It models geographical regions in which cloud service providers allocate resources
to their customers. In cloud analysis, there are six regions that correspond to six continents in
the world.
Data centres: It models the infrastructure services provided by various cloud service
providers. It encapsulates a set of computing hosts or servers that are either heterogeneous or
homogeneous in nature, based on their hardware configurations.

Data centre characteristics: It models information regarding data centre resource


configurations.
Hosts: It models physical resources (compute or storage).
The user base: It models a group of users considered as a single unit in the simulation, and
its main responsibility is to generate traffic for the simulation.
Cloudlet: It specifies the set of user requests. It contains the application ID, name of the user
base that is the originator to which the responses have to be routed back, as well as the size of
the request execution commands, and input and output files. It models the cloud-based
application services. Cloud-sim categorises the complexity of an application in terms of its
computational requirements. Each application service has a pre-assigned instruction length
and data transfer overhead that it needs to carry out during its life cycle.
Service broker: The service broker decides which data centre should be selected to provide
the services to the requests from the user base.
VMM allocation policy: It models provisioning policies on how to allocate VMs to hosts.
VM scheduler: It models the time or space shared, scheduling a policy to allocate processor
cores to VMs.

5.4 Cloud Sim Package:

The Cloud-sim Project is structured as java custom packages, where each package will contain
the correlated classes in it. This project has in twelve packages as follows:

1) org.cloudbus.cloudsim: This package contains classes that once instantiated will behave
like some component in the system or support a specific component of the system for
producing its relevant behavior during the simulation process. This package can be broadly
categorized into two sections:


a) Simulating component classes: These are the set of classes that imitate a particular
part of the cloud setup. The classes that come under this category are: Cloudlet,
Datacenter, DatacenterBroker, Host, Storage, HarddriveStorage, SanStorage, Pe,
DataCloudTags and Vm.
b) Policy classes: These are the set of classes that imitate the policy behavior of a cloud
component. The classes that come under this category are: VmAllocationPolicy,
CloudletScheduler, VmScheduler and UtilizationModel, and each of these classes
has its variants implemented in this package.
2) org.cloudbus.cloudsim.core: This package contains the main classes of the project, which
are directly responsible for initiating (CloudInformationService.java, CloudSim.java),
starting (CloudSim.java), maintaining (CloudSim.java, SimEntity.java, SimEvent.java,
FutureQueue.java and DeferredQueue.java) and ending the simulation process (CloudSim.java).
This package also contains the class CloudSimTags.java, which holds all the event identifiers
that are implemented in the data center and data center broker classes.
3) org.cloudbus.cloudsim.core.predicates: The classes in this package are responsible for
selecting matching events from the deferred queue object during the simulation process,
so that each event is executed on the correct entity.
4) org.cloudbus.cloudsim.distributions: This package contains classes that implement predefined
network traffic distribution methods. Classes defined here include
ContinuousDistribution, ExponentialDistribution, etc.
5) org.cloudbus.cloudsim.lists: This package contains classes implementing predefined
operations on component lists during simulation. The list classes specified in this
package are CloudletList, HostList, PeList, ResCloudletList and VmList.
6) org.cloudbus.cloudsim.network: This package holds the classes that produce behavior
related to network packet routing.
7) org.cloudbus.cloudsim.network.datacenter: This package contains classes that produce
simulation behavior for geographically distributed data centers of cloud service providers.
8) org.cloudbus.cloudsim.power / org.cloudbus.cloudsim.power.lists /
org.cloudbus.cloudsim.power.models: In these packages the org.cloudbus.cloudsim package
classes are extended to produce behavior for power-aware components. These packages can
be used to implement green-computing related work.
9) org.cloudbus.cloudsim.provisioners: Classes in this package contain policies related to
the allocation of bandwidth, processing elements and RAM. The default policy implemented
here is a best-effort allocation policy.


10) org.cloudbus.cloudsim.util: Classes in this package provide basic utility operations,
such as mathematical operations related to cloud computing services or the calculation of
execution time during the simulation process.

5.5 Cloud-sim Working:


The 'examples' package provided in the examples folder of the Cloud-sim project follows some
standard steps to implement the specified configuration and start a simulation. To understand
the working of the Cloud-sim simulation framework, knowledge of these steps is a must. There
are eleven steps that are followed in each example, with some variation, specified as follows:

1. Set Number of users for current simulation. This user count is directly proportional to
number of brokers in current simulation.
2. Initialize the simulation, provided with current time, number of users and trace flag.
3. Create a Data center.
4. Create a Data center broker.
5. Create a Virtual Machine(s).
6. Submit Virtual Machine to Data center broker.
7. Create Cloudlet(s) by specifying their characteristics.
8. Submit Cloudlets to Data center broker.
9. Send call to Start Simulation.
10. Once no more event to execute, send call to Stop Simulation.
11. Finally print the final status of the Simulation.

5.6 Cloud-sim in Eclipse:

Cloud-sim is written in Java. The knowledge you need to use Cloud-sim is basic Java
programming and some basics about cloud computing. Knowledge of programming IDEs such
as Eclipse or NetBeans is also helpful. It is a library and, hence, Cloud-sim does not have to be
installed. Normally, you can unpack the downloaded package in any directory, add it to the
Java class path and it is ready to be used. Please verify whether Java is available on your system.

Step-1 Download Eclipse from the following link: http://www.eclipse.org/downloads/

Step-2 Extract Eclipse to a particular directory, let's say C:\eclipse


Step-3 Download Cloud-sim from the following link:

http://code.google.com/p/cloudsim/downloads/list

Step-4 Extract Cloud-sim to a particular directory, let's say C:\cloudsim-3.0.2

Step-5 Download Michael Thomas Flanagan's Java Scientific and Numerical Library from the
following link: http://www.ee.ucl.ac.uk/~mflanaga/java/

Step-6 Copy the flanagan.jar file into C:\cloudsim-3.0.2\jars\

Step-7 Open the Eclipse IDE. For that, go to C:\eclipse and open the eclipse application (blue
ball like icon)

Step-8 Select the workspace, where Eclipse stores your projects

Step-9 In the Eclipse IDE go to New -> Java Project


Step-10 Specify the project name as 'My Project', untick the 'use default location' option and
select the extracted 'My Project' folder. Click Finish. It might take some time to finish.

11. This step initialises the Cloud-sim library, as follows:

CloudSim.init(num_user, calendar, trace_flag);


12. Data centres are the resource providers in Cloud-sim; hence, creation of data centres is a
second step. To create Datacenter, you need the DatacenterCharacteristics object that stores the
properties of a data centre such as architecture, OS, list of machines, allocation policy that
covers time-shared or space-shared allocation, the time zone and its price:

Datacenter datacenter9883 = new Datacenter(name, characteristics, new


VmAllocationPolicySimple(hostList), storageList, 0);

13. This step is to create a broker:

DatacenterBroker broker = createBroker();

14. This step is to create one virtual machine, specifying the unique ID of the VM, the userId
(ID of the VM's owner), mips (processing capacity in million instructions per second), number
of PEs (number of CPU cores), amount of RAM, amount of bandwidth, amount of storage, the
virtual machine monitor, and the cloudletScheduler policy for cloudlets:

Vm vm = new Vm(vmid, brokerId, mips, pesNumber, ram, bw, size, vmm, new
CloudletSchedulerTimeShared());

15. Submit the VM list to the broker:

broker.submitVmList(vmlist);

16. Create a cloudlet with length, file size, output size, and utilisation model:

Cloudlet cloudlet = new Cloudlet(id, length, pesNumber, fileSize, outputSize,
utilizationModel, utilizationModel, utilizationModel);

17. Submit the cloudlet list to the broker:

broker.submitCloudletList(cloudletList);
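
Putting steps 11 to 17 together, the following is a minimal end-to-end sketch assuming the Cloud-sim 3.0.x API; the entity names, MIPS ratings, RAM, bandwidth and cloudlet length are arbitrary example values, and the structure follows the bundled CloudSim example programs rather than this project's own simulation.

import java.util.ArrayList;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;

import org.cloudbus.cloudsim.Cloudlet;
import org.cloudbus.cloudsim.CloudletSchedulerTimeShared;
import org.cloudbus.cloudsim.Datacenter;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.DatacenterCharacteristics;
import org.cloudbus.cloudsim.Host;
import org.cloudbus.cloudsim.Log;
import org.cloudbus.cloudsim.Pe;
import org.cloudbus.cloudsim.Storage;
import org.cloudbus.cloudsim.UtilizationModelFull;
import org.cloudbus.cloudsim.Vm;
import org.cloudbus.cloudsim.VmAllocationPolicySimple;
import org.cloudbus.cloudsim.VmSchedulerTimeShared;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class MinimalCloudSimExample
{
    public static void main(String[] args) throws Exception
    {
        // Steps 1-2: initialise the simulation for one user (broker).
        CloudSim.init(1, Calendar.getInstance(), false);

        // Steps 3-4: create a data centre and a broker.
        Datacenter datacenter = createDatacenter("Datacenter_0");
        DatacenterBroker broker = new DatacenterBroker("Broker_0");
        int brokerId = broker.getId();

        // Steps 5-6: create one VM and submit it to the broker.
        List<Vm> vmList = new ArrayList<Vm>();
        vmList.add(new Vm(0, brokerId, 1000, 1, 512, 1000, 10000, "Xen", new CloudletSchedulerTimeShared()));
        broker.submitVmList(vmList);

        // Steps 7-8: create one cloudlet and submit it to the broker.
        UtilizationModelFull full = new UtilizationModelFull();
        Cloudlet cloudlet = new Cloudlet(0, 400000, 1, 300, 300, full, full, full);
        cloudlet.setUserId(brokerId);
        List<Cloudlet> cloudletList = new ArrayList<Cloudlet>();
        cloudletList.add(cloudlet);
        broker.submitCloudletList(cloudletList);

        // Steps 9-11: run the simulation and print the final status.
        CloudSim.startSimulation();
        CloudSim.stopSimulation();
        List<Cloudlet> finished = broker.getCloudletReceivedList();
        for (Cloudlet c : finished)
        {
            Log.printLine("Cloudlet " + c.getCloudletId() + " finished at " + c.getFinishTime());
        }
    }

    // One host with one 1000-MIPS core, 2 GB RAM, 10 Gbit/s bandwidth and 1 TB storage.
    private static Datacenter createDatacenter(String name) throws Exception
    {
        List<Pe> peList = new ArrayList<Pe>();
        peList.add(new Pe(0, new PeProvisionerSimple(1000)));

        List<Host> hostList = new ArrayList<Host>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048), new BwProvisionerSimple(10000),
                1000000, peList, new VmSchedulerTimeShared(peList)));

        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);

        return new Datacenter(name, characteristics, new VmAllocationPolicySimple(hostList),
                new LinkedList<Storage>(), 0);
    }
}

Running it prints a single line reporting when the cloudlet finished on the simulated VM.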


CHAPTER-6

6. ARCHITECTURE OF CLOUD SIM

The figure below shows the multi-layered design of the Cloud-sim software framework and its
architectural components. Initial releases of Cloud-sim used SimJava as the discrete event
simulation engine that supports several core functionalities, such as queuing and processing of
events, creation of Cloud system entities (services, host, data center, broker, VMs),
communication between components, and management of the simulation clock. However, in
the current release, the SimJava layer has been removed in order to allow some advanced
operations that are not supported by it. We provide a finer discussion on these advanced
operations in the next section. The Cloud-sim simulation layer provides support for modeling
and simulation of virtualized Cloud-based data center environments including dedicated
management interfaces for VMs, memory, storage, and bandwidth. The fundamental issues,
such as provisioning of hosts to VMs, managing application execution, and monitoring
dynamic system state, are handled by this layer. A Cloud provider, who wants to study the
efficiency of different policies in allocating its hosts to VMs (VM provisioning), would need
to implement his strategies at this layer. Such implementation can be done by programmatically
extending the core VM provisioning functionality. There is a clear distinction at this layer
related to provisioning of hosts to VMs. A Cloud host can be concurrently allocated to a set of
VMs that execute applications based on the SaaS provider's defined QoS levels. This layer also
exposes the functionalities that a Cloud application developer can extend to perform complex
workload profiling and application performance studies. The top-most layer in the Cloud-sim
stack is the User Code that exposes basic entities for hosts (number of machines, their
specification, and so on), applications (number of tasks and their requirements), VMs, number
of users and their application types, and broker scheduling policies. By extending the basic
entities given at this layer, a Cloud application developer can perform the following activities:
(i) generate a mix of workload request distributions and application configurations; (ii) model
Cloud availability scenarios and perform robust tests based on the custom configurations; and
(iii) implement custom application provisioning techniques for clouds and their federation. As
Cloud computing is still an emerging paradigm for distributed computing, there is a lack of
defined standards, tools, and methods that can efficiently tackle the infrastructure and
application level complexities. Hence, in the near future there will be a number of research


efforts both in academia and industry toward defining core algorithms, policies, and
application benchmarking based on execution contexts. By extending the basic functionalities
already exposed to

FIG 8: CLOUDSIM CORE SIMULATION ENGINE


Cloud-sim, researchers will be able to perform tests based on specific scenarios and
configurations, thereby allowing the development of best practices in all the critical aspects
related to Cloud Computing.

6.1 Modeling the cloud:


The infrastructure-level services (IaaS) related to the clouds can be simulated by extending the
data center entity of Cloud-sim. The data center entity manages a number of host entities. The
hosts are assigned to one or more VMs based on a VM allocation policy that should be defined
by the Cloud service provider. Here, the VM policy stands for the operations control policies
related to VM life cycle such as: provisioning of a host to a VM, VM creation, VM destruction,
and VM migration. Similarly, one or more application services can be provisioned within a
single VM instance, referred to as application provisioning in the context of Cloud computing.
In the context of Cloud-sim, an entity is an instance of a component. A Cloud-sim component
can be a class (abstract or complete) or set of classes that represent one Cloud-sim model (data


center, host). A data center can manage several hosts that in turn manages VMs during their
life cycles. Host is a Cloud-sim component that represents a physical computing server in a
Cloud: it is assigned a pre-configured processing capability (expressed in millions of
instructions per second, MIPS), memory, storage, and a provisioning policy for allocating
processing cores to VMs. The Host component implements interfaces that support modeling
and simulation of both single-core and multi-core nodes. VM allocation (provisioning) [7] is the
process of creating VM instances on hosts that match the critical characteristics (storage,
memory), configurations (software environment), and requirements (availability zone) of the
SaaS provider. Cloud-sim supports the development of custom application service models that
can be deployed within a VM instance and its users are required to extend the core Cloudlet
object for implementing their application services. Furthermore, Cloud-sim does not enforce
any limitation on the service models or provisioning techniques that developers want to
implement and perform tests with. Once an application service is defined and modeled, it is
assigned to one or more pre-instantiated VMs through a service-specific allocation policy.
Allocation of application-specific VMs to hosts in a Cloud-based data center is the
responsibility of a VM allocation controller component (called VmAllocationPolicy). This
component exposes a number of custom methods that aid researchers and developers in the
implementation of new policies based on optimization goals (user centric, system centric, or
both). By default, VmAllocationPolicy implements a straightforward policy that allocates
VMs to the Host on a First-Come-First-Serve (FCFS) basis. Hardware requirements, such as
the number of processing cores, memory, and storage, form the basis for such provisioning.
Other policies, including the ones likely to be expressed by Cloud providers, can also be easily
simulated and modeled in Cloud-sim. However, policies used by public Cloud providers
(Amazon EC2, Microsoft Azure) are not publicly available, and thus a pre-implemented
version of these algorithms is not provided with Cloud-sim. For each Host component, the
allocation of processing cores to VMs is done based on a host allocation policy. This policy
takes into account several hardware characteristics, such as number of CPU cores, CPU share,
and amount of memory (physical and secondary), that are allocated to a given VM instance.
Hence, Cloud-sim supports simulation scenarios that assign specific CPU cores to specific VMs
(a space-shared policy), dynamically distribute the capacity of a core among VMs (time-shared
policy), or assign cores to VMs on demand. Each host component also instantiates a VM
scheduler component, which can either implement the space-shared or the time-shared policy
for allocating cores to VMs. Cloud system/application developers and researchers can further
extend the VM scheduler component for experimenting with custom allocation policies. In the


next section, the finer-level details related to the time-shared and space-shared policies are
described. Fundamental software and hardware configuration parameters related to VMs are
defined in the VM class. Currently, it supports modeling of several VM configurations offered
by Cloud providers such as Amazon EC2.

6.2 Modeling the VM allocation:


One of the key aspects that make a Cloud computing infrastructure different from a Grid
computing infrastructure is the massive deployment of virtualization tools and technologies.
Hence, as against Grids, Clouds contain an extra layer (the virtualization layer) that acts as an
execution, management, and hosting environment for application services. Hence, traditional
application provisioning models that assign individual application elements to computing
nodes do not accurately represent the computational abstraction, which is commonly associated
with Cloud resources. For example, consider a Cloud host that has a single processing core.
There is a requirement of concurrently instantiating two VMs on that host. Although in practice
VMs are contextually (physical and secondary memory space) isolated, still they need to share
the processing cores and system bus. Hence, the amount of hardware resources available to
each VM is constrained by the total processing power and system bandwidth available within
the host. This critical factor must be considered during the VM provisioning process, to avoid
creation of a VM that demands more processing power than is available within the host. In
order to allow simulation of different provisioning policies under varying levels of performance
isolation, Cloud-sim supports VM provisioning at two levels: rst, at the host level and second,
at the VM level. At the host level, it is possible to specify how much of the overall processing
power of each core will be assigned to each VM. At the VM level, the VM assigns a xed
amount of the available processing power to the individual application services (task units) that
are hosted within its execution engine. For the purpose of this paper, we consider a task unit as
a ner abstraction of an application service being hosted in the VM.
At each level, Cloud-sim implements the time-shared and space-shared provisioning policies.
To clearly illustrate the difference between these policies and their effect on the application
service performance, in Figure 4 we show a simple VM provisioning scenario. In this gure, a
host with two CPU cores receives request for hosting two VMs, such that each one requires
two cores and plans to host four tasks units. More specically, tasks t1, t2, t3, and t4 to be
hosted in VM1, whereas t5, t6, t7, and t8 to be hosted in VM2.


Figure 9(a) presents a provisioning scenario where the space-shared policy is applied to both
VMs and task units. As each VM requires two cores, in space-shared mode only one VM can
run at a given instance of time. Therefore, VM2 can only be assigned the cores once VM1
finishes the execution of its task units. The same happens for provisioning tasks within VM1:
since each task unit demands only one core, two of them can run simultaneously while the
remaining tasks wait in the execution queue. By using a space-shared policy, the estimated
finish time of a task p managed by a VM i is given by
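
A form consistent with the definitions that follow, matching the usual Cloud-sim formulation, is

\[ \mathit{eft}(p) = \mathit{est}(p) + \frac{rl}{\mathit{capacity}} \]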

where est(p) is the Cloudlet (cloud task) estimated start time and rl is the total number of
instructions that the Cloudlet will need to execute on a processor. The estimated start time
depends on the position of the Cloudlet in the execution queue, because the processing unit is
used exclusively (space-shared mode) by the Cloudlet. Cloudlets are put in the queue when
there are free processing cores available that can be assigned to the VM. In this policy, the total
capacity of a host having np processing elements (PEs) is given by:
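
A formulation consistent with the definition below, following the usual Cloud-sim model, is

\[ \mathit{capacity} = \frac{\sum_{i=1}^{np} \mathit{cap}(i)}{np} \]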

where cap(i) is the processing strength of individual elements

Figure 9. Effects of different provisioning policies on task unit execution: (a) space-shared
provisioning for VMs and tasks; (b) space-shared provisioning for VMs and time-shared


provisioning for tasks; (c) time-shared provisioning for VMs, space-shared provisioning
for tasks; and (d) time-shared provisioning for VMs and tasks.

In Figure 9(b), a space-shared policy is applied for allocating VMs to hosts and a time-shared
policy forms the basis for allocating task units to processing cores within a VM. Hence, during
a VM lifetime, all the tasks assigned to it are dynamically context switched during their life
cycle. By using a time-shared policy, the estimated finish time of a Cloudlet managed by a VM
is given by
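
Following the same Cloud-sim formulation, a consistent form is

\[ \mathit{eft}(p) = \mathit{ct} + \frac{rl}{\mathit{capacity} \times \mathit{cores}(p)} \]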

where eft(p) is the estimated finish time, ct is the current simulation time, and cores(p) is the
number of cores (PEs) required by the Cloudlet. In time-shared mode, multiple Cloudlets (task
units) can simultaneously multi-task within a VM. In this case, we compute the total processing
capacity of the cloud host as
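
A consistent form, again following the usual Cloud-sim model, is

\[ \mathit{capacity} = \frac{\sum_{i=1}^{np} \mathit{cap}(i)}{\max\left(\sum_{j=1}^{\mathit{cloudlets}} \mathit{cores}(j),\; np\right)} \]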

where cap(i) is the processing strength of individual elements.


In Figure 9(c), a time-shared provisioning is used for VMs, whereas task units are provisioned
based on a space-shared policy. In this case, each VM receives a time slice on each processing
core, which then distributes the slices among task units on a space-shared basis. As the cores
are shared, the amount of processing power available to a VM is variable. This is determined
by the number of VMs that are active on the host. As the task units are assigned based on a space-
shared policy, at any given instance of time only one task will be actively using the processing
core.
Finally, in Figure 9(d) a time-shared allocation is applied to both VMs and task units. Hence,
the processing power is concurrently shared by the VMs and the shares of each VM are
simultaneously divided among its task units. In this case, there are no queuing delays associated
with task units.


6.3 Modeling the cloud market:


Market is a crucial component of the Cloud computing ecosystem; it is necessary for regulating
Cloud resource trading and online negotiations in a public Cloud computing model, where
services are offered in a pay-as-you-go model. Hence, research studies that can accurately
evaluate the cost-to-benefit ratio of emerging Cloud computing platforms are required.
Furthermore, SaaS providers need transparent mechanisms to discover various Cloud
providers' offerings (IaaS, PaaS, SaaS, and their associated costs). Thus, modeling of costs and
economic policies are important aspects to be considered when designing a Cloud simulator.
The Cloud market is modeled based on a multi-layered (two-layer) design. The first layer
contains the economic features related to the IaaS model, such as cost per unit of memory,
cost per unit of storage, and cost per unit of used bandwidth. Cloud customers (SaaS providers)
have to pay for the costs of memory and storage when they create and instantiate VMs, whereas
the costs for network usage are only incurred in the event of data transfer. The second layer
models the cost metrics related to SaaS model. Costs at this layer are directly applicable to the
task units (application service requests) that are served by the application services. Hence, if a
Cloud customer provisions a VM without an application service (task unit), then they would
only be charged for layer 1 resources (i.e. the costs of memory and storage). This behavior may
be changed or extended by Cloud-sim users.

6.4 LATENCY
Latency is the delay from input into a system to desired outcome; the term is understood slightly
differently in various contexts and latency issues also vary from one system to another. Latency
greatly affects how usable and enjoyable electronic and mechanical devices as well as
communications are.

Latency in communication is demonstrated in live transmissions from various points on the


earth as the communication hops between a ground transmitter and a satellite and from a
satellite to a receiver each take time. People connecting from distances to these live events can
be seen to have to wait for responses. This latency is the wait time introduced by the signal
travelling the geographical distance as well as over the various pieces of communications


equipment. Even fiber optics are limited by more than just the speed of light, as the refractive
index of the cable and all repeaters or amplifiers along their length introduce delays.

LATENCY MATRIX

Figure 10. Network communication ow.

6.5 Modeling the Network Behavior:


Modeling comprehensive network topologies to connect simulated Cloud computing entities
(hosts, storage, end-users) is an important consideration because the latency of messages directly
affects the overall service satisfaction experience. An end-user or a SaaS provider consumer
who is not satisfied with the delivered QoS is likely to switch his/her Cloud provider; hence, it
is a very important requirement that Cloud system simulation frameworks provide facilities for
modeling realistic networking topologies and models. Inter-networking of Cloud entities (data
centers, hosts, SaaS providers, and end-users) in Cloud-sim is based on a conceptual
networking abstraction. In this model, there are no actual entities available for simulating
network entities, such as routers or switches. Instead, the network latency that a message can
experience on its path from one Cloud-sim entity (host) to another (Cloud Broker) is simulated
based on the information stored in the latency matrix (see Table I). For example, Table I shows
a latency matrix involving five Cloud-sim entities. At any instance of time, the Cloud-sim
environment maintains an m x n matrix for all Cloud-sim entities currently active in the
simulation context. An entry eij in the matrix represents the delay that a message will undergo
when it is being transferred from entity i to entity j over the network. Recall that Cloud-sim is
an event-based simulation, where different system models/entities communicate via sending


events. The event management engine of Cloud-sim utilizes the inter-entity network latency
information for inducing delays in transmitting message to entities. This delay is expressed in
simulation time units such as milliseconds.
It means that an event from entity i to j will only be forwarded by the event management
engine when the total simulation time reaches the t+d value, where t is the simulation time
when the message was originally sent, and d is the network latency between entities i and j.
The transition diagram representing such an interaction is depicted in Figure 5. This method of
simulating network latencies gives us a realistic yet simple way of modeling practical
networking architecture for a simulation environment. Further, this approach is much easier
and cleaner to implement, manage, and simulate than modeling complex networking
components such as routers, switches etc. The topology description is stored in BRITE [18]
format that contains a number of network nodes, which may be greater than the number of
simulated nodes. These nodes represent various Cloud-sim entities including hosts, data
centers, Cloud Brokers etc. This BRITE information is loaded every time Cloud-sim is
initialized and is used for generating latency matrix. Data centers and brokers are also required
to be mapped as the network nodes. Further, any two Cloud-sim entities cannot be mapped to
the same network node. Messages (events) sent by Cloud-sim entities are first processed by the
Network Topology object that stores the network topology information. This object augments
the latency information to the event and passes it on to the event management engine for further
processing. Let us consider an example scenario in which a data center is mapped to the first
node and the Cloud broker to the fifth node in a sample BRITE network (see Table I). When a
message is sent from the broker to the data center, the corresponding delay, stored at the
element (1, 5) of the latency matrix (200ms in this example), is added to the corresponding
event. Therefore, the event management engine will take this delay into account before
forwarding the event to the destination entity. By using an external network description file
(stored in BRITE format), we allow reuse of the same topology in different experiments.
Moreover, the logical number of nodes present in the configuration file can be greater
than the number of actual simulated entities; therefore, the network modeling approach does
not compromise the scalability of the experiments. For example, every time there are additional
entities to be included in the simulation, they only need to be mapped to the BRITE nodes that
are not currently mapped to any active Cloud-sim entities. Hence, there will always exist a
scope to grow the overall network size based on application service and Cloud computing
environment scenarios.
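To make this concrete, the following Java sketch (a minimal example assuming the NetworkTopology helper class of recent Cloud-sim releases) shows how a BRITE topology file can be attached to a simulation so that the latency matrix drives message delays. The file name topology.brite and the chosen node indices are illustrative only.

import org.cloudbus.cloudsim.Datacenter;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.NetworkTopology;

public final class BriteLatencySetup {

    /**
     * Loads a BRITE topology description and maps a data center and a broker
     * onto two distinct BRITE nodes, so that the event management engine adds
     * the corresponding latency-matrix delay to messages exchanged between them.
     */
    public static void attachTopology(Datacenter datacenter, DatacenterBroker broker,
                                      String briteFile) {
        NetworkTopology.buildNetworkTopology(briteFile); // e.g. "topology.brite"
        NetworkTopology.mapNode(datacenter.getId(), 0);  // data center -> first BRITE node
        NetworkTopology.mapNode(broker.getId(), 4);      // broker -> fifth BRITE node

        // Delay (in simulation time units) induced for broker -> data center messages.
        double delay = NetworkTopology.getDelay(broker.getId(), datacenter.getId());
        System.out.println("Broker -> data center latency: " + delay + " ms");
    }

    private BriteLatencySetup() { }
}

Calling such a helper after the entities have been created (and before the simulation is started) is all that is needed to reuse the same external topology file across different experiments.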

6.6 Modeling a federation of clouds:


In order to federate or inter-network multiple clouds, there is a requirement for modeling a
CloudCoordinator entity. This entity is responsible not only for communicating with other data
centers and end-users in the simulation environment, but also for monitoring and managing the
internal state of a data center entity. The information received as part of the monitoring process, which is active throughout the simulation period, is utilized for making decisions related to inter-cloud provisioning. Note that no software object offering functionality similar to the Cloud Coordinator is currently offered by existing providers, such as Amazon, Azure, or Google App Engine. Hence, if a developer of a real-world Cloud system wants to federate services
from multiple clouds, they will be required to develop a Cloud Coordinator component. By
having such an entity to manage the federation of Cloud-based data centers, aspects related to
communication and negotiation with foreign entities are isolated from the data center core.
Therefore, by providing such an entity among its core objects, Cloud-sim helps Cloud
developers in speeding up their application service performance testing.
The two fundamental aspects that must be handled when simulating a federation of clouds are communication and monitoring. The first aspect (communication) is handled by the data center through the standard event-based messaging process. The second aspect (data center monitoring) is carried out by the Cloud Coordinator. Every data center in Cloud-sim needs to instantiate this entity in order to make itself a part of a Cloud federation. The Cloud Coordinator triggers the inter-cloud load adjustment process based on the state of the data center. The specific set of events that affect the adjustment are implemented via specific sensor entities. Each sensor entity monitors a particular parameter (such as under-provisioning, over-provisioning, or SLA violation) related to the data center. For enabling online monitoring of a data center host, a sensor that keeps track of the host status (utilization, heating) is attached to the Cloud Coordinator. At every monitoring step, the Cloud Coordinator queries the sensor. If a certain pre-configured threshold is reached, the Cloud Coordinator starts communication with its peers (other Cloud Coordinators in the federation) for possible load-shedding. The negotiation protocol, load-shedding policy, and compensation mechanism can
be easily extended to suit a particular research study.
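As an illustration of this monitoring loop, the sketch below shows one possible shape of the sensor-driven threshold check. It is only a sketch: the HostSensor interface and all method names are assumptions modeled on the description above, not actual Cloud-sim classes.

interface HostSensor {
    void setThresholds(double min, double max); // (i) set the min/max thresholds
    double update();                            // (ii) refresh and return the measurement
}

class CloudCoordinatorSketch {

    private final HostSensor utilizationSensor;
    private final double overloadThreshold;

    CloudCoordinatorSketch(HostSensor sensor, double overloadThreshold) {
        this.utilizationSensor = sensor;
        this.overloadThreshold = overloadThreshold;
        sensor.setThresholds(0.0, overloadThreshold);
    }

    /** One monitoring step: query the sensor and, if the pre-configured
     *  threshold is reached, start negotiating with peer coordinators. */
    void monitoringStep() {
        double utilization = utilizationSensor.update();
        if (utilization >= overloadThreshold) {
            requestLoadShedding(utilization);
        }
    }

    private void requestLoadShedding(double utilization) {
        // A full implementation would message the peer Cloud Coordinators in the
        // federation and apply the negotiation and compensation policy under study.
        System.out.println("Overloaded (" + utilization + "): contacting federation peers");
    }
}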

6.7. Modeling dynamic workloads

Software developers and third-party service providers often deploy applications that exhibit dynamic behavior in terms of workload patterns, availability, and scalability requirements. Typically, Cloud computing thrives on highly varied and elastic service and infrastructure demands. Leading Cloud vendors, including Amazon and Azure, expose VM containers/templates to host a range of SaaS types and provide SaaS providers with the notion of an unlimited resource pool that can be leased on the fly with the requested configurations.
Pertaining to the aforementioned facts, it is an important requirement that any simulation environment supports the modeling of dynamic workload patterns driven by application or SaaS models. In order to allow simulation of dynamic behaviors within Cloud-sim, we have made a number of extensions to the existing framework, in particular to the Cloudlet entity. We have designed an additional simulation entity within Cloud-sim, referred to as the Utilization Model, that exposes methods and variables for defining the resource and VM-level requirements of a SaaS application at the instant of deployment. In the Cloud-sim framework, Utilization Model is an abstract class that must be extended for implementing a workload pattern required to model the application's resource demand. Cloud-sim users are required to override the method getUtilization(), whose input is a discrete time parameter and whose return value is the percentage of computational resource required by the Cloudlet.
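A minimal sketch of such a workload pattern is shown below. It assumes the UtilizationModel type exposed by recent Cloud-sim releases, in which getUtilization() receives the simulation time and returns the share of the requested resource; the ramp-up shape itself is purely illustrative.

import org.cloudbus.cloudsim.UtilizationModel;

/**
 * A hypothetical workload pattern: resource demand ramps up linearly during the
 * first 100 time units of the simulation and then settles at 90% of the
 * requested capacity.
 */
public class RampUpUtilizationModel implements UtilizationModel {

    @Override
    public double getUtilization(double time) {
        if (time < 100) {
            return 0.1 + 0.8 * (time / 100.0); // ramp from 10% to 90%
        }
        return 0.9;                            // steady state afterwards
    }
}

An instance of this class can then be supplied to a Cloudlet at creation time so that the Cloudlet's CPU (or RAM/bandwidth) demand follows the defined pattern during the simulation.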
Another important requirement for Cloud computing environments is to ensure that the agreed SLA in terms of QoS parameters, such as availability, reliability, and throughput, is delivered to the applications. Although modern virtualization technologies can ensure performance isolation between applications running on different VMs, there still exists scope for developing methodologies at the VM provisioning level that can further improve resource utilization. A lack of intelligent methodologies for VM provisioning raises the risk that all VMs deployed on a single host may not get the adequate amount of processor share that is essential for fulfilling the agreed SLAs. This may lead to performance loss in terms of response time, time-outs, or failures in the worst case. The resource provider must take into account such behaviors and initiate the necessary actions to minimize the effect on application performance. To simulate such behavior, the SLA model can either be defined as fully allocating the requested amount of resources or as allowing flexible resource allocations up to a specific rate as long as the agreed SLA can be delivered (e.g. allowing the CPU share to be 10% below the requested amount). Cloud-sim supports modeling of the aforementioned SLA violation scenarios. Moreover, it is possible to define particular SLA-aware policies describing how the available capacity is distributed among competing VMs in case of a lack of resources. The number of SLA violation
events as well as the amount of resource that was requested but not allocated can be accounted
for by Cloud-sim.
6.8 Modeling data center power consumption
Cloud computing environments are built upon an inter-connected network of a large number (hundreds of thousands) of computing and storage hosts for delivering on-demand services (IaaS, PaaS, and SaaS). Such infrastructures, in conjunction with their cooling systems, may consume an enormous amount of electrical power, resulting in high operational costs. A lack of energy-conscious provisioning techniques may lead to overheating of Cloud resources (compute and storage servers) in case of high loads. This in turn may result in reduced system reliability and lifespan of devices. Another related issue is carbon dioxide (CO2) emission, which is detrimental to the physical environment due to its contribution to the greenhouse effect. All these problems require the development of efficient energy-conscious provisioning policies at the resource, VM, and application levels.
To this end, the Cloud-sim framework provides basic models and entities to validate and evaluate energy-conscious provisioning techniques/algorithms. We have made a number of extensions to Cloud-sim for facilitating this, such as extending the PE object to include an additional Power Model object for managing power consumption on a per-host basis. To support modeling and simulation of different power consumption models and power management techniques such as Dynamic Voltage and Frequency Scaling (DVFS), we provide an abstract implementation called Power Model. This abstract class should be extended for simulating the custom power consumption model of a PE. Cloud-sim users need to override the method getPower() of this class, whose input parameter is the current utilization metric of a Cloud host and whose return value is the current power consumption. This capability
enables the creation of energy-conscious provisioning policies that require real-time
knowledge of power consumption by Cloud system components. Furthermore, it enables the
accounting of the total energy consumed by the system during the simulation period.
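The sketch below illustrates such a model under the common assumption, also used in the energy case study later in this report, that power consumption is a constant static part plus a dynamic part that grows linearly with utilization. The text above describes Power Model as an abstract class; in the Cloud-sim 3.x code base it is an interface with a single getPower() method, which is what this sketch assumes, and the wattage figures are illustrative.

import org.cloudbus.cloudsim.power.models.PowerModel;

/**
 * A simple linear power model: constant static power for a switched-on host
 * plus a dynamic component proportional to utilization (0..1). getPower()
 * returns the current power consumption in watts.
 */
public class LinearPowerModel implements PowerModel {

    private final double staticPower; // watts consumed at idle (e.g. 70 W)
    private final double maxPower;    // watts consumed at full load (e.g. 250 W)

    public LinearPowerModel(double staticPower, double maxPower) {
        this.staticPower = staticPower;
        this.maxPower = maxPower;
    }

    @Override
    public double getPower(double utilization) throws IllegalArgumentException {
        if (utilization < 0 || utilization > 1) {
            throw new IllegalArgumentException("Utilization must lie in [0, 1]");
        }
        return staticPower + (maxPower - staticPower) * utilization;
    }
}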

6.9 Modeling dynamic entity creation


Clouds offer a pool of software services and hardware servers on an unprecedented scale, which gives businesses a unique ability to handle temporal variations in demand through dynamic provisioning or de-provisioning of capabilities from clouds. Actual usage patterns of many enterprise services (business applications) vary with time, mostly in an unpredictable way. This leads to the necessity for Cloud providers to deal with customers who can enter or
leave the system at any time. Cloud-sim allows such simulation scenarios by supporting
dynamic creation of different kinds of entities. Apart from the dynamic creation of user and
broker entities, it is also possible to add and remove data center entities at run-time. This
functionality might be useful for simulating a dynamic environment where system components
can join, fail, or leave the system randomly. After creation, new entities automatically register
themselves in the Cloud Information Service (CIS) to enable dynamic resource discovery.

CHAPTER-7
7. DESIGN AND IMPLEMENTATION OF Cloud-sim
In this section, we provide the finer details related to the fundamental classes of Cloud-sim, which are also the building blocks of the simulator. The overall class design diagram for Cloud-sim is shown in Figure 11.
BwProvisioner: This is an abstract class that models the policy for provisioning of bandwidth to VMs. The main role of this component is to undertake the allocation of network bandwidth to a set of competing VMs that are deployed across the data center. Cloud system developers and researchers can extend this class with their own policies (priority, QoS) to reflect the needs of their applications. The BwProvisionerSimple allows a VM to reserve as much bandwidth as required; however, this is constrained by the total available bandwidth of the host.
CloudCoordinator: This abstract class extends a Cloud-based data center to the federation. It is responsible for periodically monitoring the internal state of data center resources and, based on that, it undertakes dynamic load-shedding decisions. Concrete implementations of this component include the specific sensors and the policy that should be followed during load-shedding. Monitoring of data center resources is performed by the updateDatacenter() method by sending queries to Sensors. Service/Resource Discovery is realized in the setDatacenter() abstract method that can be extended for implementing custom protocols and mechanisms (multicast, broadcast, peer-to-peer). Further, this component can also be extended for simulating Cloud-based services such as the Amazon
EC2 Load-Balancer. Developers aiming to deploy their application services across multiple clouds can extend this class for implementing their custom inter-cloud provisioning policies.

Figure 11. Cloud-sim class design diagram.
Cloudlet: This class models the Cloud-based application services (such as content delivery,
social networking, and business workflow). Cloud-sim represents the complexity of an
application in terms of its computational requirements. Every application service has a pre-
assigned instruction length and data transfer (both pre and post fetches) overhead that it needs
to undertake during its life cycle. This class can also be extended to support modeling of other
performance and composition metrics for applications such as transactions in database-oriented
applications.
CloudletScheduler: This abstract class is extended by the implementation of different policies that determine the share of processing power among Cloudlets in a VM. As described previously, two types of provisioning policies are offered: space-shared (CloudletSchedulerSpaceShared) and time-shared (CloudletSchedulerTimeShared).
Datacenter: This class models the core infrastructure-level services (hardware) that are offered
by Cloud providers (Amazon, Azure, App Engine). It encapsulates a set of compute hosts that
can either be homogeneous or heterogeneous with respect to their hardware configurations
(memory, cores, capacity, and storage). Furthermore, every Datacenter component instantiates
a generalized application provisioning component that implements a set of policies for
allocating bandwidth, memory, and storage devices to hosts and VMs.
Data center Broker or Cloud Broker: This class models a broker, which is responsible for
mediating negotiations between SaaS and Cloud providers; and such negotiations are driven
by QoS requirements. The broker acts on behalf of SaaS providers. It discovers suitable Cloud
service providers by querying the CIS and undertakes online negotiations for allocation of
resources/services that can meet the application's QoS needs. Researchers and system developers must extend this class for evaluating and testing custom brokering policies. The difference between the broker and the Cloud Coordinator is that the former represents the customer (i.e. decisions of these components are made in order to increase user-related performance metrics), whereas the latter acts on behalf of the data center, i.e. it tries to maximize the overall performance of the data center, without considering the needs of specific
customers.
Datacenter Characteristics: This class contains conguration information of data center
resources.

Host: This class models a physical resource such as a compute or storage server. It encapsulates
important information such as the amount of memory and storage, a list and type of processing
cores (to represent a multi-core machine), a policy for sharing the processing
power among VMs, and policies for provisioning memory and bandwidth to the VMs.
Network Topology: This class contains the information for inducing network behavior
(latencies) in the simulation. It stores the topology information, which is generated using the
BRITE topology generator.
RamProvisioner: This is an abstract class that represents the provisioning policy for allocating
primary memory (RAM) to the VMs. The execution and deployment of a VM on a host are feasible only if the RamProvisioner component approves that the host has the required amount of free memory. The RamProvisionerSimple does not enforce any limitation on the amount of
memory that a VM may request. However, if the request is beyond the available memory
capacity, then it is simply rejected.
SanStorage: This class models a storage area network that is commonly available in Cloud-based data centers for storing large chunks of data (such as Amazon S3 and Azure blob storage). SanStorage implements a simple interface that can be used to simulate the storage and retrieval of any amount of data, subject to the availability of network bandwidth. Accessing files in a SAN at run-time incurs additional delays for task unit execution, due to the additional latencies incurred in transferring the data files through the data center's internal network.
Sensor: This interface must be implemented to instantiate a sensor component that can be used by a Cloud Coordinator for monitoring specific performance parameters (energy consumption, resource utilization). Recall that the Cloud Coordinator utilizes dynamic performance information for undertaking load-balancing decisions. The methods defined by this interface are: (i) set the minimum and maximum thresholds for the performance parameter and (ii) periodically update the measurement. This class can be used to model real-world services offered by leading Cloud providers such as Amazon's CloudWatch and Microsoft Azure's Fabric Controller. One data center may instantiate one or more Sensors, each one responsible for monitoring a specific data center performance parameter.
Vm: This class models a VM, which is managed and hosted by a Cloud host component. Every
VM component has access to a component that stores the following characteristics related to a
VM: accessible memory, processor, storage size, and the VM's internal provisioning policy
that is extended from an abstract component called the CloudletScheduler.
VmmAllocationPolicy: This abstract class represents a provisioning policy that a VM Monitor utilizes for allocating VMs to hosts. The chief functionality of the VmmAllocationPolicy is to select
the available host in a data center that meets the memory, storage, and availability requirement
for a VM deployment.
VmScheduler: This is an abstract class implemented by a Host component that models the
policies (space-shared, time-shared) required for allocating processor cores to VMs. The
functionalities of this class can easily be overridden to accommodate application-specific
processor sharing policies.
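To show how several of these classes fit together, the sketch below builds a deliberately small scenario: one data center with a single host, one broker, one VM, and one Cloudlet. It assumes the Cloud-sim 3.x API, in which some class names differ slightly from the names used above (for example, VmAllocationPolicySimple for the VM allocation policy); all numeric parameters are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;

import org.cloudbus.cloudsim.Cloudlet;
import org.cloudbus.cloudsim.CloudletSchedulerTimeShared;
import org.cloudbus.cloudsim.Datacenter;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.DatacenterCharacteristics;
import org.cloudbus.cloudsim.Host;
import org.cloudbus.cloudsim.Pe;
import org.cloudbus.cloudsim.Storage;
import org.cloudbus.cloudsim.UtilizationModelFull;
import org.cloudbus.cloudsim.Vm;
import org.cloudbus.cloudsim.VmAllocationPolicySimple;
import org.cloudbus.cloudsim.VmSchedulerTimeShared;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class MinimalScenario {

    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false);         // 1 user, no trace events

        // One host: a single 1000 MIPS core, 2 GB RAM, 1 Gbps bandwidth, 1 TB storage.
        List<Pe> pes = new ArrayList<Pe>();
        pes.add(new Pe(0, new PeProvisionerSimple(1000)));
        List<Host> hosts = new ArrayList<Host>();
        hosts.add(new Host(0, new RamProvisionerSimple(2048),
                new BwProvisionerSimple(1000000), 1000000, pes,
                new VmSchedulerTimeShared(pes)));

        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hosts, 10.0, 3.0, 0.05, 0.001, 0.0);
        Datacenter datacenter = new Datacenter("Datacenter_0", characteristics,
                new VmAllocationPolicySimple(hosts), new LinkedList<Storage>(), 0);

        DatacenterBroker broker = new DatacenterBroker("Broker_0");

        // One VM (250 MIPS, 512 MB RAM) and one Cloudlet submitted through the broker.
        Vm vm = new Vm(0, broker.getId(), 250, 1, 512, 1000, 10000,
                "Xen", new CloudletSchedulerTimeShared());
        Cloudlet cloudlet = new Cloudlet(0, 150000, 1, 300, 300,
                new UtilizationModelFull(), new UtilizationModelFull(),
                new UtilizationModelFull());
        cloudlet.setUserId(broker.getId());

        broker.submitVmList(Arrays.asList(vm));
        broker.submitCloudletList(Arrays.asList(cloudlet));

        CloudSim.startSimulation();
        CloudSim.stopSimulation();
        System.out.println(datacenter.getName() + " finished "
                + broker.getCloudletReceivedList().size() + " cloudlet(s)");
    }
}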

7.1 Cloud-sim core simulation framework


As discussed previously, GridSim is one of the building blocks of Cloud-sim. However, GridSim uses the SimJava library as a framework for event handling and inter-entity message passing. SimJava has several limitations that impose restrictions on the creation of scalable simulation environments:
It does not allow resetting the simulation programmatically at run-time.
It does not support the creation of new simulation entities at run-time (once the simulation has been initiated).
The multi-threaded nature of SimJava leads to performance overhead with an increase in system size; the performance degradation is caused by excessive context switching between threads.
Multi-threading brings additional complexity with regard to system debugging.
To overcome these limitations and to enable simulation of complex scenarios that can involve a large number of entities (on a scale of thousands), we developed a new discrete event management framework. The class diagram of this new core is presented in Figure 12(a). The related classes are the following:
Cloud-sim: This is the main class, which is responsible for managing event queues and controlling the step-by-step (sequential) execution of simulation events. Every event generated by a Cloud-sim entity at run-time is stored in a queue called future events. These events are sorted by their time parameter and inserted into the queue. Next, the events that are scheduled at each step of the simulation are removed from the future events queue and transferred to the deferred
event queue. Following this, an event processing method is invoked for each entity, which chooses events from the deferred event queue and performs appropriate actions.

Figure 12. Cloud-sim core simulation framework class diagram: (a) main classes and (b) predicates.

Such an organization allows flexible management of the simulation and provides the following powerful capabilities:
Deactivation (holding) of entities.
Context switching of entities between different states (e.g. waiting to active).
Pausing and resuming the process of simulation.
Creation of new entities at run-time.
Aborting and restarting simulation at run-time.
DeferredQueue: This class implements the deferred event queue used by Cloud-sim.
FutureQueue: This class implements the future event queue accessed by Cloud-sim.
CloudInformationService: The CIS is an entity that provides resource registration, indexing, and discovery capabilities. The CIS supports two basic primitives: (i) publish(), which allows entities to register themselves with the CIS, and (ii) search(), which allows entities such as the CloudCoordinator and Brokers to discover the status and endpoint contact address of other entities. This entity also notifies the other entities about the end of the simulation.
SimEntity: This is an abstract class that represents a simulation entity that is able to send messages to other entities, process received messages, and fire and handle events. All entities must extend this class and override its three core methods: startEntity(), processEvent(), and shutdownEntity(), which define the actions for entity initialization, processing of events, and entity destruction, respectively. The SimEntity class provides the ability to schedule new events
and send messages to other entities, where network delay is calculated according to the BRITE
model. Once created, entities automatically register with CIS.
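The following sketch shows a minimal custom entity built on this contract, assuming the SimEntity base class of the Cloud-sim core package; the event tag value is an application-defined constant chosen here for illustration.

import org.cloudbus.cloudsim.core.SimEntity;
import org.cloudbus.cloudsim.core.SimEvent;

/**
 * A minimal custom entity: the three overridden methods define what happens at
 * initialization, on every received event, and at entity destruction.
 */
public class EchoEntity extends SimEntity {

    private static final int ECHO_TAG = 900001; // hypothetical custom event tag

    public EchoEntity(String name) {
        super(name);
    }

    @Override
    public void startEntity() {
        // Schedule a message to ourselves 10 simulation time units after start-up.
        schedule(getId(), 10.0, ECHO_TAG, "hello");
    }

    @Override
    public void processEvent(SimEvent ev) {
        if (ev.getTag() == ECHO_TAG) {
            System.out.println(getName() + " received: " + ev.getData());
        }
    }

    @Override
    public void shutdownEntity() {
        System.out.println(getName() + " is shutting down");
    }
}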
Cloud-simTags: This class contains various static event/command tags that indicate the type of action that needs to be undertaken by Cloud-sim entities when they receive or send events.
SimEvent: This entity represents a simulation event that is passed between two or more entities. SimEvent stores the following information about an event: type, init time, time at which the event should occur, finish time, time at which the event should be delivered to its destination entity, IDs of the source and destination entities, tag of the event, and data that have to be passed to the destination entity.
Cloud-simShutdown: This is an entity that waits for the termination of all end-user and broker entities, and then signals the end of the simulation to the CIS.
Predicate: Predicates are used for selecting events from the deferred queue. This is an abstract class and must be extended to create a new predicate. Some standard predicates are provided, as presented in Figure 12(b).
PredicateAny: This class represents a predicate that matches any event on the deferred event queue. There is a publicly accessible instance of this predicate in the Cloud-sim class, called Cloud-sim.SIM_ANY, and hence no new instances need to be created.
PredicateFrom: This class represents a predicate that selects events fired by specific entities.
PredicateNone: This represents a predicate that does not match any event on the deferred event queue. There is a publicly accessible static instance of this predicate in the Cloud-sim class, called Cloud-sim.SIM_NONE; hence, users do not need to create any new instances of this class.
PredicateNotFrom: This class represents a predicate that selects events that have not been sent by specific entities.
PredicateNotType: This class represents a predicate that selects events that do not match specific tags.
PredicateType: This class represents a predicate that selects events with specific tags.

7.2 Data center internal processing


Processing of task units is handled by the respective VMs; therefore, their progress must be continuously updated and monitored at every simulation step. For handling this, an internal event is generated to inform the DataCenter entity that a task unit completion is expected in the near future. Thus, at each simulation step, each DataCenter entity invokes a method called updateVMsProcessing()
for every host that it manages.

Figure 13. Cloudlet processing update process.

Following this, the contacted VMs update processing of
currently active tasks with the host. The input parameter type for this method is the current
simulation time and the return parameter type is the next expected completion time of a task
currently running in one of the VMs on that host. The next internal event time is the least time
among all the finish times returned by the hosts. At the host level, the invocation of updateVMsProcessing() triggers an updateCloudletsProcessing() method that directs every VM to update the status of its task units (finished, suspended, executing) with the Datacenter entity.
This method implements a similar logic as described previously for updateVMsProcessing()
but at the VM level. Once this method is called, VMs return the next expected completion time
of the task units currently managed by them. The least completion time among all the computed
values is sent to the Datacenter entity. As a result, completion times are kept in a queue that is
queried by the Datacenter after each event processing step. The completed tasks waiting in the finish queue are directly returned to the Cloud Broker or Cloud Coordinator. This
process is depicted in Figure 13 in the form of a sequence diagram.
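The essence of this update loop can be summarized by the simplified sketch below. It is illustrative only: the HostModel interface and its method signature are assumptions that mirror the description above, whereas in Cloud-sim itself this logic is embedded in the Datacenter and Host classes.

import java.util.List;

/** Hypothetical stand-in for a host that can refresh its VMs' task progress. */
interface HostModel {
    /** Updates the VMs on this host and returns the earliest expected
     *  completion time of any task unit they currently run. */
    double updateVMsProcessing(double currentTime);
}

class DatacenterUpdateSketch {

    /** Returns the time at which the data center should schedule its next internal event. */
    static double nextInternalEventTime(List<HostModel> hosts, double currentTime) {
        double next = Double.MAX_VALUE;
        for (HostModel host : hosts) {
            // Each host asks its VMs to update task progress and reports the
            // least expected finish time among the Cloudlets it currently runs.
            next = Math.min(next, host.updateVMsProcessing(currentTime));
        }
        return next; // the least finish time among all hosts
    }
}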

7.3 Communication among entities


Figure 14 depicts the flow of communication among core Cloud-sim entities. At the beginning of a simulation, each Datacenter entity registers with the CIS Registry. The CIS then provides information registry-type functionalities, such as match-making services for mapping user/broker requests to suitable Cloud providers. Next, the DataCenter brokers, acting on behalf of users, consult the CIS service to obtain the list of cloud providers who can offer infrastructure services that match the application's QoS, hardware, and software requirements. In the event of a match, the DataCenter broker deploys the application with the CIS-suggested cloud. The communication flow described so far relates to the basic flow in a simulated experiment. Some variations in this flow are possible depending on policies. For example, messages from Brokers to Datacenters may require
a confirmation from other parts of the Datacenter about the execution of an action, or about the maximum number of VMs that a user can create.

Figure 14. Simulation data flow.

CHAPTER-8
8. SCOPE AND OBJECTIVES:
The scope describes the advantages and disadvantages that result from the implementation.
The objective states the main theme of the work and the problem that is solved through this implementation.

8.1 EXISTING SYSTEMS:


Cloud computing provides seemingly unlimited virtualized resources to users as services across the whole Internet, while hiding platform and implementation details. Today's cloud service providers offer both highly available storage and massively parallel computing resources at relatively low costs.
As cloud computing becomes prevalent, an increasing amount of data is being stored in the cloud and shared by users with specified privileges, which define the access rights to the stored data.

8.1.1 EXISTING SYSTEM DISADVANTAGES:


One critical challenge of cloud storage services is the management of the ever-increasing volume of data.
Duplicate data in cloud storage keeps increasing, which inflates the occupied storage space and consumes extra bandwidth.

8.2 PROPOSED SYSTEM:


We enhance our system in security. Specifically, we present an advanced scheme to support stronger security by encrypting the file with differential privilege keys. In this way, users without the corresponding privileges cannot perform the duplicate check. Furthermore, such unauthorized users cannot decrypt the cipher text even if they collude with the S-CSP. Security analysis demonstrates that our system is secure in terms of the definitions specified in the proposed security model.
We consider file-level de-duplication for simplicity. In other words, we refer to a data copy as a whole file, and file-level de-duplication eliminates the storage of any redundant files. Actually, block-level de-duplication can be easily deduced from file-level de-duplication. Specifically,
to upload a file, a user first performs the file-level duplicate check. If the file is a duplicate,
then all its blocks must be duplicates as well; otherwise, the user further performs the block-
level duplicate check and identifies the unique blocks to be uploaded. Each data copy (i.e., a
file or a block) is associated with a token for the duplicate check.
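As a rough illustration of how such tokens could be derived, the sketch below computes a convergent key as a digest of the file content and a privilege-bound duplicate-check token as an HMAC of the file tag under a per-privilege secret key, so that only users holding that privilege key can produce a matching token. This is only one common construction, sketched for clarity; it is not necessarily the exact scheme implemented in this work.

import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class DedupTokenSketch {

    /** Convergent key: a deterministic digest of the file content. */
    public static byte[] convergentKey(byte[] fileContent) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(fileContent);
    }

    /** Duplicate-check token bound to a privilege key (HMAC over the file tag). */
    public static byte[] duplicateCheckToken(byte[] fileTag, byte[] privilegeKey)
            throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(privilegeKey, "HmacSHA256"));
        return mac.doFinal(fileTag);
    }
}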
S-CSP. This is an entity that provides a data storage service in the public cloud. The S-CSP
provides the data outsourcing service and stores data on behalf of the users. To reduce the
storage cost, the S-CSP eliminates the storage of redundant data via de-duplication and keeps
only unique data. In this paper, we assume that S-CSP is always online and has abundant
storage capacity and computation power.

FIG 15: Architecture for Authorized De-duplication

Data Users. A user is an entity that wants to outsource data storage to the S-CSP and access the data later. In a storage system supporting de-duplication, the user only uploads unique data and does not upload any duplicate data, to save upload bandwidth; the duplicate data may be owned by the same user or by different users. In the authorized de-duplication system, each user is issued a set of privileges in the setup of the system. Each file is protected with the convergent encryption key and privilege keys to realize authorized de-duplication with differential privileges.
Private Cloud. Compared with the traditional de-duplication architecture in cloud computing, this is a new entity introduced for facilitating users' secure usage of cloud services. Specifically, since the computing resources at the data user/owner side are restricted and the public cloud is not
fully trusted in practice, the private cloud provides the data user/owner with an execution environment and infrastructure, working as an interface between the user and the public cloud. The private keys for the privileges are managed by the private cloud, which answers the file token requests from the users. The interface offered by the private cloud allows the user to submit files and queries to be securely stored and computed, respectively.
Notice that this is a novel architecture for data de-duplication in cloud computing, which consists of twin clouds (i.e., the public cloud and the private cloud). Actually, this hybrid cloud setting has attracted more and more attention recently. For example, an enterprise might use a public cloud service, such as Amazon S3, for archived data, but continue to maintain in-house storage for operational customer data. Alternatively, the trusted private cloud could be a cluster of virtualized cryptographic co-processors, which are offered as a service by a third party and provide the necessary hardware-based security features to implement a remote execution environment trusted by the user.

8.2.2 ADVANTAGES OF PROPOSED SYSTEM:


The user is only allowed to perform the duplicate check for files marked with the corresponding privileges.
We present an advanced scheme to support stronger security by encrypting the file with differential privilege keys.
The storage size of the tags for integrity check is reduced.
The security of de-duplication is enhanced and data confidentiality is protected.

8.3 OBJECTIVE:
The notion of authorized data de-duplication was proposed to protect data security by including differential privileges of users in the duplicate check. We also presented several new de-duplication constructions supporting authorized duplicate check in a hybrid cloud architecture, in which the duplicate-check tokens of files are generated by the private cloud server with private keys. Security analysis demonstrates that our schemes are secure in terms of the insider and outsider attacks specified in the proposed security model. As a proof of concept, we implemented a prototype of our proposed authorized duplicate check scheme and conducted testbed experiments on our prototype. We showed that our authorized duplicate check scheme incurs minimal overhead compared to convergent encryption and network transfer.

8.4 FUTURE SCOPE:


The present model excludes the security problems that may arise in its practical deployment; these can be addressed in future work. It can also contribute to national security by protecting sensitive data. It saves memory by de-duplicating the data and thus provides us with sufficient storage. It provides authorization to private firms and protects the confidentiality of important data.

CHAPTER-9

9. EXPERIMENTS AND EVALUATION


In this section, we present the experiments and evaluation that we undertook in order to quantify the efficiency of Cloud-sim in modeling and simulating Cloud computing environments.

9.1. Cloud-sim: scalability and overhead evaluation

The first tests that we present here are aimed at analyzing the overhead and scalability of memory usage, and the overall efficiency of Cloud-sim. The tests were conducted on a machine that had two Intel Xeon Quad-core 2.27 GHz processors and 16 GB of RAM. All of these hardware resources were made available to a VM running Ubuntu 8.04 that was used for running the tests. The test simulation environment setup for measuring the overhead and memory usage of Cloud-sim included the DataCenterBroker and DataCenter (hosting a number of machines) entities. In the first test, all the machines were hosted within a single data center. Then, for the next test, the machines were symmetrically distributed across two data centers. The number of hosts in both experiments was varied from 1,000 to 1,000,000.
Each experiment was repeated 30 times. For the memory test, the total physical memory usage required for fully instantiating and loading the Cloud-sim environment was profiled. For the overhead test, the total delay in instantiating the simulation environment was computed as the time difference between the following events: (i) the time at which the run-time environment (Java VM) is instructed to load the Cloud-sim framework; and (ii) the instant at which Cloud-sim's entities are fully initialized and are ready to process events.
Figure 16(a) presents the average amount of time that was required for setting up the simulation as a function of the number of hosts considered in the experiment. Figure 16(b) plots the amount of memory that was required for successfully conducting the tests. The results showed that the overhead does not grow linearly with the system size. Instead, we observed that it grows in steps when a specific number of hosts was used in the experiment. The obtained results showed that the time to instantiate an experiment setup with 1 million hosts is around 12 s. These observations proved that Cloud-sim is capable of supporting a large-scale simulation environment with little or no overhead as regards initialization time and memory consumption.

Hence, Cloud-sim offers significant benefits as a performance testing platform when compared with real-world Cloud offerings. It is almost impossible to compute the time and economic overhead that would be incurred in setting up such a large-scale test environment on Cloud platforms (Amazon EC2, Azure). The results showed almost the same behavior under different system sizes (Cloud infrastructure deployed across one or two data centers). The same behavior was observed for the cases when only one and two data centers

were simulated, although the latter had averages that were slightly smaller than the former. This difference was statistically significant (according to unpaired t-tests run with samples for one and two data centers for each value of the number of hosts), and it can be explained by the efficient use of a multicore machine by the Java VM.

Figure 16. Cloud-sim evaluation: (a) overhead and (b) memory consumption.

Figure 17. Simulation of scheduling policies: (a) space-shared and (b) time-shared.
As regards memory overhead, we observed a linear growth with an increase in the number of hosts, and the total memory usage never grew beyond 320 MB even for larger system sizes. This result indicated an improvement in the performance of the recent version of Cloud-sim (2.0) as compared with the version that was built on the SimJava simulation core. The earlier
version incurred an exponential growth in memory utilization for experiments with similar
configurations.
The next test was aimed at validating the correctness of the functionalities offered by Cloud-sim. The simulation environment consisted of a data center with 10,000 hosts, where each host was modeled to have a single CPU core (1200 MIPS), 4 GB of RAM, and 2 TB of storage. The provisioning policy for VMs was space-shared, which allowed only one VM to be active in a host at a given instant of time. We configured the end-user (through the DatacenterBroker) to request the creation and instantiation of 50 VMs that had the following constraints: 1024 MB of physical memory, 1 CPU core, and 1 GB of storage. The application granularity was modeled to be composed of 300 task units, with each task unit requiring 1,440,000 million instructions (20 min in the simulated hosts) to be executed on a host. Since networking was not the focus of this study, only a minimal data transfer (300 kB) overhead was considered for the task units (to and from the data center).
After the creation of VMs, task units were submitted in small groups of 50 (one for each VM) at an inter-arrival delay of 10 min. The VMs were configured to apply both space-shared and time-shared policies for provisioning task units to the processing cores. Figures 17(a) and (b) present the task units' progress status with the increase in simulation steps (time) for both provisioning policies (space-shared and time-shared). As expected, in the space-shared mode, every task took 20 min for completion, as each had dedicated access to the processing core. In space-shared mode, the arrival of new tasks did not have any effect on the tasks under execution. Every new task was simply queued for future consideration. However, in the time-shared mode, the execution time of each task varied with an increase in the number of submitted task units. The time-shared policy for allocating task units to VMs had a significant effect on execution times, as the processing core was massively context-switched among the active tasks. The first group of 50 tasks had a slightly better response time as compared with the later groups. The primary cause for this was that the task units in the later groups had to deal with a comparatively over-loaded system (VMs). However, towards the end of the simulation, as the system became less loaded, the response times improved (see Figure 17). These are the expected behaviors for both policies considering the experiment input. Hence, the results showed that the policies and components of Cloud-sim are correctly implemented.

9.2. Evaluating federated cloud computing components


The next set of experiments aimed at testing Cloud-sim's components that form the basis for modeling and simulation of a federated network of clouds (private, public, or both). To this end, a test environment that modeled a federation of three Cloud providers and an end-user (DataCenterBroker) was created. Every provider also instantiated a Sensor component, which was responsible for dynamically sensing the availability of information related to the data center hosts. Next, the sensed statistics were reported to the CloudCoordinator, which utilized this information in undertaking load-migration decisions. We evaluated a straightforward load-migration policy that performed online migration of VMs across federated cloud providers in case the origin provider did not have the requested number of free VM slots available. To summarize, the migration process involved the following steps: (i) creating a VM instance that had the same configuration as the original VM and was also compliant with the destination provider's configuration; and (ii) migrating the Cloudlets assigned to the original VM to the newly instantiated VM. The federated network of Cloud providers was created based on the topology shown in Figure 18. Every Cloud-based data center in the federated network was modeled to have 50 computing hosts, 10 GB of memory, 2 TB of storage, 1 processor with 1000 MIPS of capacity, and a time-shared VM scheduler. The DataCenterBroker, on behalf of the users, requested instantiation of a VM
that required 256 MB of memory, 1 GB of storage, 1 CPU, and a time-shared Cloudlet scheduler.

Figure 18. A network topology of federated data centers.

Table 2. Performance results.
---------------------------------------------------------------------------
Performance metrics                With federation      Without federation
---------------------------------------------------------------------------
Average turn-around time (s)       2221.13              4700.1
Makespan (s)                       6613.1               8405
---------------------------------------------------------------------------
The broker requested the instantiation of 25 VMs and associated a Cloudlet with each VM, where they were to be hosted. These requests originated at Datacenter 0. The length of each Cloudlet was set to 1,800,000 MI. Further, the simulation experiments were conducted under the following system configurations and load-migration scenarios: (i) in the first setup, a federated network of clouds was available, so that data centers were able to cope with high demands by migrating the excess load to the least-loaded ones; and (ii) in the second setup, the data centers were modeled as independent entities (not being part of any federation), and all the workload submitted to a data center had to be processed and executed locally. Table 2 shows the average turn-around time for each Cloudlet and the overall makespan of the end-user application in both cases. An end-user application consisted of one or more Cloudlets that had sequential dependencies. The simulation results revealed that the availability of a federated infrastructure of clouds reduces the average turn-around time by more than 50%, while improving the makespan by 20%. This shows that, even for a very simple load-migration policy, a federated Cloud resource pool brings significant benefits to end-users in terms of application performance.

9.3 Case study: Hybrid cloud provisioning strategy


In this section, a more complete experiment that also captured the networking behavior (latencies) between clouds is presented. This experiment showed that the adoption of a hybrid public/private Cloud computing environment could improve the productivity of a company. With this model, companies can dynamically expand their system capacity by leasing resources from public clouds at a reasonable cost.
The simulation scenario models a network of a private and a public cloud (the Amazon EC2 cloud). The public and the private clouds were modeled to have two distinct data centers. A CloudCoordinator in the private data center received the users' applications and processed (queued, executed) them on an FCFS basis. To evaluate the effectiveness of a hybrid cloud in speeding
up task execution, two test scenarios were simulated: in the first scenario, all the workload was processed locally within the private cloud. In the second scenario, the workload (tasks) could be migrated to the public cloud in case the private cloud resources (hosts, VMs) were busy or unavailable. In other words, the second scenario simulated a Cloud-Burst by integrating the local private cloud with the public cloud for handling peaks in service demands. Before a task could be submitted to the public cloud (Amazon EC2), the first requirement was to load and instantiate the VM images at the destination. The number of images instantiated in the public cloud was varied from 10 to 100% of the number of hosts available in the private cloud. Task units were allocated to the VMs in the space-shared mode. Every time a task finished, the freed VM was allocated to the next waiting task. Once the waiting queue ran out of tasks, or once all tasks had been processed, all the VMs in the public cloud were destroyed by the CloudCoordinator.
The private cloud hosted approximately 100 machines. Each machine had 2 GB of RAM, 10 TB of storage, and one CPU running at 1000 MIPS. The VMs created in the public cloud were based on an Amazon small instance (1.7 GB of memory, 1 virtual core, and 160 GB of instance storage). We considered in this evaluation that the virtual core of a small instance has the same processing power as a local machine.
The workload sent to the private cloud was composed of 10,000 tasks. Each task required between 20 and 22 min of processor time. The processing times were randomly generated based on the normal distribution. Each of the 10,000 tasks was submitted at the same time to the private cloud.

Table 3. Cost and performance of several public/private cloud strategies.

Table 3 shows the makespan of the tasks that was achieved for different combinations of private and public cloud resources. In the third column of the table, we quantify the overall cost of the services. The pricing policy was designed based on Amazon's small-instance business model (US$ 0.10 per instance per hour), which means that the cost per instance is charged by the started hour. Thus, if an instance runs for 1 h and 1 min, the amount for 2 h (US$ 0.20) will be charged. As expected, with an increase in the size of the resource pool that was available for task provisioning, the overall makespan of the tasks reduced. However, the cost associated with the processing also increased with an increase in the percentage of public cloud resources. Nevertheless, we found that the increased cost offered significant gains in terms of improved makespan. Overall,
it was always cheaper to rent resources from public clouds for handling sudden peaks in
demands as compared with buying or installing private infrastructures.
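The hourly-rounded charging rule used above amounts to taking the ceiling of the runtime in hours, as in the small sketch below (the price and runtime are the illustrative figures from the text).

public class InstancePricing {

    /** Cost of one instance, charged per started hour. */
    public static double instanceCost(double runtimeSeconds, double pricePerHour) {
        long billedHours = (long) Math.ceil(runtimeSeconds / 3600.0);
        return billedHours * pricePerHour;
    }

    public static void main(String[] args) {
        // 1 hour and 1 minute at US$ 0.10 per instance-hour is billed as 2 hours: US$ 0.20.
        System.out.println(instanceCost(3660, 0.10));
    }
}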

9.4. Case study: Energy-conscious management of data center


In order to test the capability of Cloud-sim for modeling and simulation of energy-conscious VM provisioning techniques, we designed the following experiment setup. The simulation environment included a Cloud-based data center that had 100 hosts. These hosts were modeled to have one CPU core (1000 MIPS), 2 GB of RAM, and 1 TB of storage. The workload model for this evaluation included provisioning requests for 400 VMs, with each request demanding 1 CPU core (250 MIPS), 256 MB of RAM, and 1 GB of storage. Each VM hosted a web-hosting application service, whose CPU utilization distribution was generated according to the uniform distribution. Each instance of the web-hosting service required 150,000 MI, or about 10 min to complete execution assuming 100% utilization. The energy-conscious model was implemented with the assumption that power consumption is the sum of a static part, which is constant for a switched-on host, and a dynamic component, which is a linear function of utilization [21]. Initially, VMs were allocated according to the requested parameters (4 VMs on each host). The Cloud computing architecture (see Figure 19) that we considered for studying energy-conscious resource management techniques/policies included a data center, a CloudCoordinator, and a Sensor component. The CloudCoordinator and Sensor performed the usual roles as described in the earlier sections. Via the attached Sensors (which are connected to every host), the CloudCoordinator was able to periodically monitor the performance status of active VMs, such as load conditions and processing share. This real-time information was passed to the VMM, which used it for performing appropriate resizing of VMs and application of DVFS and soft scaling.

The CloudCoordinator continuously adapts the allocation of VMs by issuing VM migration commands and changing the power states of nodes according to its policy and the current utilization of resources.
In this experiment, we compared the performance of two energy-conscious resource management techniques against a trivial benchmark technique, which did not consider energy optimization during the provisioning of VMs to hosts. In the benchmark technique, the processors were allowed to run at maximum frequency (i.e. consume maximum electrical power); in other words, they
operated at the highest possible processing capacity (100%).

Figure 19. Architecture diagram: (1) data about resource utilization; (2) commands for migration of VMs and adjusting of power states; and (3) VM resizing, scheduling, and migration actions.

Figure 20. Experimental results: (a) total energy consumption by the system; (b) number of VM migrations; (c) number of SLA violations; and (d) average SLA violation.

The first energy-conscious
technique was DVFS enabled, which means that the VMs were resized during the simulation based on the dynamics of the host's CPU utilization. It was assumed that the voltage and frequency of the CPU were adjusted linearly. The second energy-conscious technique was an extension of the DVFS policy; it applied live migration of VMs every 5 s for adapting the allocation. The basic idea here was to consolidate VMs on a minimal number of nodes and turn off idle ones in order to minimize power consumption. For mapping VMs to hosts, a greedy algorithm was applied that sorted VMs in decreasing order of their CPU utilization and allocated them to hosts in a first-fit manner. VMs were migrated to another host if that optimized energy consumption. To avoid SLA violations, the VMs were packed onto the hosts in such a way that the host utilization was kept below a pre-defined utilization threshold. This threshold value was varied over a distribution during the simulation for investigating its effect on the behavior of the system. The simulation was repeated 10 times; the mean values of the results that we obtained are presented in Figure 20. The results showed that energy-conscious techniques can significantly reduce the total power consumption of data center hosts (up to 50%) as compared with the benchmark technique. However, these are only indicative results; the actual performance of energy-conscious techniques directly depends on the type of application service being hosted in the cloud. There is much scope in this area for developing application-specific energy optimization techniques. With the growth of the utilization threshold, the energy consumption decreases because VMs can be consolidated more aggressively. This also leads to a decrease in the number of VM migrations
and an increase in the number of SLA violations. This simple case study showed how Cloud-
sim can be used to simulate different kinds of energy-conscious resource management
techniques/policies.
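The greedy mapping used by the second technique can be sketched as a first-fit-decreasing packing, as shown below. The VmDemand and HostBin types are hypothetical stand-ins introduced only for this illustration; in the actual experiments the placement operates on Cloud-sim VM and host objects.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class VmDemand {
    final int id;
    final double cpuUtilization; // fraction of one host's capacity (0..1)

    VmDemand(int id, double cpuUtilization) {
        this.id = id;
        this.cpuUtilization = cpuUtilization;
    }
}

class HostBin {
    final List<VmDemand> vms = new ArrayList<VmDemand>();
    double load = 0.0;
}

public class FirstFitDecreasingPlacement {

    /** Packs VMs onto as few hosts as possible while keeping every host below the threshold. */
    public static List<HostBin> place(List<VmDemand> vms, double utilizationThreshold) {
        List<VmDemand> sorted = new ArrayList<VmDemand>(vms);
        sorted.sort(Comparator.comparingDouble((VmDemand v) -> v.cpuUtilization).reversed());

        List<HostBin> hosts = new ArrayList<HostBin>();
        for (VmDemand vm : sorted) {
            HostBin target = null;
            for (HostBin host : hosts) {                 // first fit
                if (host.load + vm.cpuUtilization <= utilizationThreshold) {
                    target = host;
                    break;
                }
            }
            if (target == null) {                        // open a new host
                target = new HostBin();
                hosts.add(target);
            }
            target.vms.add(vm);
            target.load += vm.cpuUtilization;
        }
        return hosts;
    }
}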

REFERENCES:
[1] T. Surcel and F. Alecu, Applications of Cloud Computing, In International Conference
of Science and Technology in the Context of the Sustainable Development, pp. 177-180,
2008.
[2] M. D. Dikaiakos, D. Katsaros, P. Mehra, G. Pallis and A. Vakali, Cloud computing:
Distributed Internet computing for IT and scientific research, Internet Computing, IEEE,
13(5), 10-13, 2009.
[3] Sumit Goyal, Perils of cloud based enterprise resource planning, Advances in Asian
Social Science, 3(4), 880-881, 2013.
[4] G. Lewis, Basics about cloud computing, Software Engineering Institute Carniege
Mellon University, Pittsburgh, 2010.
[5] A. Beloglazov, Energy-Efficient Management of Virtual Machines in Data Centers for
Cloud Computing, PhD Thesis, 2013.
[6] I. Foster, Y. Zhao, I. Raicu and S. Lu, Cloud computing and grid computing 360-degree
compared, In: IEEE Grid Computing Environments Workshop, pp.1-10, November, 2008.
[7] Z. Liu, H.S. Lallie and L. Liu, A Hash-based Secure Interface on Plain Connection, In: Proceedings of CHINACOM. ICST.OTG & IEEE Press, Harbin, China, 2011.
[8] A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee and I. Stoica, Above the
clouds: A Berkeley view of cloud computing, Department of Electrical Engineering and
Computer Sciences, University of California, Berkeley, Rep. UCB/EECS, 28, 2009.
[9] P. Mell and T. Grance, The NIST definition of cloud computing (draft), NIST special
publication, 800(145), 7, 2011.
[10] A. Stevens, When hybrid clouds are a mixed blessing, The Register, June 29, 2011.
[11] S. Roschke, F. Cheng and C. Meinel, Intrusion Detection in the Cloud, In: Eighth
IEEE International Conference on Dependable, Autonomic and Secure Computing, 2009.
[12] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg and I. Brandic, Cloud computing and
emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility,
Future Generation computer systems, 25(6), 599-616, 2009.
[13] (2013) Open Cloud Manifesto. [Online]. Available:
http://www.opencloudmanifesto.org/.
[14] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. H. Katz, A. Konwinski, G. Lee, D. A. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, Above the clouds: A Berkeley view of cloud computing, Electr. Eng. Comput. Sci. Dept., Univ. California, Berkeley, CA, Tech. Rep. UCB/EECS-2009-28, February 2009.
[15] L. M. Vaquero, L. Rodero-Merino, J. Caceres, and M. Lindner, A break in the clouds: Towards a cloud definition, SIGCOMM Comput. Commun. Rev., 39(1), 50-55, 2009.
[16] D. Kondo, B. Javadi, P. Malecot, F. Cappello, and D. P. Anderson, Cost-benefit analysis of cloud computing versus desktop grids, In: Proceedings of IEEE International Symposium on Parallel and Distributed Processing, Rome, Italy, May 2009.
[17] R. Buyya, C. S. Yeo, and S. Venugopal, Market-oriented cloud computing: Vision, hype, and reality for delivering IT services as computing utilities, In: Proceedings of 10th IEEE International Conference on High Performance Computing and Communication, Dalian, China, Sep. 2008, pp. 5-13.
[18] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer and D. Epema, A
performance analysis of EC2 cloud computing services for scientific computing, In: Cloud
Computing (pp. 115-131). Springer Berlin Heidelberg, 2010.
[19] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff and D.
Zagorodnov, The eucalyptus open-source cloud-computing system, In: Proceedings of 9th
IEEE/ACM International Symposium on Cluster Computing and the Grid, (pp. 124-131),
May 2009.
[20] T. Velte, A. Velte and R. Elsenpeter, Cloud computing, a practical approach, McGraw-Hill, Inc., 2009.
[21] R. Buyya, S. Pandey and C. Vecchiola, Cloudbus toolkit for market-oriented cloud computing, In: Cloud Computing (pp. 24-44). Springer Berlin Heidelberg, 2009.
[22] L. Youseff, M. Butrico and D. Da Silva, Toward a unified ontology of cloud
computing, In: Proceedings of IEEE Grid Computing Environments Workshop, (pp. 1-10),
November 2008.
[23] S. Pearson, Y. Shen and M. Mowbray, A privacy manager for cloud computing, In:
Cloud
Computing (pp. 90-106). Springer Berlin Heidelberg, 2009.
[24] M.A. Vouk, Cloud computing: issues, research and implementations, Journal of Computing and Information Technology, 16(4), 235-246, 2004.
[25] Q. Zhang, L. Cheng and R. Boutaba, Cloud computing: state-of-the-art and research
challenges, Journal of Internet Services and Applications, 1(1), 7-18, 2010.

[26] L. Wang, J. Tao, M. Kunze, A.C. Castellanos, D. Kramer and W. Karl, Scientific
cloud computing: Early definition and experience, In: Proceedings of 10th IEEE
International Conference on High Performance Computing and Communications, (pp. 825-
830), September 2008.
[27] H. Brian, T. Brunschwiler, H. Dill, H. Christ, B. Falsafi, M. Fischer and M. Zollinger,
Cloud computing, Communications of the ACM, 51(7), 9-11, 2008.
[28] R.L. Grossman, The case for cloud computing, IT professional, 11(2), 23-27, 2009.
[29] N. Leavitt, Is cloud computing really ready for prime time, Growth, 27(5), 2009.
[30] C. Wang, Q. Wang, K. Ren and W. Lou, Privacy-preserving public auditing for data
storage security in cloud computing, In: IEEE Proceedings of INFOCOM, (pp. 1-9), March
2010.
