
Automating Application Deployment in Infrastructure Clouds
Gideon Juve and Ewa Deelman
USC Information Sciences Institute
Marina del Rey, California, USA
{gideon,deelman}@isi.edu

Abstract—Cloud computing systems are becoming an important platform for distributed applications in science and engineering. Infrastructure as a Service (IaaS) clouds provide the capability to provision virtual machines (VMs) on demand with a specific configuration of hardware resources, but they do not provide functionality for managing resources once they are provisioned. In order for such clouds to be used effectively, tools need to be developed that can help users to deploy their applications in the cloud. In this paper we describe a system we have developed to provision, configure, and manage virtual machine deployments in the cloud. We also describe our experiences using the system to provision resources for scientific workflow applications, and identify areas for further research.

Keywords—cloud computing; provisioning; application deployment

I. INTRODUCTION

Infrastructure as a Service (IaaS) clouds are becoming an important platform for distributed applications. These clouds allow users to provision computational, storage and networking resources from commercial and academic resource providers. Unlike other distributed resource sharing solutions, such as grids, users of infrastructure clouds are given full control of the entire software environment in which their applications run. The benefits of this approach include support for legacy applications and the ability to customize the environment to suit the application. The drawbacks include increased complexity and additional effort required to set up and deploy the application.

Current infrastructure clouds provide interfaces for allocating individual virtual machines (VMs) with a desired configuration of CPU, memory, disk space, etc. However, these interfaces typically do not provide any features to help users deploy and configure their application once resources have been provisioned. In order to make use of infrastructure clouds, developers need software tools that can be used to configure dynamic execution environments in the cloud.

The execution environments required by distributed scientific applications, such as workflows and parallel programs, typically require a distributed storage system for sharing data between application tasks running on different nodes, and a resource manager for scheduling tasks onto nodes [12]. Fortunately, many such services have been developed for use in traditional HPC environments, such as clusters and grids. The challenge is how to deploy these services in the cloud given the dynamic nature of cloud environments. Unlike clouds, clusters and grids are static environments. A system administrator can set up the required services on a cluster and, with some maintenance, the cluster will be ready to run applications at any time. Clouds, on the other hand, are highly dynamic. Virtual machines provisioned from the cloud may be used to run applications for only a few hours at a time. In order to make efficient use of such an environment, tools are needed to automatically install, configure, and run distributed services in a repeatable way.

Deploying such applications is not a trivial task. It is usually not sufficient to simply develop a virtual machine (VM) image that runs the appropriate services when the virtual machine starts up, and then just deploy the image on several VMs in the cloud. Often the configuration of distributed services requires information about the nodes in the deployment that is not available until after the nodes are provisioned (such as IP addresses, host names, etc.) as well as parameters specified by the user. In addition, nodes often form a complex hierarchy of interdependent services that must be configured in the correct order. Although users can manually configure such complex deployments, doing so is time consuming and error prone, especially for deployments with a large number of nodes. Instead, we advocate an approach where the user is able to specify the layout of their application declaratively, and use a service to automatically provision, configure, and monitor the application deployment. The service should allow for the dynamic configuration of the deployment, so that a variety of services can be deployed based on the needs of the user. It should also be resilient to failures that occur during the provisioning process and allow for the dynamic addition and removal of nodes.

In this paper we describe and evaluate a system called Wrangler [10] that implements this functionality. Wrangler allows users to send a simple XML description of the desired deployment to a web service that manages the provisioning of virtual machines and the installation and configuration of software and services. It is capable of interfacing with many different resource providers in order to deploy applications across clouds, supports plugins that enable users to define custom behaviors for their application, and allows dependencies to be specified between nodes. Complex deployments can be created by composing several plugins that set up services, install and configure application software, download data, and monitor services, on several interdependent nodes.

The remainder of this paper is organized as follows. In the next section we describe the requirements for a cloud deployment service. In Section III we explain the design and operation of Wrangler. In Section IV we present an evaluation of the time required to deploy basic applications on several different cloud systems. Section V presents two real applications that were deployed in the cloud using Wrangler. Sections VI and VII describe related work and conclude the paper.

II. SYSTEM REQUIREMENTS

Based on our experience running science applications in the cloud [11,12], and our experience using the Context Broker from the Nimbus cloud management system [15], we have developed the following requirements for a deployment service:

• Automatic deployment of distributed applications. Distributed applications used in science and engineering research often require resources for short periods in order to complete a complex simulation, analyze a large dataset, or complete an experiment. This makes them ideal candidates for infrastructure clouds, which support on-demand provisioning of resources. These applications often require complex environments in which to run. Setting up these environments involves many steps that must be repeated each time the application is deployed. In order to minimize errors and save time, it is important that these steps are automated. A deployment service should enable a user to describe the nodes and services they require, and then automatically provision and configure the application on-demand. This process should be simple and repeatable.

• Complex dependencies. Distributed systems often consist of many services deployed across a collection of hosts. These services include batch schedulers, file systems, databases, web servers, caches, and others. Often, the services in a distributed application depend on one another for configuration values, such as IP addresses, host names, and port numbers. In order to deploy such an application, the nodes and services must be configured in the correct order according to their dependencies, which can be expressed as a directed acyclic graph. Some previous systems for constructing virtual clusters have assumed a fixed architecture consisting of a head node and a collection of worker nodes [17,31]. This severely limits the type of applications that can be deployed. A virtual cluster provisioning system should support complex application deployments.

• Dynamic provisioning. The resource requirements of distributed applications often change over time. For example, an e-commerce application may require more web servers during daylight hours, but fewer web servers at night. Similarly, a science application may require many worker nodes during the initial stages of a computation, but only a few nodes during the later stages. A deployment service should support dynamic provisioning by enabling the user to add and remove nodes from a deployment at runtime. This should be possible as long as the deployment's dependencies remain valid when the node is added or removed. This capability could be used along with elastic provisioning algorithms (e.g. [19]) to easily adapt deployments to the needs of an application at runtime.

• Multiple cloud providers. In the event that a single cloud provider is not able to supply sufficient resources for an application, or reliability concerns demand that an application is deployed across independent data centers, it may become necessary to provision resources from several cloud providers at the same time. This capability is known as federated cloud computing or sky computing [16]. A deployment service should support multiple resource providers with different provisioning interfaces, and should allow a single application to be deployed across multiple clouds.

• Monitoring. Long-running services may encounter problems that require user intervention. In order to detect these issues, it is important to continuously monitor the state of a deployment in order to check for problems. A deployment service should make it easy for users to specify tests that can be used to verify that a node is functioning properly. It should automatically run these tests and notify the user when errors occur.

In addition to these functional requirements, the system should exhibit other characteristics important to distributed systems, such as scalability, reliability, and usability.

III. ARCHITECTURE AND IMPLEMENTATION

We have developed a system called Wrangler to support the requirements outlined above. The components of the system are shown in Figure 1. They include: clients, the coordinator, and agents.

• Clients run on each user's machine and send requests to the coordinator to launch, query, and terminate deployments. Clients have the option of using a command-line tool, a Python API, or XML-RPC to interact with the coordinator.

• The coordinator is a web service that manages application deployments. It accepts requests from clients, provisions nodes from cloud providers, collects information about the state of a deployment, acts as an information broker to aid application configuration, and enables nodes to advertise values that can be queried to configure dependent nodes. The coordinator stores information about its deployments in an SQLite database.

• Agents run on each of the provisioned nodes to manage their configuration and monitor their health. The agent is responsible for collecting information about the node (such as its IP addresses and hostnames), reporting the state of the node to the coordinator, configuring the node with the software and services specified by the user, and monitoring the node for failures.

• Plugins are user-defined scripts that implement the behavior of a node. They are invoked by the agent to configure and monitor a node. Each node in a deployment can be configured with multiple plugins.

Figure 1: System architecture

A. Specifying Deployments

Users specify their deployment using a simple XML format. Each XML request document describes a deployment consisting of several nodes, which correspond to virtual machines. Each node has a provider that specifies the cloud resource provider to use for the node, and defines the characteristics of the virtual machine to be provisioned—including the VM image to use and the hardware resource type—as well as the authentication credentials required by the provider. Each node has one or more plugins, which define the services and functionality that should be implemented by the node. Plugins can have multiple parameters, which enable the user to configure the plugin, and are passed to the script when it is executed on the node. Nodes may be members of a named group, and each node may depend on zero or more other nodes or groups.

An example deployment is shown in Figure 2. The example describes a cluster of 4 nodes: 1 NFS server node, and 3 NFS client nodes. All nodes are to be provisioned from Amazon EC2, and different images and instance types are specified for the server and the clients. The server is configured with an "nfs_server.sh" plugin, which starts the required NFS services and exports the /mnt directory. The clients are configured with an "nfs_client.sh" plugin, which starts NFS services and mounts the server's /mnt directory as /nfs/data. The clients, which are identical, are specified as a single node with a "count" of three. The clients are part of a "clients" group, and depend on the server node, which ensures that the NFS file system exported by the server will be available for the clients to mount when they are configured. The "SERVER" parameter of the "nfs_client.sh" plugin contains a <ref> tag. This parameter is replaced with the IP address of the server node at runtime and used by the clients to mount the NFS file system.

<deployment>
  <node name="server">
    <provider name="amazon">
      <image>ami-912837</image>
      <instance-type>c1.xlarge</instance-type>
      ...
    </provider>
    <plugin script="nfs_server.sh">
      <param name="EXPORT">/mnt</param>
    </plugin>
  </node>
  <node name="client" count="3" group="clients">
    <provider name="amazon">
      <image>ami-901873</image>
      <instance-type>m1.small</instance-type>
      ...
    </provider>
    <plugin script="nfs_client.sh">
      <param name="SERVER">
        <ref node="server" attribute="local-ipv4"/>
      </param>
      <param name="PATH">/mnt</param>
      <param name="MOUNT">/nfs/data</param>
    </plugin>
    <depends node="server"/>
  </node>
</deployment>

Figure 2: Example request for a 4 node virtual cluster with a shared NFS file system

B. Deployment Process

Here we describe the process that Wrangler goes through to deploy an application, from the initial request, to termination.

Request. The client sends a request to the coordinator that includes the XML descriptions of all the nodes to be launched, as well as any plugins used. The request can create a new deployment, or add nodes to an existing deployment.

Provisioning. Upon receiving a request from a client, the coordinator first validates the request to ensure that there are no errors. It checks that the request is valid, that all parameters are specified, that dependencies can be resolved, and that no dependency cycles exist. Then it contacts the resource providers specified in the request and provisions the appropriate type and quantity of virtual machines. The coordinator is designed to support many different cloud providers. It currently supports Amazon EC2 [1], Eucalyptus [24], and OpenNebula [25]. Adding additional providers is designed to be relatively simple. The only functionalities that a cloud interface must provide are the ability to launch and terminate VMs, and the ability to pass custom contextualization data to a VM. In the event that network timeouts and other transient errors occur during provisioning, the coordinator automatically retries the request.
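To make the request path concrete, the following minimal Python sketch shows how a client could submit a deployment document to the coordinator over the XML-RPC interface mentioned above. The coordinator URL and the launch method name are hypothetical placeholders for illustration, not the documented Wrangler API.

# Minimal sketch of submitting a deployment request over XML-RPC.
# The coordinator URL and the "launch" method name are hypothetical;
# the actual client API may differ.
import xmlrpc.client

with open("deployment.xml") as f:        # an XML request such as the one in Figure 2
    request = f.read()

coordinator = xmlrpc.client.ServerProxy("https://coordinator.example.org:8443/")
deployment_id = coordinator.launch(request)   # hypothetical method name
print("Submitted deployment:", deployment_id)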

Startup and Registration. When the VM boots up, it starts the agent process. This requires the agent software to be pre-installed in the VM image. The advantage of this approach is that it offloads the majority of the configuration and monitoring tasks from the coordinator to the agent, which enables the coordinator to manage a larger set of nodes. The disadvantage is that it requires users to re-bundle images to include the agent software, which is not a simple task for many users and makes it more difficult to use off-the-shelf images. In the future we plan to investigate ways to install the agent at runtime to avoid this issue. When the agent starts, it uses a provider-specific adapter to retrieve the contextualization data passed by the coordinator, and to collect attributes about the node and its environment. The contextualization data includes: the ID assigned to the node by the coordinator, security credentials, and the host and port where the coordinator can be contacted. The attributes collected include: the public and private hostnames and IP addresses of the node, as well as any other relevant information available from the metadata service, such as the availability zone. Once the agent has retrieved this information, it is sent to the coordinator as part of a registration message, and the node's status is set to 'registered'.

Configuration. When the coordinator receives a registration message from a node, it checks to see if the node has any dependencies. If there are, it makes sure that they have registered and been configured. If they have not, then the coordinator waits until all dependencies have been configured before proceeding. If all of the node's dependencies have already been configured, then the coordinator sends a request to the agent to configure the node. Upon receiving this request, the agent contacts the coordinator to retrieve the list of plugins for the node. For each plugin, the agent downloads and invokes the associated plugin script with the user-specified parameters, resolving any <ref> parameters that may be present. If a plugin fails with a non-zero exit code, then the agent aborts the configuration process and reports the failure to the coordinator, at which point the user must intervene to correct the problem. If all plugins were successfully started, then the agent reports the node's status as 'configured' to the coordinator. Upon receiving a message that the node has been configured, the coordinator checks to see if there are any nodes that depend on the newly configured node. If there are, then the coordinator attempts to configure them as well. The configuration process is complete when all agents report to the coordinator that they are configured.

Monitoring. After a node has been configured, the agent periodically monitors the node by invoking all of the node's plugins with the status command. After checking all the plugins, a message is sent to the coordinator with updated attributes for the node. If any of the plugins report errors, then the error messages are sent to the coordinator and the node's status is set to 'failed'. The system does not assume anything about the network connectivity between nodes, so an application can be deployed across many clouds. The only requirement is that the coordinator can communicate with the agents and vice versa.

Termination. When the user is ready to terminate one or more nodes, they send a request to the coordinator. The request can specify a single node, several nodes, or an entire deployment. Upon receiving this request, the coordinator sends messages to the agents on all nodes to be terminated, and the agents send stop commands to all of their plugins. Once the plugins are stopped, the coordinator contacts the cloud provider to terminate the node(s).

C. Plugins

Plugins are user-defined scripts that implement the application-specific behaviors required of a node. They are the modular components of a deployment. There are many different types of plugins that can be created, such as service plugins that start daemon processes, application plugins that install software used by the application, configuration plugins that apply application-specific settings, data plugins that download and install application data, and monitoring plugins that validate the state of the node. Several plugins can be combined to define the behavior of a node. For example, NFS server and NFS client plugins can be combined with plugins for different batch schedulers, such as Condor [18], PBS [26], or Sun Grid Engine [8], to deploy many different types of compute clusters. This enables users to easily define, modify, and reuse custom plugins. We envision that there could be a repository for the most useful plugins.

Plugins are implemented as simple scripts that run on the nodes to perform all of the actions required by the application. They are typically shell, Perl, Python, or Ruby scripts, but can be any executable program that conforms to the required interface. Plugins are transferred from the client (or potentially a repository) to the coordinator when a node is provisioned, and from the coordinator to the agent when a node is configured. Well-designed plugins can be reused for many different applications.

Each plugin has two components: parameters and commands. Parameters are the configuration variables that can be used to customize the behavior of the plugin. They are specified in the XML request document described above. The agent passes parameters to the plugin as environment variables when the plugin is invoked. Commands are specific actions that must be performed by the plugin to implement the plugin lifecycle. This interface defines the interactions between the agent and the plugin. The agent passes commands to the plugin as arguments.
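As an illustration of this agent-plugin interface, the Python sketch below shows how an agent could invoke a plugin, passing parameters as environment variables and the lifecycle command as an argument. It is a simplified example under those assumptions, not Wrangler's actual implementation; the paths and parameter values are made up.

# Illustrative sketch of the agent-plugin interface described above: parameters
# become environment variables and the lifecycle command is the first argument.
# This is not Wrangler's actual code; the paths and values are examples only.
import os
import subprocess

def run_plugin(script, command, parameters):
    """Invoke a plugin script with a lifecycle command such as 'start'."""
    env = dict(os.environ)
    env.update(parameters)                  # e.g. {"SERVER": "10.0.0.5", "MOUNT": "/nfs/data"}
    result = subprocess.run([script, command], env=env)
    return result.returncode                # a non-zero exit code is reported as a failure

# Example: configure an NFS client node using the plugin from Figure 2
rc = run_plugin("./nfs_client.sh", "start",
                {"SERVER": "10.0.0.5", "PATH": "/mnt", "MOUNT": "/nfs/data"})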

There are three commands that tell the plugin what to do: start, stop, and status.

• The start command tells the plugin to perform the behavior requested by the user. It is invoked when the node is being configured. All plugins should implement this command.

• The stop command tells the plugin to stop any running services and clean up. This command is invoked before the node is terminated. Only plugins that must be shut down gracefully need to implement this command.

• The status command tells the plugin to check the state of the node for errors. This command can be used, for example, to verify that a service started by the plugin is running. Only plugins that need to monitor the state of the node or long-running services need to implement this command. The status command can also be used to periodically update the attributes advertised by the node, and to query and respond to attributes updated by other nodes.

The plugin can advertise node attributes by writing key=value pairs to a file specified by the agent in an environment variable. These attributes are merged with the node's existing attributes and can be queried by other nodes in the virtual cluster using <ref> tags or a command-line tool. For example, an NFS server node can advertise the address and path of an exported file system that NFS client nodes can use to mount the file system. If at any time the plugin exits with a non-zero exit code, then the node's status is set to failed. Upon failure, the output of the plugin is collected and sent to the coordinator to simplify debugging and error diagnosis.

A basic plugin for Condor worker nodes is shown in Figure 3. This plugin generates a configuration file and starts the condor_master process when it receives the start command, kills the condor_master process when it receives the stop command, and checks to make sure that the condor_master process is running when it receives the status command.

#!/bin/bash -e
PIDFILE=/var/run/condor/master.pid
SBIN=/usr/local/condor/sbin

if [ "$1" == "start" ]; then
    if [ "$CONDOR_HOST" == "" ]; then
        echo "CONDOR_HOST not specified"
        exit 1
    fi
    cat > /etc/condor/condor_config.local <<END
CONDOR_HOST = $CONDOR_HOST
END
    $SBIN/condor_master -pidfile $PIDFILE
elif [ "$1" == "stop" ]; then
    kill -QUIT $(cat $PIDFILE)
elif [ "$1" == "status" ]; then
    kill -0 $(cat $PIDFILE)
fi

Figure 3: Example plugin used for Condor workers
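Because plugins may also be written in Python, Perl, or Ruby, the hypothetical Python plugin below sketches the same start/stop/status lifecycle and additionally advertises an attribute. The daemon path and the WRANGLER_ATTRIBUTES variable name are assumptions made for illustration; the paper only states that the agent names the attribute file in an environment variable.

#!/usr/bin/env python
# Hypothetical plugin written in Python illustrating the start/stop/status
# lifecycle and attribute advertising. The daemon path and the name of the
# environment variable holding the attribute file are assumptions, not taken
# from the Wrangler documentation.
import os
import subprocess
import sys

PIDFILE = "/var/run/mydaemon.pid"

def advertise(key, value):
    # The agent supplies the attribute file path in an environment variable;
    # the variable name used here is hypothetical.
    attr_file = os.environ.get("WRANGLER_ATTRIBUTES")
    if attr_file:
        with open(attr_file, "a") as f:
            f.write("%s=%s\n" % (key, value))   # key=value pairs merged by the agent

command = sys.argv[1]
if command == "start":
    subprocess.check_call(["/usr/sbin/mydaemon", "--pidfile", PIDFILE])
    advertise("EXPORT", os.environ.get("EXPORT", "/mnt"))  # value other nodes can <ref>
elif command == "stop":
    os.kill(int(open(PIDFILE).read()), 15)      # SIGTERM
elif command == "status":
    os.kill(int(open(PIDFILE).read()), 0)       # raises (non-zero exit) if not running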

D. Dependencies and Groups

Dependencies ensure that nodes are configured in the correct order so that services and attributes published by one node can be used by another node. When a dependency exists between two nodes, the dependent node will not be configured until the other node has been configured. Dependencies are valid as long as they do not form a cycle that would prevent the application from being deployed.

Applications that deploy sets of nodes to perform a collective service, such as parallel file systems and distributed caches, can be configured using named groups. Groups are used for two purposes. First, a node can depend on several nodes at once by specifying that it depends on the group. This is simpler than specifying dependencies between the node and each member of the group. Nodes that depend on a group are not configured until all of the nodes in the group have been configured. Second, groups that depend on themselves form co-dependent groups. Nodes in a co-dependent group are not configured until all members of the group have registered. This ensures that the basic attributes of the nodes that are collected during registration, such as IP addresses, are available to all group members during configuration. Co-dependent groups enable a limited form of cyclic dependencies and are useful for deploying peer-to-peer systems and parallel file systems that require each node implementing the service to be aware of all the others. These types of groups are useful for services such as Memcached clusters, where the clients need to know the addresses of each of the Memcached nodes; the group breaks the deadlock that would otherwise occur with a cyclic dependency.

E. Security

Wrangler uses SSL for secure communications between all components of the system. Authentication of clients is accomplished using a username and password. Authentication of agents is done using a random key that is generated by the coordinator for each node. This authentication mechanism assumes that the cloud provider's provisioning service provides the capability to securely transmit the agent's key to each VM during provisioning.

IV. EVALUATION

The performance of Wrangler is primarily a function of the time it takes for the underlying cloud management system to start the VMs, and for all nodes to register with the coordinator. Wrangler adds to this a relatively small amount of time for nodes to register and be configured in the correct order. With that in mind, we conducted a few basic experiments to determine the overhead of deploying applications using Wrangler.

We conducted experiments on three separate clouds: Amazon EC2, NERSC's Magellan cloud [22], and FutureGrid's Sierra cloud [7]. EC2 uses a proprietary cloud management system, while Magellan and Sierra both use the Eucalyptus cloud management system [24]. We used identical CentOS 5.5 VM images, and the m1.large instance type, on all three clouds.

A. Deployment with no plugins

The first experiment we performed was provisioning a simple vanilla cluster with no plugins. This experiment measures the time required to provision N nodes from a single provider.

Table I: Mean provisioning time for a simple deployment with no plugins (2, 4, 8, and 16 nodes on Amazon, Magellan, and Sierra).

The results of this experiment are shown in Table I. In most cases we observe that the provisioning time for a virtual cluster is comparable to the time required to provision one VM, which we measured to be 55.4 sec on EC2, 104.9 sec on Magellan, and 428.7 sec (std. dev. 88.1) on Sierra. For larger clusters we observe that the provisioning time is up to twice the maximum observed for one VM. This is a result of two factors. First, the nodes for each cluster were provisioned in serial, which added 1-2 seconds onto the total provisioning time for each node. In the future we plan to investigate ways to provision VMs in parallel to reduce this overhead. Second, on Magellan and Sierra there were several outlier VMs that took much longer than expected to start, possibly due to the increased load on the provider's network and services caused by the larger number of simultaneous requests. Note that we were not able to collect data for Sierra with 16 nodes because the failure rate on Sierra while running these experiments was about 8%, which virtually guaranteed that at least 1 out of every 16 VMs failed.

B. Deployment for workflow applications

In the next experiment we again launch a deployment using Wrangler, but this time we add plugins for the Pegasus workflow management system [6], DAGMan [5], Condor [18], and NFS to create an environment that is similar to what we have used for executing real workflow applications in the cloud [12]. This deployment sets up a Condor pool with a shared NFS file system and installs application binaries on each worker node. The deployment consists of a master node that manages the workflow and stores data, and N worker nodes that execute workflow tasks, as shown in Figure 5.

Figure 5: Deployment used for workflow applications.

Table II: Provisioning time for a deployment used for workflow applications (2, 4, 8, and 16 nodes on Amazon, Magellan, and Sierra).

The results of this experiment are shown in Table II. Comparing Table I and Table II, we can see that it takes on the order of 1-2 minutes for Wrangler to run all the plugins once the nodes have registered. The majority of this time is spent downloading and installing software, and waiting for all N NFS clients to successfully mount the shared file system.

V. EXAMPLE APPLICATIONS

In this section we describe our experience using Wrangler to deploy scientific workflow applications. Although these applications are scientific workflows, other applications, such as web applications, peer-to-peer systems, and distributed databases, could be deployed as easily.

A. Data Storage Study

Many workflow applications require shared storage systems in order to communicate data products among nodes in a compute cluster. Recently we conducted a study [12] that evaluated several different storage configurations that can be used to share data for workflows on Amazon EC2. This study required us to deploy workflows using four parallel storage systems (Amazon S3, NFS, GlusterFS, and PVFS) in six different configurations using three different applications and four cluster sizes—a total of 72 different combinations. Due to the large number of experiments required, and the complexity of the configurations, it was not possible to deploy the environments manually. Using Wrangler we were able to create automatic, repeatable deployments by composing plugins in different combinations to complete the study.

The deployments used in the study were similar to the one shown in Figure 4. The deployment consists of three tiers: a master node using a Condor Master plugin, N worker nodes with Condor Worker and file system client plugins, and N file system nodes with a file system peer plugin. The file system nodes form a group so that worker nodes will be configured after the file system is ready. This example illustrates how Wrangler can be used to set up experiments for distributed systems research.

Figure 4: Deployment used in the data storage study.

B. Periodograms

Kepler [21] is a NASA satellite that uses high-precision photometry to detect planets outside our solar system. The Kepler mission periodically releases time-series datasets of star brightness called light curves. Analyzing these light curves to find new planets requires the calculation of periodograms, which identify the periodic dimming caused by a planet as it orbits its star. Generating periodograms for the hundreds of thousands of light curves that have been released by the Kepler mission is a computationally intensive job that demands high-throughput distributed computing. In order to manage these computations we developed a workflow using the Pegasus workflow management system [6].

We deployed this application across the Amazon EC2, FutureGrid Sierra, and NERSC Magellan clouds using Wrangler. The deployment configuration is illustrated in Figure 6. In this deployment, a master node running outside the cloud manages the workflow, and worker nodes running in the three cloud sites execute workflow tasks. The deployment used several different plugins to set up and configure the software on the worker nodes, including a Condor Worker plugin to deploy and configure Condor, and a Periodograms plugin to install application binaries. This application successfully demonstrated Wrangler's ability to deploy complex applications across multiple cloud providers.

Figure 6: Deployment used to execute periodograms.

VI. RELATED WORK

Our system is similar to the Nimbus Context Broker (NCB) [14] used with the Nimbus cloud computing system [15]. NCB supports roles, which are similar to Wrangler plugins with the exception that NCB roles must be installed in the VM image and cannot be defined by the user when the application is deployed. In addition, our system is designed to support multiple cloud providers, while NCB works best with Nimbus-based clouds.

Recently, other groups are recognizing the need for deployment services, and are developing similar solutions. One example is cloudinit.d [2], which enables users to deploy and monitor interdependent services in the cloud. Cloudinit.d services are similar to Wrangler plugins, but each node in cloudinit.d can have only one service, while Wrangler enables users to compose several, modular plugins to define the behavior of a node.

Configuring compute clusters is a well-known systems administration problem. In the past many cluster management systems have been developed to enable system administrators to easily install and maintain high-performance computing clusters [3,9,32,34]. Of these, Rocks [28] is perhaps the most well known example. These systems assume that the cluster is deployed on physical machines that are owned and controlled by the user, however, and do not support virtual machines provisioned from cloud providers.

Constructing clusters on top of virtual machines has been explored by several previous research efforts. These include VMPlants [17], StarCluster [31], and others [20,23]. These systems typically assume a fixed architecture that consists of a head node and N worker nodes. They also typically support only a single type of cluster software, such as SGE, Condor, or Globus, among others. In contrast, our approach supports complex application architectures consisting of many interdependent nodes and custom, user-defined plugins.

Many different configuration management and policy engines have been developed for UNIX systems. Cfengine [4], Puppet [13], and Chef [27] are a few well-known examples. Configuration management deals with the problem of maintaining a known, consistent state across many hosts in a distributed environment. Our approach is similar to these systems in that configuration is one of its primary concerns; the other concerns of this work, however, are not addressed by configuration management systems. Our approach can be seen as complementary to these systems in the sense that one could easily create a Wrangler plugin that installs a configuration management system on the nodes in a deployment, and allow that system to manage node configuration.

This work is related to virtual appliances [30] in that we are interested in deploying application services in the cloud. Our research is complementary to that of the virtual appliances community as well, in that we are interested in deploying appliances for distributed applications.

VII. CONCLUSION

The rapidly-developing field of cloud computing offers new opportunities for distributed applications. The unique features of cloud computing, such as on-demand provisioning, virtualization, and elasticity, as well as the emergence of commercial cloud providers, are changing the way we think about deploying and executing distributed applications. Existing infrastructure clouds support the deployment of isolated virtual machines, but do not provide functionality to deploy and configure software, monitor running VMs, or detect and respond to failures. In order to take advantage of cloud resources, new provisioning tools need to be developed to assist users with these tasks.

In this paper we presented the design and implementation of a system used for automatically deploying distributed applications on infrastructure clouds. The system interfaces with several different cloud resource providers to provision virtual machines, coordinates the configuration and initiation of services to support distributed applications, and monitors applications over time.

We have been using Wrangler since May 2010 to provision virtual clusters for scientific workflow applications on Amazon EC2, the Magellan cloud at NERSC, the Sierra and India clouds on the FutureGrid, and the Skynet cloud at ISI. We have used these virtual clusters to run several hundred workflows for applications in astronomy, earth science, and bioinformatics.

So far we have found that Wrangler makes deploying complex, distributed applications in the cloud easy, but we have encountered some issues in using it that we plan to address in the future. Currently, Wrangler assumes that users can respond to failures manually. In practice this has been a problem because users often leave virtual clusters running unattended for long periods. In the future we plan to investigate solutions for automatically handling failures by re-provisioning failed nodes, and by implementing mechanisms to fail gracefully or provide degraded service when re-provisioning is not possible. We also plan to develop techniques for re-configuring deployments, and for dynamically scaling deployments in response to application demand. There is still much work to be done in investigating the best way to manage cloud environments.

ACKNOWLEDGEMENTS

This work was sponsored by the National Science Foundation (NSF) under award OCI-0943725. This research makes use of resources supported in part by the NSF under grant 091812 (FutureGrid), and resources of the National Energy Research Scientific Computing Center (Magellan).

REFERENCES

[1] Amazon.com, "Elastic Compute Cloud (EC2)," http://aws.amazon.com/ec2.
[2] J. Bresnahan, D. LaBissoniere, T. Freeman, and K. Keahey, "Managing Appliance Launches in Infrastructure Clouds," TeraGrid Conference, 2011.
[3] M.J. Brim, T.G. Mattson, and S.L. Scott, "OSCAR: Open Source Cluster Application Resources," Ottawa Linux Symposium, 2001.
[4] M. Burgess, "A site configuration engine," USENIX Computing Systems, vol. 8, 1995.
[5] DAGMan, http://cs.wisc.edu/condor/dagman.
[6] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G.B. Berriman, J. Good, A. Laity, J.C. Jacob, and D.S. Katz, "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Scientific Programming, vol. 13, no. 3, pp. 219-237, 2005.
[7] FutureGrid, http://futuregrid.org/.
[8] W. Gentzsch, "Sun Grid Engine: towards creating a compute power grid," 1st IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid '01), 2001.
[9] Infiniscale, Perceus/Warewulf, http://www.perceus.org/.
[10] G. Juve and E. Deelman, "Wrangler: Virtual Cluster Provisioning for the Cloud," 20th International Symposium on High Performance Distributed Computing (HPDC), 2011.
[11] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B.P. Berman, and P. Maechling, "Scientific Workflow Applications on Amazon EC2," Workshop on Cloud-based Services and Applications in conjunction with 5th IEEE International Conference on e-Science (e-Science 2009), 2009.
[12] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B.P. Berman, and P. Maechling, "Data Sharing Options for Scientific Workflows on Amazon EC2," 2010 ACM/IEEE Conference on Supercomputing (SC 10), 2010.
[13] L. Kanies, "Puppet: Next Generation Configuration Management," ;login:, vol. 31, no. 1, 2006.
[14] K. Keahey and T. Freeman, "Contextualization: Providing One-Click Virtual Clusters," 4th International Conference on e-Science (e-Science 08), 2008.
[15] K. Keahey, R. Figueiredo, J. Fortes, T. Freeman, and M. Tsugawa, "Science clouds: Early experiences in cloud computing for scientific applications," Cloud Computing and Its Applications, 2008.
[16] K. Keahey, M. Tsugawa, A. Matsunaga, and J. Fortes, "Sky Computing," IEEE Internet Computing, vol. 13, no. 5, pp. 43-51, 2009.
[17] I. Krsul, A. Ganguly, J. Zhang, J.A.B. Fortes, and R.J. Figueiredo, "VMPlants: Providing and Managing Virtual Machine Execution Environments for Grid Computing," 2004 ACM/IEEE Conference on Supercomputing (SC 04), 2004.
[18] M. Litzkow, M. Livny, and M. Mutka, "Condor: A Hunter of Idle Workstations," 8th International Conference of Distributed Computing Systems, 1988.
[19] P. Marshall, K. Keahey, and T. Freeman, "Elastic Site: Using Clouds to Elastically Extend Site Resources," 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010), 2010.
[20] M. Murphy, B. Kagey, M. Fenn, and S. Goasguen, "Dynamic Provisioning of Virtual Organization Clusters," 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 09), 2009.
[21] NASA, Kepler, http://kepler.nasa.gov/.
[22] NERSC, Magellan, http://magellan.nersc.gov.
[23] H. Nishimura, N. Maruyama, and S. Matsuoka, "Virtual Clusters on the Fly - Fast, Scalable, and Flexible Installation," 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 07), 2007.
[24] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus Open-source Cloud-computing System," 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 09), 2009.
[25] OpenNebula, http://www.opennebula.org.
[26] OpenPBS, http://www.openpbs.org/.
[27] Opscode, Chef, http://www.opscode.com/chef.
[28] P.M. Papadopoulos, M.J. Katz, and G. Bruno, "NPACI Rocks: tools and techniques for easily deploying manageable Linux clusters," Concurrency and Computation: Practice and Experience, vol. 15, no. 7-8, pp. 707-725, 2003.
[29] Penguin Computing, Scyld ClusterWare, http://www.penguincomputing.com/software/scyld_clusterware.
[30] C. Sapuntzakis, D. Brumley, R. Chandra, N. Zeldovich, J. Chow, M.S. Lam, and M. Rosenblum, "Virtual Appliances for Deploying and Maintaining Software," 17th USENIX Conference on System Administration, 2003.
[31] StarCluster, http://web.mit.edu/stardev/cluster/.
[32] P. Uthayopas, S. Paisitbenchapol, T. Angskun, and J. Maneesilp, "System management framework and tools for Beowulf cluster," 4th International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, 2000.
[33] J.-S. Vöckler, G. Juve, E. Deelman, M. Rynge, and G.B. Berriman, "Experiences Using Cloud Computing for a Scientific Workflow Application," 2nd Workshop on Scientific Cloud Computing (ScienceCloud), 2011.
[34] Z.-H. Zhang, D. Meng, J.-F. Zhan, L. Wang, L.-P. Wu, and W. Huang, "Easy and reliable cluster management: the self-management experience of Fire Phoenix," 20th International Parallel and Distributed Processing Symposium (IPDPS 06), 2006.