
1. The Grid: A new infrastructure for 21st century science


Ian Foster

As computer networks become cheaper and more powerful, a new computing paradigm is poised to transform the practice of science and engineering.

1. Technology trends
2. Infrastructure and tools
3. Grid architecture
4. Authentication, authorization, and policy
5. Current status and future directions

Driven by increasingly complex problems and propelled by increasingly powerful technology, today's science is as much based on computation, data analysis, and collaboration as on the efforts of individual experimentalists and theorists. But even as computer power, data storage, and communication continue to improve exponentially, computational resources are failing to keep up with what scientists demand of them. A personal computer in 2001 is as fast as a supercomputer of 1990. But 10 years ago, biologists were happy to compute a single molecular structure. Now, they want to calculate the structures of complex assemblies of macromolecules (see Figure 2.1) and screen thousands of drug candidates. Personal computers now ship with up to 100 gigabytes (GB) of storage, as much as an entire 1990 supercomputer center. But by 2006, several physics projects, CERN's Large Hadron Collider (LHC) among them, will produce multiple petabytes (10^15 bytes) of data per year. Some wide area networks now operate at 155 megabits per second (Mb/s), three orders of magnitude faster than the state-of-the-art 56 kilobits per second (Kb/s) that connected U.S. supercomputer centers in 1985. But to work with colleagues across the world on petabyte data sets, scientists now demand tens of gigabits per second (Gb/s).

Figure 2.1 Determining the structure of a complex molecule, such as the cholera toxin shown here, is the kind of computationally intense operation that Grids are intended to tackle. (Adapted from G. von Laszewski et al., Cluster Computing, 3(3), page 187, 2000).

What many term the Grid offers a potential means of surmounting these obstacles to progress [1]. Built on the Internet and the World Wide Web, the Grid is a new class of infrastructure. By providing scalable, secure, high-performance mechanisms for discovering and negotiating access to remote resources, the Grid promises to make it possible for scientific collaborations to share resources on an unprecedented scale and for geographically distributed groups to work together in ways that were previously impossible [2-4]. The concept of sharing distributed resources is not new. In 1965, MIT's Fernando Corbato and the other designers of the Multics operating system envisioned a computer facility operating like a power company or water company [5]. And in their 1968 article "The Computer as a Communications Device," J. C. R. Licklider and Robert W. Taylor anticipated Grid-like scenarios [6]. Since the late 1960s, much work has been devoted to developing distributed systems, but with mixed success. Now, however, a combination of technology trends and research advances makes it feasible to realize the Grid vision: to put in place a new international scientific infrastructure with tools that, together, can meet the challenging demands of twenty-first-century science. Indeed, major science communities now accept that Grid technology is important for their future. Numerous government-funded R&D projects are variously developing core technologies, deploying production Grids, and applying Grid technologies to challenging applications. (For a list of major Grid projects, see http://www.mcs.anl.gov/foster/grid-projects.)

1. TECHNOLOGY TRENDS
A useful metric for the rate of technological change is the average period during which speed or capacity doubles or, more or less equivalently, halves in price. For storage, networks, and computing power, these periods are around 12, 9, and 18 months, respectively. The different time constants associated with these three exponentials have significant implications.

The annual doubling of data storage capacity, as measured in bits per unit area, has already reduced the cost of a terabyte (10^12 bytes) disk farm to less than $10 000. Anticipating that the trend will continue, the designers of major physics experiments are planning petabyte data archives. Scientists who create sequences of high-resolution simulations are also planning petabyte archives. Such large data volumes demand more from our analysis capabilities.

Dramatic improvements in microprocessor performance mean that the lowly desktop or laptop is now a powerful computational engine. Nevertheless, computer power is falling behind storage. By doubling only every 18 months or so, computer power takes five years to increase by a single order of magnitude. Assembling the computational resources needed for large-scale analysis at a single location is becoming infeasible.

The solution to these problems lies in dramatic changes taking place in networking. Spurred by such innovations as doping, which boosts the performance of optoelectronic devices, and by the demands of the Internet economy [7], the performance of wide area networks doubles every nine months or so; every five years it increases by two orders of magnitude. The NSFnet network, which connects the National Science Foundation supercomputer centers in the U.S., exemplifies this trend. In 1985, NSFnet's backbone operated at a then-unprecedented 56 Kb/s. This year, the centers will be connected by the 40 Gb/s TeraGrid network (http://www.teragrid.org/), an improvement of six orders of magnitude in 17 years.
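The arithmetic behind these doubling periods can be checked in a few lines; the only inputs are the figures quoted above (12-, 9-, and 18-month doubling times), nothing more:

```python
# Growth implied by a fixed doubling period (in months).
# Doubling periods quoted in the text: storage 12, networks 9, computing 18.

def improvement_factor(doubling_months: float, years: float) -> float:
    """Factor of improvement after `years`, given a doubling period."""
    return 2 ** (years * 12 / doubling_months)

# Computing power (18-month doubling) takes ~5 years per order of magnitude:
print(improvement_factor(18, 5))   # ~10x
# Network performance (9-month doubling) gains ~2 orders of magnitude in 5 years:
print(improvement_factor(9, 5))    # ~100x
```

The same formula confirms the relative trend discussed next: the ratio of network to computer performance itself doubles roughly every 18 months.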
The doubling of network performance relative to computer speed every 18 months has already changed how we think about and undertake collaboration. If, as expected, networks continue to outpace computers at this rate, communication becomes essentially free. To exploit this bandwidth bounty, we must imagine new ways of working that are communication intensive, such as pooling computational resources, streaming large amounts of data from databases or instruments to remote computers, linking sensors with each other and with computers and archives, and connecting people, computing, and storage in collaborative environments that avoid the need for costly travel [8].

If communication is unlimited and free, then we are not restricted to using local resources to solve problems. When running a colleague's simulation code, I do not need to install the code locally. Instead, I can run it remotely on my colleague's computer. When applying the code to datasets maintained at other locations, I do not need to get copies of those datasets myself (not so long ago, I would have requested tapes). Instead, I can have the remote code access those datasets directly. If I wish to repeat the analysis many hundreds of times on different datasets, I can call on the collective computing power of my research collaboration or buy the power from a provider. And when I obtain interesting results, my geographically dispersed colleagues and I can look at and discuss large output datasets by using sophisticated collaboration and visualization tools.

Although these scenarios vary considerably in their complexity, they share a common thread. In each case, I use remote resources to do things that I cannot do easily at home. High-speed networks are often necessary for such remote resource use, but they are far from sufficient. Remote resources are typically owned by others, exist within different administrative domains, run different software, and are subject to different security and access control policies.
Actually using remote resources involves several steps. First, I must discover that they exist. Next, I must negotiate access to them (to be practical, this step cannot involve using the telephone!). Then, I have to configure my hardware and software to use the resources effectively. And I must do all these things without compromising my own security or the security of the remote resources that I make use of, some of which I may have to pay for. Implementing these steps requires uniform mechanisms for such critical tasks as creating and managing services on remote computers, supporting single sign-on to distributed resources, transferring large datasets at high speeds, forming large distributed virtual communities, and maintaining information about the existence, state, and usage policies of community resources. Today's Internet and Web technologies address basic communication requirements, but not the tasks just outlined. Providing the infrastructure and tools that make large-scale, secure resource sharing possible and straightforward is the Grid's raison d'être.
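The discover-and-negotiate steps can be sketched as a small, hypothetical workflow; the `Directory` service, the `negotiate` check, and the resource names below are invented for illustration and correspond to no real Grid API:

```python
# Illustrative sketch of the remote-resource usage steps described above.
# All classes and functions here are hypothetical, not a real Grid API.

from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    cpus: int
    policy: set = field(default_factory=set)   # groups allowed to use it

@dataclass
class Directory:
    """Stand-in for a resource-discovery (directory) service."""
    resources: list

    def discover(self, min_cpus: int) -> list:
        return [r for r in self.resources if r.cpus >= min_cpus]

def negotiate(resource: Resource, my_groups: set) -> bool:
    """Access is granted if the user belongs to an authorized group."""
    return bool(resource.policy & my_groups)

directory = Directory([Resource("clusterA", 64, {"physics"}),
                       Resource("pcB", 2, {"staff"})])

# 1. Discover candidate resources; 2. negotiate access; 3. use the survivors.
candidates = directory.discover(min_cpus=32)
granted = [r for r in candidates if negotiate(r, {"physics"})]
print([r.name for r in granted])   # ['clusterA']
```

The point of the sketch is the shape of the interaction, discovery and policy checks happening through uniform interfaces, rather than any particular mechanism.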

2. INFRASTRUCTURE AND TOOLS


An infrastructure is a technology that we can take for granted when performing our activities. The road system enables us to travel by car; the international banking system allows us to transfer funds across borders; and the Internet allows us to communicate with virtually any electronic device. To be useful, an infrastructure technology must be broadly deployed, which means, in turn, that it must be simple, extraordinarily valuable, or both. A good example is the set of protocols that must be implemented within a device to allow Internet access. The set is so small that people have constructed matchbox-sized Web servers. A Grid infrastructure needs to provide more functionality than the Internet on which it rests, but it must also remain simple. And of course, the need remains for supporting the resources that power the Grid, such as high-speed data movement, caching of large datasets, and on-demand access to computing.

Tools make use of infrastructure services. Internet and Web tools include browsers for accessing remote Web sites, e-mail programs for handling electronic messages, and search engines for locating Web pages. Grid tools are concerned with resource discovery, data management, scheduling of computation, security, and so forth. But the Grid goes beyond sharing and distributing data and computing resources. For the scientist, the Grid offers new and more powerful ways of working, as the following examples illustrate:

Science portals: We are accustomed to climbing a steep learning curve when installing and using a new software package. Science portals make advanced problem-solving methods easier to use by invoking sophisticated packages remotely from Web browsers or other simple, easily downloaded thin clients. The packages themselves can also run remotely on suitable computers within a Grid. Such portals are currently being developed in biology, fusion, computational chemistry, and other disciplines.

Distributed computing: High-speed workstations and networks can yoke together an organization's PCs to form a substantial computational resource. Entropia Inc.'s FightAIDS@Home system harnesses more than 30 000 computers to analyze AIDS drug candidates. And in 2001, mathematicians across the U.S. and Italy pooled their computational resources to solve a particular instance, dubbed Nug30, of an optimization problem. For a week, the collaboration brought an average of 630 and a maximum of 1006 computers to bear on Nug30, delivering a total of 42 000 CPU-days. Future improvements in network performance and Grid technologies will increase the range of problems that aggregated computing resources can tackle.

Large-scale data analysis: Many interesting scientific problems require the analysis of large amounts of data. For such problems, harnessing distributed computing and storage resources is clearly of great value. Furthermore, the natural parallelism inherent in many data analysis procedures makes it feasible to use distributed resources efficiently. For example, the analysis of the many petabytes of data to be produced by the LHC and other future high-energy physics experiments will require the marshalling of tens of thousands of processors and hundreds of terabytes of disk space for holding intermediate results. For various technical and political reasons, assembling these resources at a single location appears impractical. Yet the collective institutional and national resources of the hundreds of institutions participating in those experiments can provide them. These communities can, furthermore, share more than just computers and storage: they can also share analysis procedures and computational results.

Computer-in-the-loop instrumentation: Scientific instruments such as telescopes, synchrotrons, and electron microscopes generate raw data streams that are archived for subsequent batch processing. But quasi-real-time analysis can greatly enhance an instrument's capabilities. For example, consider an astronomer studying solar flares with a radio telescope array. The deconvolution and analysis algorithms used to process the data and detect flares are computationally demanding. Running the algorithms continuously would be inefficient for studying flares that are brief and sporadic. But if the astronomer could call on substantial computing resources (and sophisticated software) in an on-demand fashion, he or she could use automated detection techniques to zoom in on solar flares as they occurred.

Collaborative work: Researchers often want to aggregate not only data and computing power but also human expertise. Collaborative problem formulation, data analysis, and the like are important Grid applications. For example, an astrophysicist who has performed a large, multiterabyte simulation might want colleagues around the world to visualize the results in the same way and at the same time so that the group can discuss the results in real time.

Real Grid applications will frequently contain aspects of several of these and other scenarios. For example, our radio astronomer might also want to look for similar events in an international archive, discuss results with colleagues during a run, and invoke distributed computing runs to evaluate alternative algorithms.

3. GRID ARCHITECTURE
Close to a decade of focused R&D and experimentation has produced considerable consensus on the requirements and architecture of Grid technology (see Box 2.1 for the early history of the Grid). Standard protocols, which define the content and sequence of message exchanges used to request remote operations, have emerged as an essential means of achieving the interoperability that Grid systems depend on. Also essential are standard application programming interfaces (APIs), which define standard interfaces to code libraries and facilitate the construction of Grid components by allowing code components to be reused.

Box 2.1 Historical origins

Grid concepts date to the earliest days of computing, but the genesis of much current Grid R&D lies in the pioneering work conducted on early experimental high-speed networks, such as the gigabit test beds that were established in the U.S. in the early 1990s [9]. One of these test beds was the CASA network, which connected four laboratories in California and New Mexico. Using CASA, Caltech's Paul Messina and his colleagues developed and demonstrated applications that coupled massively parallel and vector supercomputers for computational chemistry, climate modeling, and other sciences. Another test bed, Blanca, connected sites in the Midwest. Charlie Catlett of the National Center for Supercomputing Applications and his colleagues used Blanca to build multimedia digital libraries and demonstrated the potential of remote visualization. Two other test beds investigated remote instrumentation. The gigabit test beds were also used for experiments with wide area communication libraries and high-bandwidth communication protocols. Similar test beds were created in Germany and elsewhere. Within the U.S. at least, the event that moved Grid concepts out of the network laboratory and into the consciousness of ordinary scientists was the I-WAY experiment [10]. Led by Tom DeFanti of the University of Illinois at Chicago and Rick Stevens of Argonne National Laboratory, this ambitious effort linked 11 experimental networks to create, for a week in November 1995, a national high-speed network infrastructure that connected resources at 17 sites across the U.S. and Canada. Some 60 application demonstrations, spanning the gamut from distributed computing to virtual reality collaboration, showed the potential of high-speed networks. The I-WAY also saw the first attempt to construct a unified software infrastructure for such systems, the I-Soft system. Developed by the author and others, I-Soft provided unified scheduling, single sign-on, and other services that allowed the I-WAY to be treated, in some important respects, as an integrated infrastructure.

Figure 2.2 Grid architecture can be thought of as a series of layers of different widths. At the center are the resource and connectivity layers, which contain a relatively small number of key protocols and application programming interfaces that must be implemented everywhere. The surrounding layers can, in principle, contain any number of components.

As Figure 2.2 shows schematically, protocols and APIs can be categorized according to the role they play in a Grid system. At the lowest level, the fabric, we have the physical devices or resources that Grid users want to share and access, including computers, storage systems, catalogs, networks, and various forms of sensors. Above the fabric are the connectivity and resource layers. The protocols in these layers must be implemented everywhere and, therefore, must be relatively small in number. The connectivity layer contains the core communication and authentication protocols required for Grid-specific network transactions. Communication protocols enable the exchange of data between resources, whereas authentication protocols build on communication services to provide cryptographically secure mechanisms for verifying the identity of users and resources. The resource layer contains protocols that exploit communication and authentication protocols to enable the secure initiation, monitoring, and control of resource-sharing operations. Running the same program on different computer systems depends on resource layer protocols. The Globus Toolkit (which is described in Box 2.2) is a commonly used source of connectivity and resource protocols and APIs.

Box 2.2 The Globus Toolkit

The Globus Toolkit (http://www.globus.org/) is a community-based, open-architecture, open-source set of services and software libraries that supports Grids and Grid applications. The toolkit includes software for security, information infrastructure, resource management, data management, communication, fault detection, and portability. It is packaged as a set of components that can be used either independently or together to develop applications. For each component, the toolkit both defines protocols and application programming interfaces (APIs) and provides open-source reference implementations in C and (for client-side APIs) Java. A tremendous variety of higher-level services, tools, and applications have been implemented in terms of these basic components. Some of these services and tools are distributed as part of the toolkit, while others are available from other sources. The NSF-funded GRIDS Center (http://www.grids-center.org/) maintains a repository of components. Globus Project and Globus Toolkit are trademarks of the University of Chicago and University of Southern California.

The collective layer contains protocols, services, and APIs that implement interactions across collections of resources. Because they combine and exploit components from the relatively narrower resource and connectivity layers, the components of the collective layer can implement a wide variety of tasks without requiring new resource-layer components. Examples of collective services include directory and brokering services for resource discovery and allocation; monitoring and diagnostic services; data replication services; and membership and policy services for keeping track of who in a community is allowed to access resources. At the top of any Grid system are the user applications, which are constructed in terms of, and call on, the components in any other layer. For example, a high-energy physics analysis application that needs to execute several thousand independent tasks, each taking as input some set of files containing events, might proceed by obtaining necessary authentication credentials (connectivity layer protocols); querying an information system and replica catalog to determine availability of computers, storage systems, and networks, and the location of required input files (collective services); submitting requests to appropriate computers, storage systems, and networks to initiate computations, move data, and so forth (resource protocols); and monitoring the progress of the various computations and data transfers, notifying the user when all are completed, and detecting and responding to failure conditions (resource protocols). Many of these functions can be carried out by tools that automate the more complex tasks. The University of Wisconsin's Condor-G system (http://www.cs.wisc.edu/condor) is an example of a powerful, full-featured task broker.
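The four-step flow just described can be sketched as a toy pipeline, with each hypothetical stand-in function tagged by the layer it would belong to; none of these functions are part of any real toolkit:

```python
# Sketch of the layered task flow described above. Every function here is
# a hypothetical stand-in for a service at the named layer, not a real API.

def get_credential(user):                 # connectivity layer: authenticate
    return {"user": user, "proxy": True}

def query_catalog(n_tasks):               # collective layer: replica catalog
    # Pretend the catalog maps each task to a host holding its input files.
    return {f"task{i}": f"host{i % 3}" for i in range(n_tasks)}

def submit(task, host, cred):             # resource layer: initiate computation
    assert cred["proxy"]                  # requests carry the delegated credential
    return {"task": task, "host": host, "state": "done"}

def run_analysis(user, n_tasks):
    cred = get_credential(user)           # 1. obtain credentials
    placement = query_catalog(n_tasks)    # 2. discover data and compute
    jobs = [submit(t, h, cred) for t, h in placement.items()]  # 3. submit
    return all(j["state"] == "done" for j in jobs)             # 4. monitor

print(run_analysis("alice", 6))
```

In a real system a broker such as Condor-G would drive steps 2 through 4 automatically, including resubmission on failure.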

4. AUTHENTICATION, AUTHORIZATION, AND POLICY


Authentication, authorization, and policy are among the most challenging issues in Grids. Traditional security technologies are concerned primarily with securing the interactions between clients and servers. In such interactions, a client (that is, a user) and a server need to mutually authenticate (that is, verify) each other's identity, while the server needs to determine whether to authorize requests issued by the client. Sophisticated technologies have been developed for performing these basic operations and for guarding against and detecting various forms of attack. We use these technologies whenever we visit e-commerce Web sites such as Amazon to buy products on-line.

In Grid environments, the situation is more complex. The distinction between client and server tends to disappear, because an individual resource can act as a server one moment (as it receives a request) and as a client at another (as it issues requests to other resources). For example, when I request that a simulation code be run on a colleague's computer, I am the client and the computer is a server. But a few moments later, that same code and computer act as a client, as they issue requests on my behalf to other computers to access input datasets and to run subsidiary computations. Managing that kind of transaction turns out to have a number of interesting requirements, such as:

Single sign-on: A single computation may entail access to many resources, but requiring a user to reauthenticate on each occasion (by, e.g., typing in a password) is impractical and generally unacceptable. Instead, a user should be able to authenticate once and then assign to the computation the right to operate on his or her behalf, typically for a specified period. This capability is achieved through the creation of a proxy credential. In Figure 2.3, the program run by the user (the user proxy) uses a proxy credential to authenticate at two different sites. These services handle requests to create new processes.

Mapping to local security mechanisms: Different sites may use different local security solutions, such as Kerberos and Unix, as depicted in Figure 2.3. A Grid security infrastructure needs to map to these local solutions at each site, so that local operations can proceed with appropriate privileges. In Figure 2.3, processes execute under a local ID and, at site A, are assigned a Kerberos ticket, a credential used by the Kerberos authentication system to keep track of requests.
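The single sign-on and local-mapping ideas can be sketched roughly as follows. The subject name, the in-memory "gridmap" table, and the function names are invented for illustration; they are not the actual GSI interfaces:

```python
# Illustrative sketch of single sign-on via a short-lived proxy credential
# plus mapping from a global identity to a site-local account.
# All names here are hypothetical, not the real Grid Security Infrastructure.

import time

def create_proxy(subject: str, lifetime_s: int) -> dict:
    """User authenticates once; the proxy acts on their behalf until expiry."""
    return {"subject": subject, "expires": time.time() + lifetime_s}

# Global (certificate-style) identity -> local account, one table per site.
GRIDMAP = {"/O=Example/CN=Alice Analyst": "alice"}

def authenticate(proxy: dict, gridmap: dict) -> str:
    """Check proxy validity, then return the local ID to run under."""
    if time.time() > proxy["expires"]:
        raise PermissionError("proxy credential expired")
    try:
        return gridmap[proxy["subject"]]
    except KeyError:
        raise PermissionError("no local account mapped") from None

proxy = create_proxy("/O=Example/CN=Alice Analyst", lifetime_s=12 * 3600)
print(authenticate(proxy, GRIDMAP))   # 'alice'
```

The short lifetime is the safety valve: a leaked proxy is only useful until it expires, which is why proxies are typically issued for hours, not the years a long-term credential lasts.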

Figure 2.3 Smooth and efficient authentication and authorization of requests are essential for Grid operations. Here, a user calls on the computational resources of sites A and B, which then communicate with each other and read files located at a third site, C. Each step requires authorization and authentication, from the single sign-on (or retrieval of the proxy credential) that initiates the task to the remote file access request. Mediating these requests requires the Grid Security Infrastructure (GSI), which provides a single sign-on, run-anywhere authentication service, with support for delegation of credentials to subcomputations, local control over authorization, and mapping from global to local user identities. Also required is the Grid Resource Access and Management (GRAM) protocol and service, which provides remote resource allocation and process creation, monitoring, and management services.

Delegation: The creation of a proxy credential is a form of delegation, an operation of fundamental importance in Grid environments [11]. A computation that spans many resources creates subcomputations (subsidiary computations) that may themselves generate requests to other resources and services, perhaps creating additional subcomputations, and so on. In Figure 2.3, the two subcomputations created at sites A and B both communicate with each other and access files at site C. Authentication operations, and hence further delegated credentials, are involved at each stage, as resources determine whether to grant requests and computations determine whether resources are trustworthy. The further these delegated credentials are disseminated, the greater the risk that they will be acquired and misused by an adversary. These delegation operations and the credentials that enable them must be carefully managed.

Community authorization and policy: In a large community, the policies that govern who can use which resources for what purpose cannot be based directly on individual identity. It is infeasible for each resource to keep track of community membership and privileges. Instead, resources (and users) need to be able to express policies in terms of other criteria, such as group membership, which can be identified with a cryptographic credential issued by a trusted third party. In the scenario depicted in Figure 2.3, the file server at site C must know explicitly whether the user is allowed to access a particular file. A community authorization system allows this policy decision to be delegated to a community representative.
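A minimal sketch of the community-authorization idea follows, assuming a toy in-memory membership table standing in for the trusted third party; none of these names correspond to a real authorization service:

```python
# Sketch of community-based authorization: the file server delegates the
# membership decision to a community service that vouches for group
# membership. Entirely illustrative; not a real authorization service API.

COMMUNITY = {"alice": {"atlas"}, "bob": {"cms"}}    # member -> groups

FILE_POLICY = {"/data/run42.evt": {"atlas"}}        # file -> groups allowed

def community_credential(user: str) -> set:
    """Trusted third party attests to the user's group memberships."""
    return COMMUNITY.get(user, set())

def may_read(user: str, path: str) -> bool:
    # The file server checks group membership, never individual identities.
    return bool(community_credential(user) & FILE_POLICY.get(path, set()))

print(may_read("alice", "/data/run42.evt"))   # True
print(may_read("bob", "/data/run42.evt"))     # False
```

Because the server's policy names only groups, adding a thousand new collaboration members changes the community table, not every resource's configuration.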

5. CURRENT STATUS AND FUTURE DIRECTIONS


As the Grid matures, standard technologies are emerging for basic Grid operations. In particular, the community-based, open-source Globus Toolkit (see Box 2.2) is being applied by most major Grid projects. The business world has also begun to investigate Grid applications (see Box 2.3). By late 2001, 12 companies had announced support for the Globus Toolkit. Progress has also been made on organizational fronts. With more than 1000 people on its mailing lists, the Global Grid Forum (http://www.gridforum.org/) is a significant force for setting standards and community development. Its thrice-yearly meetings attract hundreds of attendees from some 200 organizations. The International Virtual Data Grid Laboratory is being established as an international Grid system (Figure 2.4).

It is commonly observed that people overestimate the short-term impact of change but underestimate long-term effects [14]. It will surely take longer than some expect before Grid concepts and technologies transform the practice of science, engineering, and business, but the combination of exponential technology trends and R&D advances noted in this article is real and will ultimately have dramatic impacts. In a future in which computing, storage, and software are no longer objects that we possess but utilities to which we subscribe, the most successful scientific communities are likely to be those that succeed in assembling and making effective use of appropriate Grid infrastructures, thus accelerating the development and adoption of new problem-solving methods within their disciplines.

Figure 2.4 The International Virtual Data Grid Laboratory (iVDGL) (http://www.ivdgl.org/) is being established to support both Grid research and production computing. The figure shows the approximate distribution of sites and networks planned for the initial rollout. (The actual sites could change by the time iVDGL becomes operational.) Major international projects, including EU DataGrid, Grid Physics Network, and Particle Physics Data Grid, are collaborating on the establishment of iVDGL.

Box 2.3 Commercial Grids and the Open Grid Services Architecture

Grid concepts are becoming increasingly relevant to commercial information technology (IT). With the rise of e-Business and IT outsourcing, large-scale enterprise applications no longer run exclusively within the friendly confines of a central computing facility. Instead, they must operate on heterogeneous collections of resources that may span multiple administrative units within a company, as well as various external networks. Delivering high-quality service within dynamic virtual organizations is just as important in business as it is in science and engineering. One consequence of this convergence is a growing interest in the integration of Grid technologies with previously distinct commercial technologies, which tend to be based on so-called Web services. Despite the name, Web services are not particularly concerned with Web sites, browsers, or protocols, but rather with standards for defining interfaces to, and communicating with, remote processes (services). Thus, for example, a distributed astronomical data system might be constructed as a set of Web services concerned variously with retrieving, processing, and visualizing data. By requiring input, such as a customer's address, in a certain format, Web services end up setting standards for remote services on the Web. Several major industrial distributed computing technologies, such as Microsoft's .NET, IBM Corp.'s WebSphere, and Sun's Java 2 Enterprise Edition, are based on Web services [12]. To effect the integration of Grid technologies and Web services, the Globus Project and IBM's Open Service Architecture group have proposed the Open Grid Services Architecture [13].
In this blueprint, the two technologies are combined to define, among other things, standard behaviors and interfaces for what could be termed a Grid service: a Web service that can be created dynamically and that supports security, lifetime management, manageability, and other functions required in Grid scenarios. These features are being incorporated into the Globus Toolkit and will likely also appear in commercial products.

2. The anatomy of the Grid: Enabling Scalable Virtual Organizations

Ian Foster, Carl Kesselman, and Steven Tuecke

1. Introduction
2. The emergence of virtual organizations
3. The nature of Grid architecture
4. Grid architecture description
   4.1 Fabric: Interfaces to local control
   4.2 Connectivity: Communicating easily and securely
   4.3 Resource: Sharing single resources
   4.4 Collective: Coordinating multiple resources
   4.5 Applications
5. Grid architecture in practice
6. On the Grid: the need for intergrid protocols
7. Relationships with other technologies
   7.1 World Wide Web
   7.2 Application and storage service providers
   7.3 Enterprise computing systems
   7.4 Internet and peer-to-peer computing
8. Other perspectives on Grids
9. Summary

1. INTRODUCTION
The term "the Grid" was coined in the mid-1990s to denote a proposed distributed computing infrastructure for advanced science and engineering [1]. Considerable progress has since been made on the construction of such an infrastructure (e.g., [2-5]), but the term Grid has also been conflated, at least in popular perception, to embrace everything from advanced networking to artificial intelligence. One might wonder whether the term has any real substance and meaning. Is there really a distinct Grid problem and hence a need for new Grid technologies? If so, what is the nature of these technologies, and what is their domain of applicability? While numerous groups have interest in Grid concepts and share, to a significant extent, a common vision of Grid architecture, we do not see consensus on the answers to these questions. Our purpose in this article is to argue that the Grid concept is indeed motivated by a real and specific problem and that there is an emerging, well-defined Grid technology base that addresses significant aspects of this problem. In the process, we develop a detailed architecture and roadmap for current and future Grid technologies. Furthermore, we assert that while Grid technologies are currently distinct from other major technology trends, such as Internet, enterprise, distributed, and peer-to-peer computing, these other trends can benefit significantly from growing into the problem space addressed by Grid technologies. The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing that we are concerned with is not primarily file exchange but rather direct access to computers, software, data, and other resources, as is required by a range of collaborative problem-solving and resource-brokering strategies emerging in industry, science, and engineering.
This sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs. A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization (VO).

The following are examples of VOs: the application service providers, storage service providers, cycle providers, and consultants engaged by a car manufacturer to perform scenario evaluation during planning for a new factory; members of an industrial consortium bidding on a new aircraft; a crisis management team and the databases and simulation systems that they use to plan a response to an emergency situation; and members of a large, international, multiyear high-energy physics collaboration. Each of these examples represents an approach to computing and problem solving based on collaboration in computation- and data-rich environments.

As these examples show, VOs vary tremendously in their purpose, scope, size, duration, structure, community, and sociology. Nevertheless, careful study of underlying technology requirements leads us to identify a broad set of common concerns and requirements. In particular, we see a need for highly flexible sharing relationships, ranging from client-server to peer-to-peer; for sophisticated and precise levels of control over how shared resources are used, including fine-grained and multi-stakeholder access control, delegation, and application of local and global policies; for sharing of varied resources, ranging from programs, files, and data to computers, sensors, and networks; and for diverse usage modes, ranging from single user to multiuser and from performance sensitive to cost sensitive, and hence embracing issues of quality of service, scheduling, co-allocation, and accounting.

Current distributed computing technologies do not address the concerns and requirements just listed. For example, current Internet technologies address communication and information exchange among computers but do not provide integrated approaches to the coordinated use of resources at multiple sites for computation. Business-to-business exchanges [6] focus on information sharing (often via centralized servers). So do virtual enterprise technologies, although here sharing may eventually extend to applications and physical devices (e.g., [7]). Enterprise distributed computing technologies such as CORBA and Enterprise Java enable resource sharing within a single organization. The Open Group's Distributed Computing Environment (DCE) supports secure resource sharing across sites, but most VOs would find it too burdensome and inflexible. Storage service providers (SSPs) and application service providers (ASPs) allow organizations to outsource storage and computing requirements to other parties, but only in constrained ways: for example, SSP resources are typically linked to a customer via a virtual private network (VPN). Emerging distributed computing companies seek to harness idle computers on an international scale [31] but, to date, support only highly centralized access to those resources.

In summary, current technology either does not accommodate the range of resource types or does not provide the flexibility and control on sharing relationships needed to establish VOs. It is here that Grid technologies enter the picture. Over the past five years, research and development efforts within the Grid community have produced protocols, services, and tools that address precisely the challenges that arise when we seek to build scalable VOs.
These technologies include security solutions that support management of credentials and policies when computations span multiple institutions; resource management protocols and services that support secure remote access to computing and data resources and the co-allocation of multiple resources; information query protocols and services that provide configuration and status information about resources, organizations, and services; and data management services that locate and transport datasets between storage systems and applications.

Because of their focus on dynamic, cross-organizational sharing, Grid technologies complement rather than compete with existing distributed computing technologies. For example, enterprise distributed computing systems can use Grid technologies to achieve resource sharing across institutional boundaries; in the ASP/SSP space, Grid technologies can be used to establish dynamic markets for computing and storage resources, hence overcoming the limitations of current static configurations. We discuss the relationship between Grids and these technologies in more detail below.

In the rest of this article, we expand upon each of these points in turn. Our objectives are to (1) clarify the nature of VOs and Grid computing for those unfamiliar with the area; (2) contribute to the emergence of Grid computing as a discipline by establishing a standard vocabulary and defining an overall architectural framework; and (3) define clearly how Grid technologies relate to other technologies, explaining both why emerging technologies do not yet solve the Grid computing problem and how these technologies can benefit from Grid technologies.

It is our belief that VOs have the potential to change dramatically the way we use computers to solve problems, much as the Web has changed how we exchange information.
As the examples presented here illustrate, the need to engage in collaborative processes is fundamental to many diverse disciplines and activities: it is not limited to science, engineering, and business activities. It is because of this broad applicability of VO concepts that Grid technology is important.

2. THE EMERGENCE OF VIRTUAL ORGANIZATIONS


Consider the following four scenarios:

1. A company needing to reach a decision on the placement of a new factory invokes a sophisticated financial forecasting model from an ASP, providing it with access to appropriate proprietary historical data from a corporate database on storage systems operated by an SSP. During the decision-making meeting, what-if scenarios are run collaboratively and interactively, even though the division heads participating in the decision are located in different cities. The ASP itself contracts with a cycle provider for additional "oomph" during particularly demanding scenarios, requiring of course that cycles meet desired security and performance requirements.

2. An industrial consortium formed to develop a feasibility study for a next-generation supersonic aircraft undertakes a highly accurate multidisciplinary simulation of the entire aircraft. This simulation integrates proprietary software components developed by different participants, with each component operating on that participant's computers and having access to appropriate design databases and other data made available to the consortium by its members.

3. A crisis management team responds to a chemical spill by using local weather and soil models to estimate the spread of the spill, determining the impact based on population location as well as geographic features such as rivers and water supplies, creating a short-term mitigation plan (perhaps based on chemical reaction models), and tasking emergency response personnel by planning and coordinating evacuation, notifying hospitals, and so forth.

4. Thousands of physicists at hundreds of laboratories and universities worldwide come together to design, create, operate, and analyze the products of a major detector at CERN, the European high energy physics laboratory. During the analysis phase, they pool their computing, storage, and networking resources to create a Data Grid capable of analyzing petabytes of data [8–10].

These four examples differ in many respects: the number and type of participants, the types of activities, the duration and scale of the interaction, and the resources being shared. But they also have much in common, as discussed in the following (see also Figure 1).

In each case, a number of mutually distrustful participants with varying degrees of prior relationship (perhaps none at all) want to share resources in order to perform some task. Furthermore, sharing is about more than simply document exchange (as in virtual enterprises [11]): it can involve direct access to remote software, computers, data, sensors, and other resources. For example, members of a consortium may provide access to specialized software and data and/or pool their computational resources.

Resource sharing is conditional: each resource owner makes resources available, subject to constraints on when, where, and what can be done. For example, a participant in VO P of Figure 1 might allow VO partners to invoke their simulation service only for simple problems. Resource consumers may also place constraints on properties of the resources they are prepared to work with. For example, a participant in VO Q might accept only pooled computational resources certified as secure.
The implementation of such constraints requires mechanisms for expressing policies, for establishing the identity of a consumer or resource (authentication), and for determining whether an operation is consistent with applicable sharing relationships (authorization).

Sharing relationships can vary dynamically over time, in terms of the resources involved, the nature of the access permitted, and the participants to whom access is permitted. And these relationships do not necessarily involve an explicitly named set of individuals, but rather may be defined implicitly by the policies that govern access to resources. For example, an organization might enable access by anyone who can demonstrate that he or she is a "customer" or a "student." The dynamic nature of sharing relationships means that we require mechanisms for discovering and characterizing the nature of the relationships that exist at a particular point in time. For example, a new participant joining VO Q must be able to determine what resources it is able to access, the quality of these resources, and the policies that govern access.
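The policy, authentication, and authorization mechanisms just described can be sketched as a small policy-evaluation routine. The Python below is an illustration only: the rule functions, field names, and the deny-overrides combination strategy are assumptions made for this example, not part of any Grid specification.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Request:
    identity: str     # authenticated identity of the requester
    operation: str    # e.g. "run_simple_problem"
    resource: str     # name of the shared resource

# A rule inspects a request and votes True (allow), False (deny),
# or None (no opinion; defer to the other rules).
Rule = Callable[[Request], Optional[bool]]

def owner_rule(req: Request) -> Optional[bool]:
    # Resource-owner constraint: the simulation service is shared
    # only for simple problems (cf. the VO P example above).
    if req.resource == "simulation-service":
        return req.operation == "run_simple_problem"
    return None

def membership_rule(req: Request) -> Optional[bool]:
    # Implicitly defined membership: access for anyone who can
    # demonstrate "customer" status.
    if req.identity.endswith("@customer.example.org"):
        return True
    return None

def is_authorized(req: Request, rules: List[Rule]) -> bool:
    # Deny-overrides combination: any explicit deny wins; otherwise
    # at least one explicit allow is required.
    votes = [rule(req) for rule in rules]
    if False in votes:
        return False
    return True in votes
```

A real VO would express such rules in a policy language evaluated at each resource rather than as hard-coded functions, but the shape of the decision (identity, operation, resource, combined local and global rules) is the same.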

Figure 1 An actual organization can participate in one or more VOs by sharing some or all of its resources. We show three actual organizations (the ovals), and two VOs: P, which links participants in an aerospace design consortium, and Q, which links colleagues who have agreed to share spare computing cycles, for example, to run ray tracing computations. The organization on the left participates in P, the one to the right participates in Q, and the third is a member of both P and Q. The policies governing access to resources (summarized in quotes) vary according to the actual organizations, resources, and VOs involved.

Sharing relationships are often not simply client-server, but peer-to-peer: providers can be consumers, and sharing relationships can exist among any subset of participants. Sharing relationships may be combined to coordinate use across many resources, each owned by different organizations. For example, in VO Q, a computation started on one pooled computational resource may subsequently access data or initiate subcomputations elsewhere. The ability to delegate authority in controlled ways becomes important in such situations, as do mechanisms for coordinating operations across multiple resources (e.g., co-scheduling).

The same resource may be used in different ways, depending on the restrictions placed on the sharing and the goal of the sharing. For example, a computer may be used only to run a specific piece of software in one sharing arrangement, while it may provide generic compute cycles in another. Because of the lack of a priori knowledge about how a resource may be used, performance metrics, expectations, and limitations (i.e., quality of service) may be part of the conditions placed on resource sharing or usage.

These characteristics and requirements define what we term a virtual organization, a concept that we believe is becoming fundamental to much of modern computing. VOs enable disparate groups of organizations and/or individuals to share resources in a controlled fashion, so that members may collaborate to achieve a shared goal.

3. THE NATURE OF GRID ARCHITECTURE


The establishment, management, and exploitation of dynamic, cross-organizational VO sharing relationships require new technology. We structure our discussion of this technology in terms of a Grid architecture that identifies fundamental system components, specifies the purpose and function of these components, and indicates how these components interact with one another.

In defining a Grid architecture, we start from the perspective that effective VO operation requires that we be able to establish sharing relationships among any potential participants. Interoperability is thus the central issue to be addressed. In a networked environment, interoperability means common protocols. Hence, our Grid architecture is first and foremost a protocol architecture, with protocols defining the basic mechanisms by which VO users and resources negotiate, establish, manage, and exploit sharing relationships. A standards-based open architecture facilitates extensibility, interoperability, portability, and code sharing; standard protocols make it easy to define standard services that provide enhanced capabilities. We can also construct application programming interfaces and software development kits (see Appendix for definitions) to provide the programming abstractions required to create a usable Grid. Together, this technology and architecture constitute what is often termed middleware ("the services needed to support a common set of applications in a distributed network environment" [12]), although we avoid that term here because of its vagueness. We discuss each of these points in the following.

Why is interoperability such a fundamental concern? At issue is our need to ensure that sharing relationships can be initiated among arbitrary parties, accommodating new participants dynamically, across different platforms, languages, and programming environments.
In this context, mechanisms serve little purpose if they are not defined and implemented so as to be interoperable across organizational boundaries, operational policies, and resource types. Without interoperability, VO applications and participants are forced to enter into bilateral sharing arrangements, as there is no assurance that the mechanisms used between any two parties will extend to any other parties. Without such assurance, dynamic VO formation is all but impossible, and the types of VOs that can be formed are severely limited. Just as the Web revolutionized information sharing by providing a universal protocol and syntax (HTTP and HTML) for information exchange, so we require standard protocols and syntaxes for general resource sharing.

Why are protocols critical to interoperability? A protocol definition specifies how distributed system elements interact with one another in order to achieve a specified behavior, and the structure of the information exchanged during this interaction. This focus on externals (interactions) rather than internals (software, resource characteristics) has important pragmatic benefits. VOs tend to be fluid; hence, the mechanisms used to discover resources, establish identity, determine authorization, and initiate sharing must be flexible and lightweight, so that resource-sharing arrangements can be established and changed quickly. Because VOs complement rather than replace existing institutions, sharing mechanisms cannot require substantial changes to local policies and must allow individual institutions to maintain ultimate control over their own resources. Since protocols govern the interaction between components, and not the implementation of the components, local control is preserved.

Why are services important? A service (see Appendix) is defined solely by the protocol that it speaks and the behaviors that it implements.
The definition of standard services for access to computation, access to data, resource discovery, co-scheduling, data replication, and so forth allows us to enhance the services offered to VO participants and also to abstract away resource-specific details that would otherwise hinder the development of VO applications.

Why do we also consider application programming interfaces (APIs) and software development kits (SDKs)? There is, of course, more to VOs than interoperability, protocols, and services. Developers must be able to develop sophisticated applications in complex and dynamic execution environments. Users must be able to operate these applications. Application robustness, correctness, development costs, and maintenance costs are all important concerns. Standard abstractions, APIs, and SDKs can accelerate code development, enable code sharing, and enhance application portability. APIs and SDKs are an adjunct to, not an alternative to, protocols. Without standard protocols, interoperability can be achieved at the API level only by using a single implementation everywhere (infeasible in many interesting VOs) or by having every implementation know the details of every other implementation. (The Jini approach [13] of downloading protocol code to a remote site does not circumvent this requirement.)

In summary, our approach to Grid architecture emphasizes the identification and definition of protocols and services, first, and APIs and SDKs, second.

4. GRID ARCHITECTURE DESCRIPTION


Our goal in describing our Grid architecture is not to provide a complete enumeration of all required protocols (and services, APIs, and SDKs) but rather to identify requirements for general classes of component. The result is an extensible, open architectural structure within which can be placed solutions to key VO requirements. Our architecture and the subsequent discussion organize components into layers, as shown in Figure 2. Components within each layer share common characteristics but can build on capabilities and behaviors provided by any lower layer.

Figure 2 The layered Grid architecture and its relationship to the Internet protocol architecture. Because the Internet protocol architecture extends from network to application, there is a mapping from Grid layers into Internet layers.

In specifying the various layers of the Grid architecture, we follow the principles of the hourglass model [14]. The narrow neck of the hourglass defines a small set of core abstractions and protocols (e.g., TCP and HTTP in the Internet), onto which many different high-level behaviors can be mapped (the top of the hourglass), and which themselves can be mapped onto many different underlying technologies (the base of the hourglass). By definition, the number of protocols defined at the neck must be small. In our architecture, the neck of the hourglass consists of Resource and Connectivity protocols, which facilitate the sharing of individual resources. Protocols at these layers are designed so that they can be implemented on top of a diverse range of resource types, defined at the Fabric layer, and can in turn be used to construct a wide range of global services and application-specific behaviors at the Collective layer, so called because they involve the coordinated (collective) use of multiple resources.

Our architectural description is high level and places few constraints on design and implementation. To make this abstract discussion more concrete, we also list, for illustrative purposes, the protocols defined within the Globus Toolkit [15] and used within such Grid projects as the NSF's National Technology Grid [5], NASA's Information Power Grid [4], DOE's DISCOM [2], GriPhyN (www.griphyn.org), NEESgrid (www.neesgrid.org), Particle Physics Data Grid (www.ppdg.net), and the European Data Grid (www.eu-datagrid.org). More details will be provided in a subsequent paper.

4.1 Fabric: Interfaces to local control

The Grid Fabric layer provides the resources to which shared access is mediated by Grid protocols: for example, computational resources, storage systems, catalogs, network resources, and sensors.
A resource may be a logical entity, such as a distributed file system, computer cluster, or distributed computer pool; in such cases, a resource implementation may involve internal protocols (e.g., the NFS storage access protocol or a cluster resource management system's process management protocol), but these are not the concern of Grid architecture.

Fabric components implement the local, resource-specific operations that occur on specific resources (whether physical or logical) as a result of sharing operations at higher levels. There is thus a tight and subtle interdependence between the functions implemented at the Fabric level, on the one hand, and the sharing operations supported, on the other. Richer Fabric functionality enables more sophisticated sharing operations; at the same time, if we place few demands on Fabric elements, then deployment of Grid infrastructure is simplified. For example, resource-level support for advance reservations makes it possible for higher-level services to aggregate (co-schedule) resources in interesting ways that would otherwise be impossible to achieve. However, as in practice few resources support advance reservation "out of the box," a requirement for advance reservation increases the cost of incorporating new resources into a Grid.

Experience suggests that at a minimum, resources should implement enquiry mechanisms that permit discovery of their structure, state, and capabilities (e.g., whether they support advance reservation), on the one hand, and resource management mechanisms that provide some control of delivered quality of service, on the other. The following brief and partial list provides a resource-specific characterization of capabilities.

Computational resources: Mechanisms are required for starting programs and for monitoring and controlling the execution of the resulting processes. Management mechanisms that allow control over the resources allocated to processes are useful, as are advance reservation mechanisms. Enquiry functions are needed for determining hardware and software characteristics as well as relevant state information such as current load and queue state in the case of scheduler-managed resources.

Storage resources: Mechanisms are required for putting and getting files. Third-party and high-performance (e.g., striped) transfers are useful [16]. So are mechanisms for reading and writing subsets of a file and/or executing remote data selection or reduction functions [17]. Management mechanisms that allow control over the resources allocated to data transfers (space, disk bandwidth, network bandwidth, CPU) are useful, as are advance reservation mechanisms. Enquiry functions are needed for determining hardware and software characteristics as well as relevant load information such as available space and bandwidth utilization.

Network resources: Management mechanisms that provide control over the resources allocated to network transfers (e.g., prioritization, reservation) can be useful. Enquiry functions should be provided to determine network characteristics and load.

Code repositories: This specialized form of storage resource requires mechanisms for managing versioned source and object code: for example, a control system such as CVS.

Catalogs: This specialized form of storage resource requires mechanisms for implementing catalog query and update operations: for example, a relational database [18].

Globus Toolkit: The Globus Toolkit has been designed to use (primarily) existing fabric components, including vendor-supplied protocols and interfaces. However, if a vendor does not provide the necessary Fabric-level behavior, the Globus Toolkit includes the missing functionality. For example, enquiry software is provided for discovering structure and state information for various common resource types, such as computers (e.g., OS version, hardware configuration, load [19], scheduler queue status), storage systems (e.g., available space), and networks (e.g., current and predicted future load [20, 21]), and for packaging this information in a form that facilitates the implementation of higher-level protocols, specifically at the Resource layer. Resource management, on the other hand, is generally assumed to be the domain of local resource managers.
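The minimal Fabric capability described above, enquiry of a resource's structure, state, and capabilities, can be illustrated with a toy resource description. All class and field names below are hypothetical; real enquiry software (such as the Globus components just mentioned) publishes far richer, schema-governed information.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ResourceDescription:
    """Structure, capability, and state report for one Fabric resource.

    Illustrative only: the fields shown are a tiny subset of what a
    production information service would publish.
    """
    name: str
    kind: str                                   # "compute", "storage", "network", ...
    capabilities: Dict[str, bool] = field(default_factory=dict)
    state: Dict[str, Any] = field(default_factory=dict)

def enquire(resource: ResourceDescription) -> Dict[str, Any]:
    # Minimal enquiry operation: flatten structure, capabilities, and
    # current state into one record, as a discovery service might use it.
    return {
        "name": resource.name,
        "kind": resource.kind,
        "advance_reservation": resource.capabilities.get("advance_reservation", False),
        **resource.state,
    }

# A hypothetical scheduler-managed compute resource.
cluster = ResourceDescription(
    name="cluster.example.org",
    kind="compute",
    capabilities={"advance_reservation": False},
    state={"os": "Linux", "load": 0.42, "queue_length": 7},
)
```

The design point matches the text: a higher-level service can discover from such a report that this cluster does not support advance reservation and plan accordingly, without the Fabric layer having to promise anything beyond enquiry.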
One exception is the General-purpose Architecture for Reservation and Allocation (GARA) [22], which provides a slot manager that can be used to implement advance reservation for resources that do not support this capability. Others have developed enhancements to the Portable Batch System (PBS) [23] and Condor [24, 25] that support advance reservation capabilities.

4.2 Connectivity: Communicating easily and securely

The Connectivity layer defines core communication and authentication protocols required for Grid-specific network transactions. Communication protocols enable the exchange of data between Fabric layer resources. Authentication protocols build on communication services to provide cryptographically secure mechanisms for verifying the identity of users and resources.

Communication requirements include transport, routing, and naming. While alternatives certainly exist, we assume here that these protocols are drawn from the TCP/IP protocol stack: specifically, the Internet (IP and ICMP), transport (TCP, UDP), and application (DNS, OSPF, RSVP, etc.) layers of the Internet layered protocol architecture [26]. This is not to say that in the future, Grid communications will not demand new protocols that take into account particular types of network dynamics.

With respect to security aspects of the Connectivity layer, we observe that the complexity of the security problem makes it important that any solutions be based on existing standards whenever possible. As with communication, many of the security standards developed within the context of the Internet protocol suite are applicable. Authentication solutions for VO environments should have the following characteristics [27]:

Single sign-on: Users must be able to "log on" (authenticate) just once and then have access to multiple Grid resources defined in the Fabric layer, without further user intervention.

Delegation [28–30]: A user must be able to endow a program with the ability to run on that user's behalf, so that the program is able to access the resources on which the user is authorized. The program should (optionally) also be able to conditionally delegate a subset of its rights to another program (sometimes referred to as restricted delegation).

Integration with various local security solutions: Each site or resource provider may employ any of a variety of local security solutions, including Kerberos and Unix security. Grid security solutions must be able to interoperate with these various local solutions. They cannot, realistically, require wholesale replacement of local security solutions but rather must allow mapping into the local environment.

User-based trust relationships: In order for a user to use resources from multiple providers together, the security system must not require each of the resource providers to cooperate or interact with each other in configuring the security environment. For example, if a user has the right to use sites A and B, the user should be able to use sites A and B together without requiring that A's and B's security administrators interact.
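The restricted-delegation requirement above can be sketched abstractly as follows. This Python model is a deliberate simplification: it represents credentials as plain objects carrying sets of rights and ignores the cryptographic machinery (such as the proxy certificates used by GSI) that any real implementation requires.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Credential:
    subject: str
    rights: FrozenSet[str]
    issuer: Optional["Credential"] = None   # delegation chain; None = end entity

def delegate(cred: Credential, subject: str,
             rights: FrozenSet[str]) -> Credential:
    # Restricted delegation: a program may receive only a subset of the
    # rights held by its delegator.
    if not rights <= cred.rights:
        raise ValueError("cannot delegate rights the delegator does not hold")
    return Credential(subject=subject, rights=rights, issuer=cred)

def effective_rights(cred: Credential) -> FrozenSet[str]:
    # Rights are intersected along the whole chain, so no link can
    # amplify what an earlier link granted.
    if cred.issuer is None:
        return cred.rights
    return cred.rights & effective_rights(cred.issuer)
```

The subset check at each delegation step, together with the intersection along the chain, is what makes delegation "controlled": a user can hand a program exactly the rights the task needs and no more.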
Grid security solutions should also provide flexible support for communication protection (e.g., control over the degree of protection, independent data unit protection for unreliable protocols, support for reliable transport protocols other than TCP) and enable stakeholder control over authorization decisions, including the ability to restrict the delegation of rights in various ways.

Globus Toolkit: The Internet protocols listed above are used for communication. The public-key based Grid Security Infrastructure (GSI) protocols [27, 28] are used for authentication, communication protection, and authorization. GSI builds on and extends the Transport Layer Security (TLS) protocols [31] to address most of the issues listed above: in particular, single sign-on, delegation, integration with various local security solutions (including Kerberos [32]), and user-based trust relationships. X.509-format identity certificates are used. Stakeholder control of authorization is supported via an authorization toolkit that allows resource owners to integrate local policies via a Generic Authorization and Access (GAA) control interface. Rich support for restricted delegation is not provided in the current toolkit release (v1.1.4) but has been demonstrated in prototypes.

4.3 Resource: Sharing single resources

The Resource layer builds on Connectivity layer communication and authentication protocols to define protocols (and APIs and SDKs) for the secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources. Resource layer implementations of these protocols call Fabric layer functions to access and control local resources. Resource layer protocols are concerned entirely with individual resources and hence ignore issues of global state and atomic actions across distributed collections; such issues are the concern of the Collective layer discussed next.
Two primary classes of Resource layer protocols can be distinguished:

Information protocols are used to obtain information about the structure and state of a resource, for example, its configuration, current load, and usage policy (e.g., cost).

Management protocols are used to negotiate access to a shared resource, specifying, for example, resource requirements (including advance reservation and quality of service) and the operation(s) to be performed, such as process creation or data access. Since management protocols are responsible for instantiating sharing relationships, they must serve as a policy application point, ensuring that the requested protocol operations are consistent with the policy under which the resource is to be shared. Issues that must be considered include accounting and payment. A protocol may also support monitoring the status of an operation and controlling (e.g., terminating) the operation.

While many such protocols can be imagined, the Resource (and Connectivity) protocol layers form the neck of our hourglass model and as such should be limited to a small and focused set. These protocols must be chosen so as to capture the fundamental mechanisms of sharing across many different resource types (e.g., different local resource management systems), while not overly constraining the types or performance of higher-level protocols that may be developed.
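The distinction between the two protocol classes can be made concrete with toy message types. The field names below are hypothetical; they merely illustrate that information messages are read-only while management messages must pass through a policy application point.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set

@dataclass
class InformationQuery:
    # Information protocol: asks about structure and state; changes nothing.
    resource: str
    attributes: List[str]       # e.g. ["configuration", "current_load", "cost"]

@dataclass
class ManagementRequest:
    # Management protocol: attempts to instantiate a sharing relationship,
    # so it must be checked against the owner's sharing policy.
    resource: str
    operation: str                          # e.g. "create_process", "read_data"
    requirements: Dict[str, Any] = field(default_factory=dict)  # QoS, reservation
    credential: str = ""        # authenticated identity (from the Connectivity layer)

def apply_policy(req: ManagementRequest, shared_operations: Set[str]) -> bool:
    # The policy application point: the requested operation must be one
    # the resource owner has agreed to share.
    return req.operation in shared_operations
```

In a real protocol these would be wire messages rather than in-memory objects, but the asymmetry holds: queries can be answered freely, while management requests are where accounting, payment, and authorization are enforced.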

The list of desirable Fabric functionality provided in Section 4.1 summarizes the major features required in Resource layer protocols. To this list we add the need for exactly-once semantics for many operations, with reliable error reporting indicating when operations fail.

Globus Toolkit: A small and mostly standards-based set of protocols is adopted. In particular:

A Grid Resource Information Protocol (GRIP, currently based on the Lightweight Directory Access Protocol, LDAP) is used to define a standard resource information protocol and associated information model. An associated soft-state resource registration protocol, the Grid Resource Registration Protocol (GRRP), is used to register resources with Grid Index Information Servers, discussed in the next section [33].

The HTTP-based Grid Resource Access and Management (GRAM) protocol is used for allocation of computational resources and for monitoring and control of computation on those resources [34].

An extended version of the File Transfer Protocol, GridFTP, is a management protocol for data access; extensions include use of Connectivity layer security protocols, partial file access, and management of parallelism for high-speed transfers [35]. FTP is adopted as a base data transfer protocol because of its support for third-party transfers and because its separate control and data channels facilitate the implementation of sophisticated servers.

LDAP is also used as a catalog access protocol.

The Globus Toolkit defines client-side C and Java APIs and SDKs for each of these protocols. Server-side SDKs and servers are also provided for each protocol, to facilitate the integration of various resources (computational, storage, network) into the Grid. For example, the Grid Resource Information Service (GRIS) implements server-side LDAP functionality, with callouts allowing for publication of arbitrary resource information [33].
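The soft-state registration idea behind GRRP can be illustrated with a toy index server in which registrations expire unless refreshed. The class name and time-to-live policy below are assumptions for illustration, not the actual GRRP protocol.

```python
import time
from typing import Dict, List, Optional

class SoftStateDirectory:
    """Toy index server using soft-state registration: an entry expires
    unless the resource periodically re-registers, so resources that
    disappear silently also vanish from the directory."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._last_seen: Dict[str, float] = {}   # resource name -> last refresh time

    def register(self, name: str, now: Optional[float] = None) -> None:
        # Initial registration and periodic refresh are the same operation;
        # `now` is injectable for testing, defaulting to the wall clock.
        self._last_seen[name] = time.time() if now is None else now

    def live_resources(self, now: Optional[float] = None) -> List[str]:
        # Only entries refreshed within the TTL window are considered live.
        t = time.time() if now is None else now
        return sorted(name for name, seen in self._last_seen.items()
                      if t - seen <= self.ttl)
```

The attraction of soft state in a VO setting is robustness: no explicit deregistration protocol is needed, so the directory self-heals after failures of either resources or the index server's clients.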
An important server-side element of the overall toolkit is the gatekeeper, which provides what is in essence a GSI-authenticated inetd that speaks the GRAM protocol and can be used to dispatch various local operations. The Generic Security Services (GSS) API [36] is used to acquire, forward, and verify authentication credentials and to provide transport layer integrity and privacy within these SDKs and servers, enabling substitution of alternative security services at the Connectivity layer.

4.4 Collective: Coordinating multiple resources

While the Resource layer is focused on interactions with a single resource, the next layer in the architecture contains protocols and services (and APIs and SDKs) that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources. For this reason, we refer to the next layer of the architecture as the Collective layer. Because Collective components build on the narrow Resource and Connectivity layer neck in the protocol hourglass, they can implement a wide variety of sharing behaviors without placing new requirements on the resources being shared. For example:

+ Directory services allow VO participants to discover the existence and/or properties of VO resources. A directory service may allow its users to query for resources by name and/or by attributes such as type, availability, or load [33]. Resource-level GRRP and GRIP protocols are used to construct directories.
+ Co-allocation, scheduling, and brokering services allow VO participants to request the allocation of one or more resources for a specific purpose and the scheduling of tasks on the appropriate resources. Examples include AppLeS [37, 38], Condor-G [39], Nimrod-G [40], and the DRM broker [2].
+ Monitoring and diagnostics services support the monitoring of VO resources for failure, adversarial attack (intrusion detection), overload, and so forth.
+ Data replication services support the management of VO storage (and perhaps also network and computing) resources to maximize data access performance with respect to metrics such as response time, reliability, and cost [9, 35].
+ Grid-enabled programming systems enable familiar programming models to be used in Grid environments, using various Grid services to address resource discovery, security, resource allocation, and other concerns. Examples include Grid-enabled implementations of the Message Passing Interface [41, 42] and manager-worker frameworks [43, 44].
+ Workload management systems and collaboration frameworks, also known as problem solving environments (PSEs), provide for the description, use, and management of multistep, asynchronous, multicomponent workflows.
+ Software discovery services discover and select the best software implementation and execution platform based on the parameters of the problem being solved [45]. Examples include NetSolve [46] and Ninf [47].

+ Community authorization servers enforce community policies governing resource access, generating capabilities that community members can use to access community resources. These servers provide a global policy enforcement service by building on resource information and resource management protocols (in the Resource layer) and security protocols in the Connectivity layer. Akenti [48] addresses some of these issues.
+ Community accounting and payment services gather resource usage information for the purpose of accounting, payment, and/or limiting of resource usage by community members.
+ Collaboratory services support the coordinated exchange of information within potentially large user communities, whether synchronously or asynchronously. Examples are CAVERNsoft [49, 50], Access Grid [51], and commodity groupware systems.

These examples illustrate the wide variety of Collective layer protocols and services that are encountered in practice. Notice that while Resource layer protocols must be general in nature and are widely deployed, Collective layer protocols span the spectrum from general purpose to highly application- or domain-specific, with the latter existing perhaps only within specific VOs.
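The capability idea behind community authorization can be made concrete with a small sketch: the authorization server signs a statement of what a member may do, and a resource verifies the signature and the encoded policy locally. Real systems such as Akenti use public-key certificates rather than the shared-key HMAC shown here, and all names below are invented for illustration:

```python
import hashlib
import hmac
import json

# Hypothetical secret shared by the community authorization server and resources.
COMMUNITY_KEY = b"demo-community-secret"

def issue_capability(member, resource, actions):
    """Authorization server side: sign a statement of what a member may do."""
    claim = json.dumps({"member": member, "resource": resource,
                        "actions": sorted(actions)}, sort_keys=True)
    tag = hmac.new(COMMUNITY_KEY, claim.encode(), hashlib.sha256).hexdigest()
    return claim, tag

def verify_capability(claim, tag, resource, action):
    """Resource side: check the signature, then check the policy it encodes."""
    expected = hmac.new(COMMUNITY_KEY, claim.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(tag, expected):
        return False  # forged or tampered capability
    stated = json.loads(claim)
    return stated["resource"] == resource and action in stated["actions"]

claim, tag = issue_capability("alice", "storage-03", ["read"])
assert verify_capability(claim, tag, "storage-03", "read")
assert not verify_capability(claim, tag, "storage-03", "write")
```

The point of the pattern is that the resource enforces a community-wide policy without contacting the authorization server on every request.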

Figure 3 Collective and Resource layer protocols, services, APIs, and SDKs can be combined in a variety of ways to deliver functionality to applications.

Collective functions can be implemented as persistent services, with associated protocols, or as SDKs (with associated APIs) designed to be linked with applications. In both cases, their implementation can build on Resource layer (or other Collective layer) protocols and APIs. For example, Figure 3 shows a Collective co-allocation API and SDK (the middle tier) that uses a Resource layer management protocol to manipulate underlying resources. Above this, we define a co-reservation service protocol and implement a co-reservation service that speaks this protocol, calling the co-allocation API to implement co-allocation operations and perhaps providing additional functionality, such as authorization, fault tolerance, and logging. An application might then use the co-reservation service protocol to request end-to-end network reservations.

Collective components may be tailored to the requirements of a specific user community, VO, or application domain, for example, an SDK that implements an application-specific coherency protocol, or a co-reservation service for a specific set of network resources. Other Collective components can be more general purpose, for example, a replication service that manages an international collection of storage systems for multiple communities, or a directory service designed to enable the discovery of VOs. In general, the larger the target user community, the more important it is that a Collective component's protocol(s) and API(s) be standards based.

Globus Toolkit: In addition to the example services listed earlier in this section, many of which build on Globus Connectivity and Resource protocols, we mention the Meta Directory Service, which introduces Grid Index Information Servers (GIISs) to support arbitrary views on resource subsets, with the LDAP information protocol used to access resource-specific GRISs to obtain resource state and GRRP used for resource registration.
Also, replica catalog and replica management services are used to support the management of dataset replicas in a Grid environment [35]. An on-line credential repository service (MyProxy) provides secure storage for proxy credentials [52]. The DUROC co-allocation library provides an SDK and API for resource co-allocation [53].
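The layering described above, in which a co-allocation function at the Collective layer invokes a Resource layer management operation on each underlying resource, can be sketched in a few lines. This is an illustrative model with invented names, not the DUROC API; note the all-or-nothing rollback, which is one reason co-allocation belongs above the individual resource managers:

```python
class ResourceManager:
    """Stands in for a single Resource-layer management endpoint (e.g., GRAM)."""

    def __init__(self, name, capacity):
        self.name, self.free = name, capacity

    def reserve(self, amount):
        if amount > self.free:
            raise RuntimeError(f"{self.name}: insufficient capacity")
        self.free -= amount

    def release(self, amount):
        self.free += amount

def co_allocate(requests):
    """Collective-layer co-allocation: all reservations succeed, or none hold."""
    granted = []
    try:
        for manager, amount in requests:
            manager.reserve(amount)          # one Resource-layer call per resource
            granted.append((manager, amount))
    except RuntimeError:
        for manager, amount in granted:      # roll back any partial allocation
            manager.release(amount)
        raise
    return granted

net_a, net_b = ResourceManager("net-a", 100), ResourceManager("net-b", 50)
co_allocate([(net_a, 40), (net_b, 40)])      # end-to-end reservation succeeds
try:
    co_allocate([(net_a, 40), (net_b, 40)])  # net-b now lacks capacity
except RuntimeError:
    pass
assert net_a.free == 60 and net_b.free == 10  # rollback left net-a untouched
```

A co-reservation service would wrap a function like this behind its own protocol, adding authorization, fault tolerance, and logging as described above.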

Figure 4 APIs are implemented by software development kits (SDKs), which in turn use Grid protocols to interact with network services that provide capabilities to the end user. Higher-level SDKs can provide functionality that is not directly mapped to a specific protocol but may combine protocol operations with calls to additional APIs as well as implement local functionality. Solid lines represent a direct call, dashed lines protocol interactions.

4.5 Applications

The final layer in our Grid architecture comprises the user applications that operate within a VO environment. Figure 4 illustrates an application programmer's view of Grid architecture. Applications are constructed in terms of, and by calling upon, services defined at any layer. At each layer, we have well-defined protocols that provide access to some useful service: resource management, data access, resource discovery, and so forth. At each layer, APIs may also be defined whose implementations (ideally provided by third-party SDKs) exchange protocol messages with the appropriate service(s) to perform desired actions.

We emphasize that what we label applications and show in a single layer in Figure 4 may in practice call upon sophisticated frameworks and libraries (e.g., the Common Component Architecture [54], SCIRun [45], CORBA [55, 56], Cactus [57], workflow systems [58]) and feature much internal structure that would, if captured in our figure, expand it out to many times its current size. These frameworks may themselves define protocols, services, and/or APIs (e.g., the Simple Workflow Access Protocol [58]). However, these issues are beyond the scope of this article, which addresses only the most fundamental protocols and services required in a Grid.

5. GRID ARCHITECTURE IN PRACTICE


We use two examples to illustrate how Grid architecture functions in practice. Table 1 shows the services that might be used to implement the multidisciplinary simulation and cycle sharing (ray tracing) applications introduced in Figure 1. The basic Fabric elements are the same in each case: computers, storage systems, and networks. Furthermore, each resource speaks standard Connectivity protocols for communication and security and Resource protocols for enquiry, allocation, and management. Above this, each application uses a mix of generic and more application-specific Collective services.
Table 1 The Grid services used to construct the two example applications of Figure 1. The Resource, Connectivity, and Fabric rows are common to both applications.

Layer                               Multidisciplinary Simulation         Ray Tracing
Collective (application-specific)   Solver coupler, distributed          Checkpointing, job management,
                                    data archiver                        failover, staging
Collective (generic)                Resource discovery, resource brokering, system monitoring,
                                    community authorization, certificate revocation
Resource                            Access to computation; access to data; access to information
                                    about system structure, state, performance
Connectivity                        Communication (IP), service discovery (DNS), authentication,
                                    authorization, delegation
Fabric                              Storage systems, computers, networks, code repositories, catalogs

In the case of the ray tracing application, we assume that this is based on a high-throughput computing system [25, 39]. In order to manage the execution of large numbers of largely independent tasks in a VO environment, this system must keep track of the set of active and pending tasks, locate appropriate resources for each task, stage executables to those resources, detect and respond to various types of failure, and so forth. An implementation in the context of our Grid architecture uses both domain-specific Collective services (dynamic checkpoint, task pool management, failover) and more generic Collective services (brokering, data replication for executables and common input files), as well as standard Resource and Connectivity protocols. Condor-G represents a first step toward this goal [39].

In the case of the multidisciplinary simulation application, the problems are quite different at the highest level. Some application framework (e.g., CORBA, CCA) may be used to construct the application from its various components. We also require mechanisms for discovering appropriate computational resources, for reserving time on those resources, for staging executables (perhaps), for providing access to remote storage, and so forth. Again, a number of domain-specific Collective services will be used (e.g., solver coupler, distributed data archiver), but the basic underpinnings are the same as in the ray tracing example.
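The high-throughput task-management loop described above (track pending tasks, broker each onto a resource, detect failures, and fail over) can be modeled in a few lines. All names here are hypothetical; a real system such as Condor-G adds staging, checkpointing, and persistent state:

```python
import random

def run_task_pool(tasks, resources, max_attempts=3):
    """Dispatch independent tasks, rescheduling any that fail on a resource."""
    results, pending = {}, [(t, 0) for t in tasks]
    while pending:
        task, attempts = pending.pop(0)
        resource = random.choice(resources)     # stand-in for resource brokering
        try:
            results[task] = resource(task)      # stage and execute on the resource
        except RuntimeError:
            if attempts + 1 >= max_attempts:
                results[task] = None            # give up after repeated failures
            else:
                pending.append((task, attempts + 1))  # failover: try elsewhere
    return results

def reliable(task):
    return task * task          # a healthy compute resource

def flaky(task):
    raise RuntimeError("resource failure")  # a resource that always fails

random.seed(1)
results = run_task_pool([1, 2, 3], [reliable, flaky])
assert set(results) == {1, 2, 3}                       # every task was resolved
assert all(results[t] in (t * t, None) for t in [1, 2, 3])
```

Even in this toy form, the structure shows why such systems sit at the Collective layer: the loop coordinates many resources but needs only per-resource submit/fail semantics from below.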

6. ON THE GRID: THE NEED FOR INTERGRID PROTOCOLS


Our Grid architecture establishes requirements for the protocols and APIs that enable sharing of resources, services, and code. It does not otherwise constrain the technologies that might be used to implement these protocols and APIs. In fact, it is quite feasible to define multiple instantiations of key Grid architecture elements. For example, we can construct both Kerberos- and PKI-based protocols at the Connectivity layer and access these security mechanisms via the same API, thanks to GSS-API (see Appendix). However, Grids constructed with these different protocols are not interoperable and cannot share essential services, at least not without gateways. For this reason, the long-term success of Grid computing requires that we select and achieve widespread deployment of one set of protocols at the Connectivity and Resource layers and, to a lesser extent, at the Collective layer. Much as the core Internet protocols enable different computer networks to interoperate and exchange information, these Intergrid protocols (as we might call them) enable different organizations to interoperate and exchange or share resources. Resources that speak these protocols can be said to be on the Grid. Standard APIs are also highly useful if Grid code is to be shared. The identification of these Intergrid protocols and APIs is beyond the scope of this article, although the Globus Toolkit represents an approach that has had some success to date.

7. RELATIONSHIPS WITH OTHER TECHNOLOGIES


The concept of controlled, dynamic sharing within VOs is so fundamental that we might assume that Grid-like technologies must surely already be widely deployed. In practice, however, while the need for these technologies is indeed widespread, in a wide variety of different areas we find only primitive and inadequate solutions to VO problems. In brief, current distributed computing approaches do not provide a general resource-sharing framework that addresses VO requirements. Grid technologies distinguish themselves by providing this generic approach to resource sharing. This situation points to numerous opportunities for the application of Grid technologies.

7.1 World Wide Web

The ubiquity of Web technologies (i.e., IETF and W3C standard protocols such as TCP/IP, HTTP, and SOAP, and languages such as HTML and XML) makes them attractive as a platform for constructing VO systems and applications. However, while these technologies do an excellent job of supporting the browser-client-to-Web-server interactions that are the foundation of today's Web, they lack features required for the richer interaction models that occur in VOs. For example, today's Web browsers typically use TLS for authentication but do not support single sign-on or delegation.

Clear steps can be taken to integrate Grid and Web technologies. For example, the single sign-on capabilities provided in the GSI extensions to TLS would, if integrated into Web browsers, allow for single sign-on to multiple Web servers. GSI delegation capabilities would permit a browser client to delegate capabilities to a Web server so that the server could act on the client's behalf. These capabilities, in turn, make it much easier to use Web technologies to build VO portals that provide thin-client interfaces to sophisticated VO applications. WebOS addresses some of these issues [59].

7.2 Application and storage service providers

Application service providers, storage service providers, and similar hosting companies typically offer to outsource specific business and engineering applications (in the case of ASPs) and storage capabilities (in the case of SSPs). A customer negotiates a service level agreement that defines access to a specific combination of hardware and software. Security tends to be handled by using VPN technology to extend the customer's intranet to encompass resources operated by the ASP or SSP on the customer's behalf. Other SSPs offer file-sharing services, in which case access is provided via HTTP, FTP, or WebDAV, with user ids, passwords, and access control lists controlling access.

From a VO perspective, these are low-level building-block technologies. VPNs and static configurations make many VO sharing modalities hard to achieve. For example, the use of VPNs means that it is typically impossible for an ASP application to access data located on storage managed by a separate SSP. Similarly, dynamic reconfiguration of resources within a single ASP or SSP is challenging and, in fact, is rarely attempted. The load sharing across providers that occurs on a routine basis in the electric power industry is unheard of in the hosting industry. A basic problem is that a VPN is not a VO: it cannot extend dynamically to encompass other resources and does not provide the remote resource provider with any control of when and whether to share its resources.

The integration of Grid technologies into ASPs and SSPs can enable a much richer range of possibilities. For example, standard Grid services and protocols can be used to achieve a decoupling of the hardware and software. A customer could negotiate an SLA for particular hardware resources and then use Grid resource protocols to dynamically provision that hardware to run customer-specific applications.
Flexible delegation and access control mechanisms would allow a customer to grant an application running on an ASP computer direct, efficient, and secure access to data on SSP storage and/or to couple resources from multiple ASPs and SSPs with its own resources, when required for more complex problems. A single sign-on security infrastructure able to span multiple security domains dynamically is, realistically, required to support such scenarios. Grid resource management and accounting/payment protocols that allow for dynamic provisioning and reservation of capabilities (e.g., amount of storage, transfer bandwidth) are also critical.

7.3 Enterprise computing systems

Enterprise development technologies such as CORBA, Enterprise Java Beans, Java 2 Enterprise Edition, and DCOM are all systems designed to enable the construction of distributed applications. They provide standard resource interfaces, remote invocation mechanisms, and trading services for discovery and hence make it easy to share resources within a single organization. However, these mechanisms address none of the specific VO requirements listed above. Sharing arrangements are typically relatively static and restricted to occur within a single organization. The primary form of interaction is client-server, rather than the coordinated use of multiple resources.

These observations suggest that there should be a role for Grid technologies within enterprise computing. For example, in the case of CORBA, we could construct an object request broker (ORB) that uses GSI mechanisms to address cross-organizational security issues. We could implement a Portable Object Adaptor that speaks the Grid resource management protocol to access resources spread across a VO. We could construct Grid-enabled Naming and Trading services that use Grid information service protocols to query information sources distributed across large VOs.
In each case, the use of Grid protocols provides enhanced capability (e.g., interdomain security) and enables interoperability with other (non-CORBA) clients. Similar observations can be made about Java and Jini. For example, Jini's protocols and implementation are geared toward a small collection of devices. A Grid Jini that employed Grid protocols and services would allow the use of Jini abstractions in a large-scale, multi-enterprise environment.

7.4 Internet and peer-to-peer computing

Peer-to-peer computing (as implemented, for example, in the Napster, Gnutella, and Freenet [60] file-sharing systems) and Internet computing (as implemented, for example, by the SETI@home, Parabon, and Entropia systems) are examples of the more general (beyond client-server) sharing modalities and computational structures that we referred to in our characterization of VOs. As such, they have much in common with Grid technologies.

In practice, we find that the technical focus of work in these domains has not overlapped significantly to date. One reason is that peer-to-peer and Internet computing developers have so far focused entirely on vertically integrated (stovepipe) solutions, rather than seeking to define common protocols that would allow for shared infrastructure and interoperability. (This is, of course, a common characteristic of new market niches, in which participants still hope for a monopoly.) Another is that the forms of sharing targeted by various applications are quite limited, for example, file sharing with no access control, and computational sharing with a centralized server.

As these applications become more sophisticated and the need for interoperability becomes clearer, we will see a strong convergence of interests between peer-to-peer, Internet, and Grid computing [61]. For example, single sign-on, delegation, and authorization technologies become important when computational and data-sharing services must interoperate, and the policies that govern access to individual resources become more complex.

8. OTHER PERSPECTIVES ON GRIDS


The perspective on Grids and VOs presented in this article is of course not the only view that can be taken. We summarize and critique some alternative perspectives here.

+ The Grid is a next-generation Internet. The Grid is not an alternative to the Internet: it is rather a set of additional protocols and services that build on Internet protocols and services to support the creation and use of computation- and data-enriched environments. Any resource that is on the Grid is also, by definition, on the Net.

+ The Grid is a source of free cycles. Grid computing does not imply unrestricted access to resources. Grid computing is about controlled sharing. Resource owners will typically want to enforce policies that constrain access according to group membership, ability to pay, and so forth. Hence, accounting is important, and a Grid architecture must incorporate resource and collective protocols for exchanging usage and cost information, as well as for exploiting this information when deciding whether to enable sharing.

+ The Grid requires a distributed operating system. In this view (e.g., see [62]), Grid software should define the operating system services to be installed on every participating system, with these services providing for the Grid what an operating system provides for a single computer: namely, transparency with respect to location, naming, security, and so forth. Put another way, this perspective views the role of Grid software as defining a virtual machine. However, we feel that this perspective is inconsistent with our primary goals of broad deployment and interoperability. We argue that the appropriate model is rather the Internet Protocol suite, which provides largely orthogonal services that address the unique concerns that arise in networked environments. The tremendous physical and administrative heterogeneities encountered in Grid environments mean that the traditional transparencies are unobtainable; on the other hand, it does appear feasible to obtain agreement on standard protocols. The architecture proposed here is deliberately open rather than prescriptive: it defines a compact and minimal set of protocols that a resource must speak to be on the Grid; beyond this, it seeks only to provide a framework within which many behaviors can be specified.

+ The Grid requires new programming models. Programming in Grid environments introduces challenges that are not encountered in sequential (or parallel) computers, such as multiple administrative domains, new failure modes, and large variations in performance. However, we argue that these are incidental, not central, issues and that the basic programming problem is not fundamentally different. As in other contexts, abstraction and encapsulation can reduce complexity and improve reliability. But, as in other contexts, it is desirable to allow a wide variety of higher-level abstractions to be constructed, rather than enforcing a particular approach. So, for example, a developer who believes that a universal distributed shared-memory model can simplify Grid application development should implement this model in terms of Grid protocols, extending or replacing those protocols only if they prove inadequate for this purpose. Similarly, a developer who believes that all Grid resources should be presented to users as objects needs simply to implement an object-oriented API in terms of Grid protocols.

+ The Grid makes high-performance computers superfluous. The hundreds, thousands, or even millions of processors that may be accessible within a VO represent a significant source of computational power, if they can be harnessed in a useful fashion. This does not imply, however, that traditional high-performance computers are obsolete. Many problems require tightly coupled computers, with low latencies and high communication bandwidths; Grid computing may well increase, rather than reduce, demand for such systems by making access easier.

9. SUMMARY
We have provided in this article a concise statement of the Grid problem, which we define as controlled and coordinated resource sharing and resource use in dynamic, scalable virtual organizations. We have also presented both requirements and a framework for a Grid architecture, identifying the principal functions required to enable sharing within VOs and defining key relationships among these different functions. Finally, we have discussed in some detail how Grid technologies relate to other important technologies. We hope that the vocabulary and structure introduced in this document will prove useful to the emerging Grid community, by improving understanding of our problem and providing a common language for describing solutions. We also hope that our analysis will help establish connections among Grid developers and proponents of related technologies.

The discussion in this paper also raises a number of important questions. What are appropriate choices for the Intergrid protocols that will enable interoperability among Grid systems? What services should be present in a persistent fashion (rather than being duplicated by each application) to create usable Grids? And what are the key APIs and SDKs that must be delivered to users in order to accelerate development and deployment of Grid applications? We have our own opinions on these questions, but the answers clearly require further research.

APPENDIX: DEFINITIONS

We define here four terms that are fundamental to the discussion in this article but are frequently misunderstood and misused, namely, protocol, service, SDK, and API.

+ Protocol: A protocol is a set of rules that end points in a telecommunication system use when exchanging information. For example: The Internet Protocol (IP) defines an unreliable packet transfer protocol. The Transmission Control Protocol (TCP) builds on IP to define a reliable data delivery protocol. The Transport Layer Security (TLS) protocol [31] defines a protocol to provide privacy and data integrity between two communicating applications; it is layered on top of a reliable transport protocol such as TCP. The Lightweight Directory Access Protocol (LDAP) builds on TCP to define a query-response protocol for querying the state of a remote database. An important property of protocols is that they admit to multiple implementations: two end points need only implement the same protocol to be able to communicate. Standard protocols are thus fundamental to achieving interoperability in a distributed computing environment. A protocol definition also says little about the behavior of an entity that speaks the protocol. For example, the FTP protocol definition indicates the format of the messages used to negotiate a file transfer but does not make clear how the receiving entity should manage its files. As the above examples indicate, protocols may be defined in terms of other protocols.

+ Service: A service is a network-enabled entity that provides a specific capability, for example, the ability to move files, create processes, or verify access rights. A service is defined in terms of the protocol one uses to interact with it and the behavior expected in response to various protocol message exchanges (i.e., service = protocol + behavior). A service definition may permit a variety of implementations. For example: An FTP server speaks the File Transfer Protocol and supports remote read and write access to a collection of files. One FTP server implementation may simply write to and read from the server's local disk, while another may write to and read from a mass storage system, automatically compressing and uncompressing files in the process. From a Fabric-level perspective, the behaviors of these two servers in response to a store request (or retrieve request) are very different. From the perspective of a client of this service, however, the behaviors are indistinguishable; storing a file and then retrieving the same file will yield the same results regardless of which server implementation is used. An LDAP server speaks the LDAP protocol and supports response to queries. One LDAP server implementation may respond to queries using a database of information, while another may respond to queries by dynamically making SNMP calls to generate the necessary information on the fly. A service may or may not be persistent (i.e., always available), be able to detect and/or recover from certain errors, run with privileges, and/or have a distributed implementation for enhanced scalability. If variants are possible, then discovery mechanisms that allow a client to determine the properties of a particular instantiation of a service are important. Note also that one can define different services that speak the same protocol. For example, in the Globus Toolkit, both the replica catalog [35] and information service [33] use LDAP.

+ API: An Application Program Interface (API) defines a standard interface (e.g., a set of subroutine calls, or objects and method invocations in the case of an object-oriented API) for invoking a specified set of functionality. For example: The Generic Security Service (GSS) API [36] defines standard functions for verifying the identity of communicating parties, encrypting messages, and so forth. The Message Passing Interface API [63] defines standard interfaces, in several languages, to functions used to transfer data among processes in a parallel computing system. An API may define multiple language bindings or use an interface definition language. The language may be a conventional programming language such as C or Java, or it may be a shell interface. In the latter case, the API refers to a particular definition of command line arguments to the program, the input and output of the program, and the exit status of the program. An API normally will specify a standard behavior but can admit to multiple implementations.
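The FTP-server example above (two implementations of one protocol that a client cannot tell apart) can be mimicked with a toy sketch: both objects speak the same store/retrieve interface, one compresses internally, yet storing and retrieving a file yields identical results either way. The classes below are invented for illustration:

```python
import zlib

class PlainFileServer:
    """Stores files verbatim; the dict stands in for the server's local disk."""

    def __init__(self):
        self._disk = {}

    def store(self, name, data):
        self._disk[name] = data

    def retrieve(self, name):
        return self._disk[name]

class CompressingFileServer:
    """Same 'protocol', different behavior: compresses on store, expands on retrieve."""

    def __init__(self):
        self._disk = {}

    def store(self, name, data):
        self._disk[name] = zlib.compress(data)

    def retrieve(self, name):
        return zlib.decompress(self._disk[name])

payload = b"structure factors " * 100
for server in (PlainFileServer(), CompressingFileServer()):
    server.store("results.dat", payload)
    # From the client's perspective the two services are indistinguishable.
    assert server.retrieve("results.dat") == payload
```

This is exactly the sense in which service = protocol + behavior: the externally visible behavior is fixed by the definition, while internal behavior is free to vary.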

It is important to understand the relationship between APIs and protocols. A protocol definition says nothing about the APIs that might be called from within a program to generate protocol messages. A single protocol may have many APIs; a single API may have multiple implementations that target different protocols. In brief, standard APIs enable portability; standard protocols enable interoperability. For example, both public key and Kerberos bindings have been defined for the GSS-API [36]. Hence, a program that uses GSS-API calls for authentication operations can operate in either a public key or a Kerberos environment without change. On the other hand, if we want a program to operate in a public key and a Kerberos environment at the same time, then we need a standard protocol that supports interoperability of these two environments. See Figure 5.

+ SDK: The term software development kit (SDK) denotes a set of code designed to be linked with, and invoked from within, an application program to provide specified functionality. An SDK typically implements an API. If an API admits to multiple implementations, then there will be multiple SDKs for that API. Some SDKs provide access to services via a particular protocol. For example: The OpenLDAP release includes an LDAP client SDK, which contains a library of functions that can be used from a C or C++ application to perform queries to an LDAP service.

Figure 5 On the left, an API is used to develop applications that can target either Kerberos or PKI security mechanisms. On the right, protocols (the Grid security protocols provided by the Globus Toolkit) are used to enable interoperability between Kerberos and PKI domains.

JNDI is a Java SDK that contains functions that can be used to perform queries to an LDAP service. Different SDKs implement GSS-API using the TLS and Kerberos protocols, respectively. There may be multiple SDKs, for example, from multiple vendors, that implement a particular protocol. Further, for client-server oriented protocols, there may be separate client SDKs for use by applications that want to access a service and server SDKs for use by service implementers that want to implement particular, customized service behaviors. An SDK need not speak any protocol. For example, an SDK that provides numerical functions may act entirely locally and not need to speak to any services to perform its operations.
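The claim that standard APIs enable portability while standard protocols enable interoperability can be illustrated with a toy API in the spirit of the GSS-API bindings just mentioned. The class names and credential strings below are invented and bear no relation to real GSS-API signatures:

```python
class KerberosBinding:
    """Hypothetical stand-in for a Kerberos security mechanism."""

    def authenticate(self, principal):
        return f"krb-ticket:{principal}"

class PKIBinding:
    """Hypothetical stand-in for a public-key security mechanism."""

    def authenticate(self, principal):
        return f"x509-proxy:{principal}"

def establish_context(mechanism, principal):
    """Application code: written once against the API, unchanged across bindings."""
    return mechanism.authenticate(principal)

# The same application call is portable across security environments...
assert establish_context(KerberosBinding(), "alice").startswith("krb-")
assert establish_context(PKIBinding(), "alice").startswith("x509-")
# ...but the two credentials do not interoperate; making a Kerberos domain talk
# to a PKI domain requires a common protocol, not a common API.
```

The sketch captures the asymmetry drawn in Figure 5: swapping the binding requires no application change, whereas cross-domain operation requires agreement on the wire.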

REFERENCES
1. Foster, I. and Kesselman, C. (eds) (1999) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann.
2. Beiriger, J., Johnson, W., Bivens, H., Humphreys, S. and Rhea, R. (2000) Constructing the ASCI Grid. Proceedings of the 9th IEEE Symposium on High Performance Distributed Computing, IEEE Press.
3. Brunett, S. et al. (1998) Application experiences with the Globus Toolkit. Proceedings of the 7th IEEE Symposium on High Performance Distributed Computing, IEEE Press, pp. 81-89.
4. Johnston, W. E., Gannon, D. and Nitzberg, B. (1999) Grids as production computing environments: The engineering aspects of NASA's Information Power Grid. Proceedings of the 8th IEEE Symposium on High Performance Distributed Computing, IEEE Press.
5. Stevens, R., Woodward, P., DeFanti, T. and Catlett, C. (1997) From the I-WAY to the National Technology Grid. Communications of the ACM, 40(11), 50-61.
6. Sculley, A. and Woods, W. (2000) B2B Exchanges: The Killer Application in the Business-to-Business Internet Revolution. ISI Publications.
7. Barry, J. et al. (1998) NIIIP-SMART: An investigation of distributed object approaches to support MES development and deployment in a virtual enterprise. 2nd International Enterprise Distributed Computing Workshop, IEEE Press.
8. Chervenak, A., Foster, I., Kesselman, C., Salisbury, C. and Tuecke, S. (2001) The Data Grid: Towards an architecture for the distributed management and analysis of large scientific data sets. Journal of Network and Computer Applications.
9. Hoschek, W., Jaen-Martinez, J., Samar, A., Stockinger, H. and Stockinger, K. (2000) Data management in an international Data Grid project. Proceedings of the 1st IEEE/ACM International Workshop on Grid Computing, Springer-Verlag.

10. Moore, R., Baru, C., Marciano, R., Rajasekar, A. and Wan, M. (1999) Data-intensive computing, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, pp. 105-129.
11. Camarinha-Matos, L. M., Afsarmanesh, H., Garita, C. and Lima, C. Towards an architecture for virtual enterprises. Journal of Intelligent Manufacturing.
12. Aiken, R. et al. (2000) Network Policy and Services: A Report of a Workshop on Middleware, RFC 2768, IETF, http://www.ietf.org/rfc/rfc2768.txt.
13. Arnold, K., O'Sullivan, B., Scheifler, R. W., Waldo, J. and Wollrath, A. (1999) The Jini Specification. Addison-Wesley. See also www.sun.com/jini.
14. (1994) Realizing the Information Future: The Internet and Beyond. National Academy Press, http://www.nap.edu/readingroom/books/rtif/.
15. Foster, I. and Kesselman, C. (1998) The Globus Project: A status report. Proceedings of the Heterogeneous Computing Workshop, IEEE Press, pp. 4-18.
16. Tierney, B., Johnston, W., Lee, J. and Hoo, G. (1996) Performance analysis in high-speed wide area IP over ATM networks: Top-to-bottom end-to-end monitoring. IEEE Networking.
17. Beynon, M., Ferreira, R., Kurc, T., Sussman, A. and Saltz, J. (2000) DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. Proceedings of the 8th Goddard Conference on Mass Storage Systems and Technologies/17th IEEE Symposium on Mass Storage Systems, pp. 119-133.
18. Baru, C., Moore, R., Rajasekar, A. and Wan, M. (1998) The SDSC Storage Resource Broker. Proceedings of the CASCON'98 Conference.
19. Dinda, P. and O'Hallaron, D. (1999) An evaluation of linear models for host load prediction. Proceedings of the 8th IEEE Symposium on High Performance Distributed Computing, IEEE Press.
20. Lowekamp, B., Miller, N., Sutherland, D., Gross, T., Steenkiste, P. and Subhlok, J. (1998) A resource query interface for network-aware applications. Proceedings of the 7th IEEE Symposium on High Performance Distributed Computing, IEEE Press.
21. Wolski, R. (1997) Forecasting network performance to support dynamic scheduling using the Network Weather Service. Proceedings of the 6th IEEE Symposium on High Performance Distributed Computing, Portland, Oregon.
22. Foster, I., Roy, A. and Sander, V. (2000) A quality of service architecture that combines resource reservation and application adaptation. Proceedings of the 8th International Workshop on Quality of Service.
23. Papakhian, M. (1998) Comparing job-management systems: The user's perspective. IEEE Computational Science & Engineering. See also http://pbs.mrj.com.
24. Litzkow, M., Livny, M. and Mutka, M. (1988) Condor - a hunter of idle workstations. Proceedings of the 8th International Conference on Distributed Computing Systems, pp. 104-111.
25. Livny, M. (1999) High-throughput resource management, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, pp. 311-337.
26. Baker, F. (1995) Requirements for IP Version 4 Routers, RFC 1812, IETF, http://www.ietf.org/rfc/rfc1812.txt.
27. Butler, R., Engert, D., Foster, I., Kesselman, C., Tuecke, S., Volmer, J. and Welch, V. (2000) Design and deployment of a national-scale authentication infrastructure. IEEE Computer, 33(12), 60-66.
28. Foster, I., Kesselman, C., Tsudik, G. and Tuecke, S. (1998) A security architecture for computational Grids. ACM Conference on Computers and Security, pp. 83-91.
29. Gasser, M. and McDermott, E. (1990) An architecture for practical delegation in a distributed system. Proceedings of the 1990 IEEE Symposium on Research in Security and Privacy, IEEE Press, pp. 20-30.
30. Howell, J. and Kotz, D. (2000) End-to-end authorization. Proceedings of the 2000 Symposium on Operating Systems Design and Implementation, USENIX Association.
31. Dierks, T. and Allen, C. (1999) The TLS Protocol Version 1.0, RFC 2246, IETF, http://www.ietf.org/rfc/rfc2246.txt.
32. Steiner, J., Neuman, B. C. and Schiller, J. (1988) Kerberos: An authentication system for open network systems. Proceedings of the Usenix Conference, pp. 191-202.
33. Czajkowski, K., Fitzgerald, S., Foster, I. and Kesselman, C. (2001) Grid information services for distributed resource sharing.
34. Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W. and Tuecke, S. (1998) A resource management architecture for metacomputing systems. The 4th Workshop on Job Scheduling Strategies for Parallel Processing, pp. 62-82.
35. Allcock, B. et al. (2001) Secure, efficient data transport and replica management for high-performance data-intensive computing. Mass Storage Conference.
36. Linn, J. (2000) Generic Security Service Application Program Interface Version 2, Update 1, RFC 2743, IETF, http://www.ietf.org/rfc/rfc2743.
37. Berman, F. (1999) High-performance schedulers, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, pp. 279-309.
38. Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G. (1996) Application-level scheduling on distributed heterogeneous networks. Proceedings of Supercomputing '96.
39. Frey, J., Foster, I., Livny, M., Tannenbaum, T. and Tuecke, S. (2001) Condor-G: A Computation Management Agent for Multi-Institutional Grids. University of Wisconsin-Madison.
40. Abramson, D., Sosic, R., Giddy, J. and Hall, B. (1995) Nimrod: A tool for performing parameterized simulations using distributed workstations. Proceedings of the 4th IEEE Symposium on High Performance Distributed Computing.
41. Foster, I. and Karonis, N. (1998) A Grid-enabled MPI: Message passing in heterogeneous distributed computing systems. Proceedings of SC98.
42. Gabriel, E., Resch, M., Beisel, T. and Keller, R. (1998) Distributed computing in a heterogeneous computing environment. Proceedings of EuroPVM/MPI'98.
43. Casanova, H., Obertelli, G., Berman, F. and Wolski, R. (2000) The AppLeS parameter sweep template: User-level middleware for the Grid. Proceedings of SC2000.
44. Goux, J.-P., Kulkarni, S., Linderoth, J. and Yoder, M. (2000) An enabling framework for master-worker applications on the computational Grid. Proceedings of the 9th IEEE Symposium on High Performance Distributed Computing, IEEE Press.
45. Casanova, H., Dongarra, J., Johnson, C. and Miller, M. (1999) Application-specific tools, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, pp. 159-180.
46. Casanova, H. and Dongarra, J. (1997) NetSolve: A network server for solving computational science problems. International Journal of Supercomputer Applications and High Performance Computing, 11(3), 212-223.
47. Nakada, H., Sato, M. and Sekiguchi, S. (1999) Design and implementations of Ninf: Towards a global computing infrastructure. Future Generation Computing Systems.
48. Thompson, M., Johnston, W., Mudumbai, S., Hoo, G., Jackson, K. and Essiari, A. (1999) Certificate-based access control for widely distributed resources. Proceedings of the 8th Usenix Security Symposium.
49. DeFanti, T. and Stevens, R. (1999) Teleimmersion, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, pp. 131-155.
50. Leigh, J., Johnson, A. and DeFanti, T. A. CAVERN: A distributed architecture for supporting scalable persistence and interoperability in collaborative virtual environments. Virtual Reality: Research, Development and Applications, 2(2), 217-237.
51. Childers, L., Disz, T., Olson, R., Papka, M. E., Stevens, R. and Udeshi, T. (2000) Access Grid: Immersive group-to-group collaborative visualization. Proceedings of the 4th International Immersive Projection Technology Workshop.
52. Novotny, J., Tuecke, S. and Welch, V. (2001) Initial Experiences with an Online Certificate Repository for the Grid: MyProxy.
53. Czajkowski, K., Foster, I. and Kesselman, C. (1999) Co-allocation services for computational Grids. Proceedings of the 8th IEEE Symposium on High Performance Distributed Computing, IEEE Press.
54. Armstrong, R., Gannon, D., Geist, A., Keahey, K., Kohn, S., McInnes, L. and Parker, S. (1999) Toward a common component architecture for high performance scientific computing. Proceedings of the 8th IEEE Symposium on High Performance Distributed Computing.
55. Gannon, D. and Grimshaw, A. (1999) Object-based approaches, in Foster, I. and Kesselman, C. (eds) The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, pp. 205-236.
56. Lopez, I. et al. (2000) NPSS on NASA's IPG: Using CORBA and Globus to coordinate multidisciplinary aeroscience applications. Proceedings of the NASA HPCC/CAS Workshop, NASA Ames Research Center.
57. Benger, W., Foster, I., Novotny, J., Seidel, E., Shalf, J., Smith, W. and Walker, P. (1999) Numerical relativity in a distributed environment. Proceedings of the 9th SIAM Conference on Parallel Processing for Scientific Computing.
58. Bolcer, G. A. and Kaiser, G. (1999) SWAP: Leveraging the web to manage workflow. IEEE Internet Computing, 85-88.
59. Vahdat, A., Belani, E., Eastham, P., Yoshikawa, C., Anderson, T., Culler, D. and Dahlin, M. (1998) WebOS: Operating system services for wide area applications. 7th Symposium on High Performance Distributed Computing, July 1998.
60. Clarke, I., Sandberg, O., Wiley, B. and Hong, T. W. (1999) Freenet: A distributed anonymous information storage and retrieval system. ICSI Workshop on Design Issues in Anonymity and Unobservability.
61. Foster, I. (2000) Internet computing and the emerging Grid. Nature Web Matters, http://www.nature.com/nature/webmatters/grid/grid.html.
62. Grimshaw, A. and Wulf, W. (1996) Legion - a view from 50,000 feet. Proceedings of the 5th IEEE Symposium on High Performance Distributed Computing, IEEE Press, pp. 89-99.
63. Gropp, W., Lusk, E. and Skjellum, A. (1994) Using MPI: Portable Parallel Programming with the Message Passing Interface. MIT Press.

Rationale for choosing the Open Grid Services Architecture 7


Malcolm Atkinson

1. Introduction
2. The significance of data for e-Science
3. Building on an OGSA platform
3.1 Web services
3.2 The Open Grid Services Architecture
4. Case for OGSA
5. The challenge of OGSA
6. Planning the United Kingdom's OGSA contributions
7. Establishing common infrastructure
7.1 Grid services hosting environment
7.2 Standard types
8. Baseline database access
8.1 An overview of the OGSA-DAI architecture
9. Baseline logging infrastructure
10. Summary and conclusions

1. INTRODUCTION
This chapter presents aspects of the UK e-Science community's plans for generic Grid middleware. In particular, it derives from the discussions of the UK Architecture Task Force [1]. The UK e-Science Core Programme will focus on architecture and middleware development in order to contribute significantly to the emerging Open Grid Services Architecture (OGSA) [2]. This architecture views Grid technology as a generic integration mechanism assembled from Grid Services (GS), which extend Web Services (WS) to comply with additional Grid requirements. The principal extensions from WS to GS are the management of state, identification, sessions and life cycles, and the introduction of a notification mechanism in conjunction with Grid service data elements (SDE). The UK e-Science Programme has many pilot projects that require integration technology and has an opportunity, through its Core Programme, to lead these projects towards adopting OGSA as a common framework. That framework must be suitable: for example, it must support adequate Grid service interoperability and portability. It must also be populated with services that support commonly required functions, such as authorisation, accounting and data transformation. To obtain effective synergy with the international community that is developing Grid standards, and to best serve the United Kingdom's community of scientists, it is necessary to focus the United Kingdom's middleware development resources on a family of GS for which the United Kingdom is primarily responsible and to deliver their reference implementations. The UK e-Science and computing science community is well placed to contribute substantially to structured data integration services [3-15]. Richer information models should be introduced at the earliest opportunity to progressively approach the goal of a semantic Grid (see Chapter 17).
The UK e-Science community also recognises an urgent need for accounting mechanisms and has the expertise to develop them in conjunction with international efforts. This chapter develops the rationale for working with OGSA and a plan for developing commonly required middleware complementary to the planned baseline Globus Toolkit 3 provision. It takes the development of services for accessing and integrating structured data via the Grid as an example and shows how this will map to GS.

2. THE SIGNIFICANCE OF DATA FOR e-SCIENCE


The fundamental goal of the e-Science programme is to enable scientists to perform their science more effectively. The methods and principles of e-Science should become so pervasive that scientists can use them naturally whenever they are appropriate, just as they use mathematics today. The goal is to arrive at the state where we just say 'science'. Just as there are branches of mathematics that support different scientific domains, so there will be differentiated branches of computation. We are in a pioneering phase, in which the methods and principles must be elucidated and made accessible, and in which the differentiation of domain requirements must be explored. We are confident that, as with mathematics, these results will have far wider application than the scientific testing ground where we are developing them.

The transition that we are catalysing is driven by technology and is largely manifest in the tsunami of data (see Chapter 36). Detectors and instruments benefit from Moore's law, so that in astronomy, for instance, the available data is doubling every year [16, 17]. Robotics and nanoengineering accelerate and multiply the output from laboratories; for example, the available genetic-sequence data is doubling every nine months [16]. The volume of data we can store at a given cost doubles each year. The rate at which we can move data is doubling every nine months. Mobile sensors, satellites, ocean-exploring robots, clouds of disposable micro-sensors and personal-health sensors, combined with digital radio communication, are rapidly extending the sources of data.

These changes warrant a change in scientific behaviour. The norm should be to collect, annotate, curate and share data. This is already a trend in subjects such as large-scale physics, astronomy, functional genomics and earth sciences, but perhaps it is not yet as prevalent as it should be. For example, the output of many confocal microscopes, the raw data from many micro-arrays and the streams of data from automated pathology labs and digital medical scanners do not yet appear as a matter of course for scientific use and analysis. It is reasonable to assume that if the benefits of data mining and of correlating data from multiple sources become widely recognised, more data will be available in shared, often public, repositories. This wealth of data has enormous potential. Frequently, data contains information relevant to many more topics than the specific science, engineering or medicine that motivated its original collection and determined its structure.
If we are able to compose and study these large collections of data, looking for correlations and anomalies, they may yield an era of rapid scientific, technological and medical progress. But discovering valuable knowledge in the mountains of data is well beyond unaided human capacity. Sophisticated computational approaches must be developed, and their application will require the skills of scientists, engineers, computer scientists, statisticians and many other experts. Our challenge is to enable both the development of this sophisticated computation and the collaboration of all of those who should steer it. The whole process must be attainable by the majority of scientists, sustainable within a typical economy and trustable by scientists, politicians and the general public. Developing the computational approaches, and the practices that exploit them, will surely be one of the major differentiated domains of e-Science support. The challenge of making good use of growing volumes of diverse data is not exclusive to science and medicine. In government, business, administration, health care, the arts and humanities, we may expect to see similar challenges, and similar advantages in mastering them. Basing decisions, judgements and understanding on reliable tests against trustworthy data must benefit industrial, commercial, scientific and social goals. It requires an infrastructure to support the sharing, integration, federation and analysis of data.
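The doubling periods quoted above compound quickly. A short calculation (doubling periods taken from the text; the five-year horizon is chosen purely for illustration) shows why infrastructure planned against today's data volumes is soon overwhelmed.

```python
def growth_factor(doubling_months: float, horizon_months: float) -> float:
    """How much a quantity grows if it doubles every `doubling_months`."""
    return 2.0 ** (horizon_months / doubling_months)

FIVE_YEARS = 60.0  # months

# Doubling periods quoted in the text:
sequence_data = growth_factor(9, FIVE_YEARS)    # genetic-sequence data
astronomy_data = growth_factor(12, FIVE_YEARS)  # astronomy data
storage = growth_factor(12, FIVE_YEARS)         # storage per unit cost
network = growth_factor(9, FIVE_YEARS)          # data-movement rate
```

Over five years, quantities doubling yearly grow 32-fold, while those doubling every nine months grow roughly 100-fold, so sources that outpace storage today will outpace it even more tomorrow.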

3. BUILDING ON AN OGSA PLATFORM


The OGSA emerged contemporaneously with the UK e-Science review of architecture and was a major and welcome influence. OGSA is the product of combining the flexible, dynamically bound integration architecture of WS with the scalable distributed architecture of the Grid. As both are still evolving rapidly, discussion must be hedged with the caveat that significant changes to OGSA's definition will have occurred by the time this chapter is read. OGSA is well described in other chapters of this book (see Chapter 8) and has been the subject of several reviews, for example, References [18, 19]. It is considered as the basis for a data Grid (see Chapter 15) and is expected to emerge as a substantial advance over the existing Globus Toolkit (GT2) and as the basis for a widely adopted Grid standard.

3.1 Web services

Web Services are an emerging integration architecture designed to allow independently operated information systems to intercommunicate. Their definition is the subject of W3C standards processes in which major companies, for example, IBM, Oracle, Microsoft and Sun, are participating actively. WS are described well in Reference [20], which offers the following definition: 'A Web service is a platform and implementation independent software component that can be described using a service description language, published to a registry of services, discovered through a standard mechanism (at run time or design time), invoked through a declared Application Programming Interface (API), usually over a network, and composed with other services.'

WS are of interest to the e-Science community on two counts:

1. Their function of interconnecting information systems is similar to the Grid's intended function. Such interconnection is a common requirement, as scientific systems are often composed from many existing components and systems.
2. The support of companies for Web services standards will deliver description languages, platforms, common services and software development tools. These will enable rapid development of Grid services and applications by providing a standard framework for describing and composing Web services and Grid services. They will also facilitate the commercialisation of the products of e-Science research.

An important feature of WS is the emergence of languages, independent of the implementation and platform technologies, for describing aspects of the components they integrate. These draw heavily on the power of XML Schema. For example, the Web Services Description Language (WSDL) is used to describe the function and interfaces (portTypes) of Web services, and the Web Services Inspection Language (WSIL) is used to support simple registration and discovery systems. The Simple Object Access Protocol (SOAP) is a common-denominator interconnection language that transmits structured data across representational boundaries. There is currently considerable activity proposing revisions of these standards, as well as additional languages for describing the integration and coordination of WS, for describing quality-of-service properties and for extending Web service semantics to incorporate state, more sophisticated types for ports, and transactions. It is uncertain what will emerge, though it is clear that the already strong support for distributed system integration will be strengthened. This will be useful for many of the integration tasks required to support e-Science. Inevitably, the products lag behind the aspirations of the standards proposals and vary significantly.
Nevertheless, they frequently include sophisticated platforms to support operations, combined with powerful development tools, and it is important that developers of e-Science applications take advantage of these. Consequently, the integration architectures used by e-Science should remain compatible with Web services, and e-Science developers should consider carefully before they develop alternatives.

3.2 The Open Grid Services Architecture

As other chapters describe OGSA (see Chapter 8), it receives only minimal description here, mainly to introduce vocabulary for later sections. A system compliant with OGSA is built by composing GS. Each Grid service is also a Web service and is described by WSDL. Certain extensions to WSDL are proposed to allow Grid-inspired properties to be described, and these may be adopted for wider use in forthcoming standards. This extended version of WSDL is called the Grid Services Description Language (GSDL). To be a Grid service, a component must implement certain portTypes, must comply with certain lifetime-management requirements and must be uniquely identifiable by a Grid Service Handle (GSH) throughout its lifetime. The lifetime management includes a soft-state model to limit commitments, to avoid permanent resource loss when partial failures occur and to guarantee autonomy. In addition, evolution of interfaces and function is supported via the Grid Service Reference (GSR). This is obtainable via a mapping from a GSH, and it has a time to live, so that contracts that use it must be renewed. These properties are important for supporting long-running, scalable distributed systems. A Grid service may present some of its properties via SDE. These SDE may be static or dynamic. Those that are static are invariant for the lifetime of the Grid service they describe, and so may also be available via an encoding in an extension of WSDL in the GSR. Those that are dynamic present aspects of a Grid service's state.
The SDE may be used for introspection, for example, by tools that generate glue code, and for monitoring to support functions such as performance and progress analysis, fault diagnosis and accounting. The SDE are described by XML Schema and may be queried by a simple tag, by a tag-value pair model or by more advanced query languages. The values need not be stored as XML but may be synthesised on demand. An event notification (publish and subscribe) mechanism is supported. This is associated with the SDE, so that the query languages may be used to specify interest. The functions supported through the mandatory portTypes include authentication and registration/discovery.
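The lifetime and service-data machinery just described can be sketched in miniature. The following Python toy model (all names invented; it illustrates the patterns, not any actual Globus or OGSA API) shows a soft-state handle map, whose handle-to-reference bindings lapse unless renewed, and a service-data container whose dynamic elements are synthesised on demand and support publish/subscribe notification by element name.

```python
from typing import Callable

class HandleMap:
    """Soft-state handle -> reference bindings: each binding carries a
    time to live and must be renewed, so stale bindings expire on their
    own after partial failures instead of leaking resources forever."""
    def __init__(self):
        self._entries = {}  # handle -> (reference, expiry time)

    def register(self, handle: str, reference: str, ttl: float, now: float):
        # Renewal is simply re-registration before the lease lapses.
        self._entries[handle] = (reference, now + ttl)

    def resolve(self, handle: str, now: float):
        entry = self._entries.get(handle)
        if entry is None or now >= entry[1]:
            self._entries.pop(handle, None)  # lapsed: forget the binding
            return None
        return entry[0]

class ServiceData:
    """Service data elements: static values fixed for the service's
    lifetime, dynamic values synthesised on demand, and subscribers
    notified of current values by element name."""
    def __init__(self):
        self._static = {}
        self._dynamic = {}      # name -> zero-argument supplier
        self._subscribers = {}  # name -> list of callbacks

    def set_static(self, name: str, value):
        self._static[name] = value

    def set_dynamic(self, name: str, supplier: Callable[[], object]):
        self._dynamic[name] = supplier

    def query(self, name: str):
        if name in self._static:
            return self._static[name]
        return self._dynamic[name]()  # synthesised on demand, not stored

    def subscribe(self, name: str, callback: Callable[[object], None]):
        self._subscribers.setdefault(name, []).append(callback)

    def notify(self, name: str):
        value = self.query(name)
        for callback in self._subscribers.get(name, []):
            callback(value)
```

The explicit `now` parameter stands in for a real clock; a production implementation would also need the query languages and wire protocols that the text describes, which this sketch deliberately omits.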

4. CASE FOR OGSA


The authors of OGSA [2] expect the first implementation, Globus Toolkit 3 (GT3), to faithfully reproduce the semantics and the APIs of the current GT2, in order to minimise the perturbation of current projects. However, the influence of the thinking and the industrial momentum behind WS, and the need to achieve regularities that can be exploited by tools, will surely provoke profound changes in future Grid implementations. Indeed, OGSA is perceived as a good opportunity to restructure and re-engineer the Globus foundation technology. This will almost certainly be beneficial, but it will also surely engender semantically significant changes.

Therefore, because of the investment in existing Grid technology (e.g. GT2) by many application projects, the case for a major change, as is envisaged with OGSA, has to be compelling. The arguments for adopting OGSA as the direction in which to focus the development of future Grid technology concern three factors: politics, commerce and technology.

The political case for OGSA is that it brings together the efforts of the e-Science pioneers and the major software companies. This is essential for achieving widely accepted standards and the investment to build and sustain high-quality, dependable Grid infrastructure. Only with the backing of major companies will we meet the challenges of installing widespread support in the network and operating-system infrastructures, developing acceptance of general mechanisms for interconnection across boundaries between different authorities, and obtaining interworking agreements between nations permitting the exchange of significant data via the Grid. The companies will expect from the e-Science community a contribution to the political effort, particularly through compelling demonstrations.

The commercial case is the route to a sustainable Grid infrastructure and adequate Grid programming tools, both of which are missing at present because the e-Science community's resources are puny compared to the demands of building and sustaining comprehensive infrastructure and tool sets. If convergence can be achieved between the technology used in commercial applications for distributed software integration and that used for scientific applications, then a common integration platform can be jointly constructed and jointly maintained. As commerce is ineluctably much larger than the science base alone, this amortises those costs over a much larger community. Commerce depends on rapid deployment and the efficient use of many application developers who are rarely experts in distributed systems.
Yet it also depends on a growing number of ever more sophisticated distributed systems. It therefore has strong incentives to build tool sets and encapsulated services that would also benefit scientists if we share infrastructure, as we do today for computers, operating systems, compilers and network Internet protocol (IP) stacks. A further commercial advantage emerges from the proposed convergence: it will be easier to transfer e-Science techniques rapidly to commerce and industry. Using a common platform, companies will have less novel technology to learn about, and therefore lower assimilation costs and risks, when they take up the products of e-Science research.

The technological case for OGSA is largely concerned with software-engineering issues. The present set of components provided by the Grid has little structure to guide application developers. This lack of explicit structure may also increase the costs of maintaining and extending the existing Grid infrastructure. The discipline of defining Grid services in terms of a language (GSDL) and of imposing a set of common requirements on each Grid service should significantly improve the ease and accuracy with which components can be composed. Those same disciplines will help Grid service developers to think about relevant issues and to deliver dependable components. We expect significant families of GS that adopt additional constraints on their definition and address a particular domain. Such families will have improved compositional properties, and tools that exploit these will be a natural adjunct. Dynamic binding and rebinding with soft state are necessary for large-scale, long-running systems that are also flexible and evolvable.
The common infrastructure and disciplines will be an appropriate foundation from which to develop tools, subsystems and portals to facilitate e-Science application development, taking advantage of the richer information available from the metadata describing Grid services, advances in the precision and detail of the infrastructure and the disciplines to yield dependable, predictable and trustworthy services.

5. THE CHALLENGE OF OGSA


To deliver the potential of OGSA, many challenges have to be met. Sustaining the effort to achieve widely adopted standards that deliver the convergence of the WS and the Grid, and rallying the resources to build high-quality implementations, are obvious international challenges. Here we focus on more technical issues.

1. The types commonly needed for e-Science applications and for database integration need to be defined as XML Schema namespaces. If this is not done, different e-Science application groups will develop their own standards and a Babel of types will result.

2. The precision required in GSDL definitions will need to be specified so that it is sufficient to support the planned activities. A succession of improved standards should be investigated, so that progressively more of the assembly of Grid services can be automated, or at least automatically verified.
3. Standard platforms on which Grid services and their clients are developed.
4. The infrastructure for problem diagnosis and response (e.g. detecting, reporting, localising and recovering from partial failures) has to be defined.
5. The infrastructure for accounting within an assembly of Grid services.
6. The infrastructure for management and evolution. This would deliver facilities for limiting and controlling the behaviour of Grid services, and facilities for dynamically replacing Grid service instances in an extensively distributed and continuously operating system.
7. Coordination services in which some Grid services can participate.
8. Definitions and compliance-testing mechanisms, so that it is possible to establish and monitor quality and completeness standards for Grid services.
9. Programming models that establish and support good programming practice for this scale of integration.
10. Support for Grid service instance migration, to permit operational, organisational and administrative changes.
11. Support for intermittently connected mobile Grid services, to enable the use of mobile computing resources by e-Scientists.

These issues already exist as challenges for the classical Grid architecture. In some cases, Global Grid Forum (GGF) working groups are already considering them. OGSA provides an improved framework in which they may be addressed. For example, interfaces can be defined using GSDL to delimit a working group's area of activity more precisely.
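Challenge 1 above amounts to publishing agreed XML Schema namespaces that projects import rather than redefining their own incompatible types. A hypothetical fragment (namespace URI and type names invented for illustration) might look like this:

```xml
<!-- A hypothetical shared-types schema: projects that import this
     namespace share one DnaSequence type instead of inventing their own. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/escience/types"
            xmlns:est="http://example.org/escience/types"
            elementFormDefault="qualified">
  <xsd:complexType name="DnaSequence">
    <xsd:sequence>
      <xsd:element name="accession" type="xsd:string"/>
      <xsd:element name="bases" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="sequence" type="est:DnaSequence"/>
</xsd:schema>
```

Any GSDL interface could then declare messages of type `est:DnaSequence`, making services from different projects composable without type translation.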

PLANNING THE UNITED KINGDOMS OGSA CONTRIBUTIONS


The UK e-Science Core Programme should coordinate middleware development to align with, influence and develop the OGSA. This will inevitably be a dynamic process; that is, an initial plan will need to be monitored and modified in response to contributions by other countries and by companies. The UK Grid middleware community must work closely with pilot projects to explore the potential of OGSA, to conduct evaluations and to share implementation effort. Our initial plan proposed work on a number of sub-themes.

Phase I: Current actions to position the UK e-Science Core Programme middleware effort.
1. Understanding, validating and refining OGSA concepts and technical design.
2. Establishing a common context, types and a baseline set of Grid services.
3. Defining and prototyping baseline database access technology [21].
4. Initiating a Grid service validation and testing process.
5. Establishing baseline logging to underpin accounting functions.

Phase II: Advanced development and research.
1. Refining GSDL, for example specifying semantics, and developing tools that use it.
2. Pioneering higher-level data integration services en route to the semantic Grid.
3. Pioneering database integration technology.
4. Developing advanced forms of Grid economies.

The work in Phases I and II must take into account a variety of engineering and design issues that are necessary to achieve affordable, viable, maintainable and trustworthy services. These include performance engineering, dependability engineering, engineering for change, manageability and operations support, and privacy, ethical and legal issues. The following sections of the chapter expand some of the topics in this plan.

ESTABLISHING COMMON INFRASTRUCTURE


There are three parts to this: the common Grid services' computational context, the standard set of e-Science and Grid service types, and the minimal set of Grid service primitives. Developers building new Grid services or applications need to know which operations are always supported by a hosting environment. As code portability can only hold within a single hosting environment, these operations may be specific to that hosting environment. For example, the operations for developers working within a J2EE hosting environment need not be the same as those for developers using C. However, there will be a pervasive baseline functionality, which will have various syntactic forms. The physiology chapter (see Chapter 8) describes it thus: implementation of Grid services can be facilitated by specifying baseline characteristics that all hosting environments must possess, defining the internal interface from the service implementation to the global Grid environment. These characteristics would then be rendered into different implementation technologies (e.g. J2EE or shared libraries). Whilst traditional Grid implementations have mapped such requirements directly to the native operating system or to library code, GS are likely to be significantly influenced by the platforms used in Web service hosting.

1 Grid services hosting environment

We are familiar with the power of well-defined computational contexts, for example, that defined by the Java Virtual Machine [22] and that defined for Enterprise Java Beans [23]. Designing such a context for GS requires a balance between the following:

- parsimony of facilities, to minimise transition and learning costs;
- complete functionality, to provide rich resources to developers.

A common issue is a standard representation of a computation's history, sometimes called its context. This context contains information that must be passed from one service to the next as invocations occur. Examples include the subject on behalf of whom the computation is being conducted (needed for authorisation), the transaction identifier if this computation is part of a transaction, and so on. An example of such a context is MessageContext in Apache Axis.
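The idea of a context travelling with each invocation can be sketched in a few lines of Python. The field names (subject, transaction identifier) follow the examples in the text, but the structure itself is an assumption for illustration, not an OGSA-defined API.

```python
# A minimal sketch of a computation's "context" passed from one Grid
# service invocation to the next, in the spirit of Axis's MessageContext.

def make_context(subject, transaction_id=None):
    return {"subject": subject, "transaction_id": transaction_id, "hops": []}

def invoke(service_name, handler, context, payload):
    # Each invocation records itself and forwards the same context,
    # so authorisation (subject) and transaction scope travel with it.
    context["hops"].append(service_name)
    return handler(context, payload)

def storage_service(ctx, data):
    assert ctx["subject"] is not None  # the subject is needed for authorisation
    return f"stored {data} for {ctx['subject']}"

ctx = make_context(subject="cn=Ann Analyst", transaction_id="txn-42")
result = invoke("StorageService", storage_service, ctx, "dataset-7")
```

Because the same dictionary is threaded through every hop, a downstream service can make authorisation decisions and join a transaction without any out-of-band negotiation.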
2 Standard types

The SOAP definition [24] defines a set of types, including primitive types and their recursive composition as structures and arrays, based on the namespaces and notations of the XML Schema definitions [25, 26]. The applications and libraries of e-Science use many standard types, such as complex numbers, diagonalised matrices, triangular matrices and so on. There will be a significant gain if widely used types are defined and named early, so that the same e-Science-oriented namespaces can be used in many exchange protocols, port definitions, components and services. The advantages include:

1. simplification of interworking between components that adopt these standards,
2. better amortisation of the cost of type design,
3. early validation of the use of WSDL for these aspects of e-Science, and
4. simplification of the task of providing efficient mappings for serialising and deserialising these structures, by avoiding multiple versions.

Many communities, such as astronomers, protein crystallographers, bioinformaticians and so on, are developing standards for their own domains of communication and for curated data collections. The e-Science core programme can build on and facilitate this process by developing component types that can be reused across domains. As higher-level types are standardised and reused, it becomes easier to move activities towards the information and knowledge layers to which the semantic Grid aspires.
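As a small illustration of the serialisation point, the sketch below round-trips a complex number through an XML representation under a shared namespace. The namespace URI and element names are invented for the example; an agreed e-Science namespace is exactly what the text argues is still missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical shared namespace for e-Science types (illustrative only).
ESCI = "http://example.org/escience/types"

def complex_to_xml(z):
    # Serialise a Python complex number under the shared namespace.
    elem = ET.Element(f"{{{ESCI}}}complex")
    ET.SubElement(elem, f"{{{ESCI}}}real").text = repr(z.real)
    ET.SubElement(elem, f"{{{ESCI}}}imag").text = repr(z.imag)
    return elem

def complex_from_xml(elem):
    # Deserialise: both sides agree on the namespace, so one mapping suffices.
    real = float(elem.find(f"{{{ESCI}}}real").text)
    imag = float(elem.find(f"{{{ESCI}}}imag").text)
    return complex(real, imag)

z = 1.5 + 2.0j
round_trip = complex_from_xml(complex_to_xml(z))
```

With one agreed namespace, every service needs only one serialiser/deserialiser pair for the type, which is the "avoiding multiple versions" gain listed above.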

8. BASELINE DATABASE ACCESS


This is primarily the responsibility of the UK Database Task Force [12] and now the GGF Database Access and Integration Services (DAIS) working group [21]. A Centre Project, OGSA-DAI, based on a consortium of EPCC, IBM, NEeSC, NeSC, NWeSC and Oracle, is developing a set of components to serve this function and to contribute to the standards. This is an illustration of using OGSA, and so it is presented as an example. The suite of middleware envisaged is complementary to that produced by data management projects, such as the European Data Grid (see Chapter 15), and by distributed file and replication management, such as GIGGLE [27]. Their primary concern is to manage reliable storage, distributed naming, replication and movement of large collections of data that are under their management. They operate largely without concern for the structure and interpretation of the data within their containers (mostly files and collections of files). Database Access and Integration (DAI) components permit access to data that is usually stored in standard Database Management Systems (DBMS) and often managed autonomously, and provide database operations, such as query and update, for the data held in these databases. The challenges include establishing connections, establishing authority and handling the variety of forms of data. At present, DAI aspires to handle distributed database operations applied to collections of data in relational databases, XML collections and files whose structure is adequately described. A current description may be found in Reference [21] and in other papers prepared for GGF5 (http://www.cs.man.ac.uk/grid-db/).

An overview of the OGSA-DAI architecture

DAI component categories: There are four categories of component. Each may have a variety of detailed forms. Their function and range of forms are introduced here.

Grid database services (GDS): These are either a structured data store or (more often) a proxy for a structured data store, such as an instance of a DBMS. In either case, they provide access to the stored data through a standard portType. In later versions of DAI they will also represent federations. To accommodate the variety of data models, query languages, data description languages and proprietary languages, this API includes explicit identification of the language that is used to specify each (database) operation that is to be applied. This flexibility also allows for GDS that reveal proprietary operations of specific DBMS, allows batches of operations to be optimised and processed by interpreting a Grid Job Control Language, and provides for extension to future data models. Developers may restrict themselves to standard, widely supported languages, such as SQL92 and XPath, in order to achieve platform independence.

Grid database service factories (GDSF): These may be associated with an instance of a DBMS, or they may be associated with a particular DBMS type or data model. In the former case, the GDS that are generated by this factory will act as proxies for the underlying database. In the latter case, the factory will produce data storage systems managed according to the associated DBMS type, for example Oracle 9i, or according to the associated model, for example XML.
The result may then be a GDSF, which produces proxies to that storage system, or a GDS that allows direct access to that storage system. The API will again provide control and flexibility by explicitly defining the language used.

Grid data transport vehicles (GDTV): These provide an abstraction over bulk data transmission systems, such as GridFTP, MPICH-G, Unix pipes and so on. They provide two APIs, one for a data producer and one for a data consumer. These may then be kept invariant, while performance, synchronisation and reliability properties may be adjusted. For example, the data may be delivered immediately or stored for later collection, it may be transferred in bulk or as a stream, it may be sent to a third party within a specified time, and so on. A GDTV may be used to supply data to an operation, for example to a distributed join or to a bulk load, or to transfer data from an operation, for example as a result set. Some instances of GDTV may not be services, in order to avoid redundant data movement; these will be bindings to libraries supporting the APIs, hence "vehicle" rather than "service". Others, for example those that store results for later collection, will be services.

Grid data service registries (GDSR): These allow GDS and GDSF to register, and then allow client code to query the GDSR to find data sources or GDS/GDSF that match their requirements according to data content, operations, resources and so forth. They will be a special form of registry in the OGSA sense, providing support for data content and structure searches. These depend on a metadata infrastructure.

Grid data metadata: These metadata may be distinct from metadata used by applications and from metadata in database schemas, though in some cases they may be derived from these. The metadata will include the following aspects of DAI components:

- the types of data model supported, for example relational, XML and so forth;
- the languages supported by a GDS, for example SQL92, XPath, XQuery and so forth;
- the operations supported, for example query, bulk load, insert, update, delete, schema edit and so forth;
- the Grid data transport vehicles supported;
- the data content that is accessible;
- resources and restrictions;
- capacity and performance information;
- access policies;
- charging policies.

Example showing execution model

A diagram of an application using OGSA-DAI is shown in Figure 1. It illustrates a scenario in which a sequence of steps takes place involving a client application, five OGSA-DAI components and four GDTV. Control messages

are not shown. The notation uses yellow ellipses for OGSA-DAI components and a blue rectangle for a client. It uses various forms of dotted open arrow for invocation messages, and solid thick arrows for applications of GDTVs. The scenario presumes that the client wishes to achieve two things using data integrated from three sources managed by GDS:

- obtain data combined from database1 and database3, perhaps to verify some aspect of the composite operation;
- send data combined from database1, database2 and database3 as a stream to a specified third party, for example a data-mining tool (not shown).

Figure 1 Example of an application using OGSA-DAI components.

In the scenario, we envisage the client using OGSA-DAI as follows:

1. It knows the GSH of one of the GDSR and sends to that GDSR a description of the GDS it requires. The description specifies the three databases that must be accessed and the operations (query and bulk/stream transfer) that are required. The GDSR may use a peer-to-peer protocol to refer the request to other GDSR in a large system.
2. The GDSR replies with an indication that the required GDS do not exist. It provides a list of GDSFs (as their GSHs) that can generate the required GDS.
3. The client chooses one of these, perhaps after exercising some dialogues to determine more about the offered GDSF, and constructs a script requesting three GDS. These are described using the same notation as was used in the request to the GDSR, but probably with additional information.
4. The GDSF probably responds with an immediate confirmation that it has understood the script and will make the GDS (this message is not shown). The GDSF then schedules the construction and initialisation of the three GDS, presumably making each an OGSA-DAI-compliant proxy for one of the three databases of interest.
5. When they have all reported that they are initialised (not shown), the GDSF sends a composite message to the client providing their identities as GSH.
6. After a dialogue with each GDS to determine the finer details of its capabilities (not shown), the client sends a script to each GDS indicating the graph of tasks each is required to undertake. The script sent to GDS1 indicates that it should transport a batch of data to a task in the script sent to GDS2, and send a stream of data to a task identified in a script sent to GDS3. The script sent to GDS2 indicates that it should expect a batch of data from a task sent to GDS1 and a stream of data from a task sent to GDS3, and that it is required to run a task that establishes the flow of a stream of data to a specified third party. The script sent to GDS3 contains a task that is required to send a batch of data to the client. For each of these data transfers there is a description allowing an appropriate GDTV to be constructed and/or used.
7. GDS1 uses a GDTV to send the specified batch of data to GDS2, and another GDTV to send a stream of data to GDS3.
8. GDS3 combines the incoming data stream with its own data and uses a GDTV to send the result data as a stream to GDS2.
9. GDS3 uses the incoming data stream, its own data and a GDTV to send a batch of data to the client.
10. GDS2 combines the incoming batch of data, its own data and the incoming stream of data to construct a stream of data for the third party, which is delivered using a GDTV.

This scenario does not illustrate all possible relationships. For example, a GDSR may use a GDS to support its own operation, and a script may require data transport between tasks within the same GDS. The first steps in data integration are distributed query systems in which the schemas are compatible; this has already been prototyped [28]. Subsequent stages require the interposition of data transformations. Tools may be developed to help in the formulation of scripts and transformations, in order to render consistent data from heterogeneous data sources [29].
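The registry/factory/service pattern in the first steps of the scenario can be sketched as plain Python objects. The class names follow the component categories above (GDSR, GDSF, GDS), but every method signature and handle here is illustrative; the real OGSA-DAI interfaces are Grid service portTypes, not Python classes.

```python
# Sketch of the client-side interaction: find a factory in a registry,
# ask it to create a Grid database service, then send that service an
# operation whose language is identified explicitly.

class GDS:
    def __init__(self, database):
        self.database = database

    def perform(self, language, statement):
        # The language is named explicitly, so one portType can carry
        # SQL92, XPath or a proprietary dialect.
        return {"database": self.database, "language": language,
                "statement": statement, "status": "completed"}

class GDSF:
    def create_gds(self, database):
        # A factory generates a GDS acting as a proxy for the database.
        return GDS(database)

class GDSR:
    def __init__(self):
        self._factories = {}

    def register(self, description, factory):
        self._factories[description] = factory

    def find_factories(self, description):
        return [f for d, f in self._factories.items() if d == description]

registry = GDSR()
registry.register("relational", GDSF())

factory = registry.find_factories("relational")[0]   # cf. steps 1-3
gds = factory.create_gds("database1")                # cf. steps 4-5
reply = gds.perform(language="SQL92",                # cf. step 6
                    statement="SELECT id FROM sources")
```

The explicit `language` argument is the key design choice: it lets the same operation carry standard or proprietary query languages without changing the interface.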

9. BASELINE LOGGING INFRASTRUCTURE


The community of e-Scientists uses a large variety of facilities, such as compute resources, storage resources, high-performance networks and curated data collections. Our goal is to establish a culture in UK science where the e-Science techniques that depend on these resources are widely used. To make this possible, accounting mechanisms must be effective. To support resource sharing within virtual organisations and between real organisations, it is essential that resource usage be recorded. The units that need to be recorded will differ between services, but could include:

- the number of bytes crossing various boundaries, such as a gateway on a network, the main memory of a computer or the fibres from disk;
- the number of byte-seconds of storage use for various stores;
- the number of CPU-hours;
- the number of records examined in a database;
- uses of licensed software.

There are clearly many potential units that could be the basis of charging, but for a logging infrastructure to be scalable and usable, a relatively small number must be agreed upon and understood. These units may be accumulated, or the critical recorded unit may be based on peak rates delivered. If resource owners are to contribute their resources, these units also have to reflect, at least approximately, the origins of their costs within their organisation. The OGSA needs to intercept service usage as part of its core architecture and record this information through a logging service. This service should provide reliable mechanisms to distribute the data to other organisations. It is unlikely that a single charging model, or a single basis for charging, will emerge in a diverse community. There therefore needs to be a mechanism in which different charging policies and different clients can meet; we refer to it as a market. The creation of a sustainable, long-term economic model that will attract independent service providers requires:

1. a logging infrastructure that reliably and economically records resource usage,
2. a commonly understood means of describing charges,
3. a mechanism to negotiate charges between the consumer and the provider, and
4. a secure payment mechanism.

A project to develop such a Grid market infrastructure for OGSA has been proposed.
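A minimal sketch of the usage records such a logging service might accumulate is shown below. The unit names (CPU-hours, byte-seconds) follow the examples in the text; the record layout and class name are assumptions for illustration, not an agreed standard.

```python
from collections import defaultdict

class UsageLog:
    """Accumulate resource-usage records per (organisation, service, unit)."""

    def __init__(self):
        self._totals = defaultdict(float)

    def record(self, organisation, service, unit, amount):
        # Each service call is intercepted and its usage accumulated.
        self._totals[(organisation, service, unit)] += amount

    def total(self, organisation, service, unit):
        return self._totals[(organisation, service, unit)]

log = UsageLog()
log.record("org-a", "compute", "cpu_hours", 12.5)
log.record("org-a", "compute", "cpu_hours", 3.5)
log.record("org-a", "storage", "byte_seconds", 1.0e12)

cpu = log.total("org-a", "compute", "cpu_hours")
```

Keeping the set of unit names small and agreed, as the text argues, is what makes totals like these comparable across organisations and usable as a basis for charging.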

10. SUMMARY AND CONCLUSIONS


This chapter has recorded the UK e-Science commitment to the OGSA and explained the rationale for that commitment. It has illustrated the consequences of this commitment by presenting the steps that are necessary to augment the baseline OGSA middleware with other common facilities needed by e-Science projects. The focus on access and integration of structured data, typically held in databases, was motivated by the prevalence of data integration within those projects. The OGSA-DAI project's plans for GS provided an illustration of the ways in which this requirement maps onto OGSA. Many other middleware functions can be extracted and developed as GS. One example is accounting and the infrastructure for a Grid market; this is identified as urgently required by many UK projects and by those who provide computational resources. The OGSA infrastructure and the componentisation of e-Science infrastructure are expected to have substantial long-term benefits. They assist in the dynamic composition of components and make it more likely that tools to support safe composition will be developed. They increase the chances of a significant contribution to the required infrastructure from industry. They improve the potential for meeting challenges such as agreements about interchange across political and organisational boundaries. By providing a description regime, OGSA provides the basis for improved engineering and better partitioning of development tasks. However, it is not a panacea. There are still many functionalities required by e-Scientists that are yet to be implemented; only two of them were illustrated above. It depends on WS, which are an excellent foundation for distributed heterogeneous system integration, but these are still the subject of vigorous development of standards and platforms. The outcome is very likely to be beneficial, but the journey to reach it may involve awkward revisions of technical decisions.
As the e-Science community stands to benefit from effective WS and effective GS, it should invest effort in developing applications using these new technologies and use that experience to influence the design and ensure the compatibility of these foundation technologies.

3. Grids and the virtual observatory


Roy Williams

1. The virtual observatory - Data federation 2. What is a Grid? - Virtual Observatory middleware

1. THE VIRTUAL OBSERVATORY


Astronomers have always been early adopters of technology, and information technology has been no exception. There is a vast amount of astronomical data available on the Internet, ranging from spectacular processed images of planets to huge amounts of raw, processed and private data. Much of the data is well documented with citations, instrumental settings and the type of processing that has been applied. In general, astronomical data has few copyright, privacy or other intellectual property restrictions in comparison with other fields of science, although fresh data is generally sequestered for a year or so while the observers have a chance to reap knowledge from it. As anyone with a digital camera can attest, there is a vast requirement for storage. Breakthroughs in telescope, detector and computer technology allow astronomical surveys to produce terabytes of images and catalogs (Figure 38.1). These datasets will cover the sky in different wavebands, from gamma- and X-rays, through optical and infrared, to radio. With the advent of inexpensive storage technologies and the availability of high-speed networks, the concept of multiterabyte on-line databases interoperating seamlessly is no longer outlandish [1, 2]. More and more catalogs will be interlinked, query engines will become more and more sophisticated, and the research results from on-line data will be just as rich as those from real observatories. In addition to the quantity of data increasing exponentially, its heterogeneity, in the form of the number of data publishers, is also rapidly increasing. It is becoming easier and easier to put data on the Web, and every scientist builds the service, the table attributes and the keywords in a slightly different way. Standardizing this diversity without destroying it is as challenging as it is critical. It is also critical that the community recognizes the value of these standards and agrees to spend time on implementing them.

Figure 38.1 The total area of astronomical telescopes in m2, and CCDs measured in gigapixels, over the last 25 years. The number of pixels and the data double every year.

Recognizing these trends and opportunities, the National Academy of Sciences Astronomy and Astrophysics Survey Committee, in its decadal survey [3], recommends as a first priority the establishment of a National Virtual Observatory (NVO), leading to US funding through the NSF. Similar programs have begun in Europe and Britain, as well as other national efforts, now unified by the International Virtual Observatory Alliance (IVOA). The Virtual Observatory (VO) will be a Rosetta Stone linking the archival data sets of space- and ground-based observatories, the catalogs of multiwavelength surveys, and the computational resources necessary to support comparison and cross-correlation among these resources. While this project is mostly about the US effort, the emerging International VO will benefit the entire astronomical community, from students and amateurs to professionals. We hope and expect that the fusion of multiple data sources will also herald a sociological fusion. Astronomers have traditionally specialized by wavelength, based on the instrument with which they observe, rather than by the physical processes actually occurring in the Universe: having data at other wavelengths available through the same tools and the same kinds of services will soften these artificial barriers.

Data federation

Science, like any deductive endeavor, often progresses through federation of information: bringing information from different sources into the same frame of reference. The police detective investigating a crime might see a set of suspects with the motive to commit the crime, another group with the opportunity, and another group with the means. By federating this information, the detective realizes there is only one suspect in all three groups; this federation of information has produced knowledge. In astronomy, there is great interest in objects between large planets and small stars, the so-called brown dwarfs. These very cool stars can be found because they are visible at infrared wavelengths, but not at optical wavelengths. A search can be done by federating an infrared and an optical catalog, asking for sources in the former but not in the latter. The objective of the Virtual Observatory is to enable the federation of much of the digital astronomical data. A major component of the program concerns efficient processing of large amounts of data, and we shall discuss projects that need Grid computing, first projects that use images and then projects that use databases. Another big part of the Virtual Observatory concerns standardization and translation of data resources that have been built by many different people in many different ways. Part of the work is to build enough metadata structure so that data and computing resources can be automatically connected in a scientifically valid fashion. The major challenge with this approach, as with any standards effort, is to encourage adoption of the standard in the community. We can then hope that those in control of data resources will find it within themselves to expose their data to close scrutiny, including all its errors and inconsistencies.
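The brown-dwarf search described above is, at its core, an anti-join: keep the infrared sources that have no optical counterpart. A real search matches sources by sky position within a tolerance; the sketch below simplifies this to shared identifiers, which is an assumption made purely for illustration.

```python
# Federate an infrared and an optical catalogue: candidate brown dwarfs
# are sources present in the infrared catalogue but not the optical one.
infrared_catalog = {"src-1", "src-2", "src-3", "src-4"}
optical_catalog = {"src-1", "src-3"}

candidates = sorted(infrared_catalog - optical_catalog)
```

The set difference is the entire federation step here; the hard part in practice, as the text notes, is getting the two catalogues into the same frame of reference in the first place.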

2. WHAT IS A GRID?
People often talk about the Grid as if there were only one, but in fact Grid is a concept. In this paper, we shall think of a Grid in terms of the following criteria:

Powerful resources: There are many websites where clients can ask for computing to be done or for customized data to be fetched, but a true Grid offers resources sufficiently powerful that their owner does not want arbitrary access from the public Internet. Supercomputer centers will become delocalized, just as digital libraries already are.

Federated computing: The Grid concept carries the idea of geographical distribution of computing and data resources. Perhaps a more important kind of distribution is human: the resources in the Grid are managed and owned by different organizations, which have agreed to federate themselves for mutual benefit. Indeed, the challenge resembles the famous federation of states that is the United States.

Security structure: The essential ingredient that glues a Grid together is security. A federation of powerful resources requires a superstructure of control and trust that limits uncontrolled, public use but puts no barriers in the way of valid users.

In the Virtual Observatory context, the most important Grid resources are data collections rather than processing engines. The Grid allows federation of collections without worry about differences in storage systems, security environments or access mechanisms. There may be directory services to find datasets more effectively than the Internet search engines that work best on free text. There may be replication services that find the nearest copy of a given dataset. Processing and computing resources can be used through allocation services based on the batch queue model, on scheduling multiple resources for a given time, or on finding otherwise idle resources.
Virtual Observatory middleware

The architecture is based on the idea of services: Internet-accessible information resources with well-defined requests and consequent responses. There are already a large number of astronomical information services, but in general each is hand-made, with arbitrary request and response formats and little formal directory structure. Most current services are designed with the idea that a human, not a computer, is the client, so that output comes back as HTML or an idiosyncratic text format. Furthermore, services are not designed with scaling to gigabyte or terabyte result sets in mind, and consequently lack the authentication mechanisms that become necessary when the resources consumed are significant. To solve the scalability problem, we are borrowing heavily from the progress made by information technologists in the Grid world, using GSI authentication [4], the Storage Resource Broker [5] and GridFTP [6] for moving large datasets. In Sections 38.3 and 38.4, we discuss some applications in astronomy of this kind of powerful distributed computing framework, first for image computing, then for database computing. In Section 38.5, we discuss approaches to the semantic challenge of linking heterogeneous resources.
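The contrast between human-oriented and machine-oriented services can be made concrete with a tiny sketch: instead of returning HTML, a service returns a small XML table that a program can parse directly. The element names below are invented for the example (real VO table formats are much richer), and the coordinate values are arbitrary.

```python
import xml.etree.ElementTree as ET

# Sketch of a service response designed for a machine, not a human:
# a minimal XML table of sky positions instead of an HTML page.
def table_response(rows):
    root = ET.Element("TABLE")
    for ra, dec in rows:
        row = ET.SubElement(root, "ROW")
        ET.SubElement(row, "RA").text = str(ra)
        ET.SubElement(row, "DEC").text = str(dec)
    return ET.tostring(root, encoding="unicode")

xml_text = table_response([(10.68, 41.27), (83.82, -5.39)])

# A client program can consume the response without scraping HTML.
parsed = ET.fromstring(xml_text)
```

Because the response has a well-defined structure, another service can be its client, which is what makes federation of services, rather than just of web pages, possible.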

Figures 1.20 and 1.21 illustrate two applications of interest to NASA. In the first, we depict key aspects of the design of a complete aircraft: airframe, wing, stabilizer, engine, landing gear and human factors. Each part could be the responsibility of a distinct, possibly geographically distributed, engineering team whose work is integrated by a Grid realizing the concept of concurrent engineering. Figure 1.21 depicts a possible Grid controlling satellites and the data streaming from them. Shown are a set of Web (OGSA) services for satellite control, data acquisition, analysis, visualization and linkage (assimilation) with simulations, as well as two of the Web services broken up into multiple constituent services. Key standards for such a Grid are addressed by the new Space Link Extension international standard [109], in which part of the challenge is to merge a pre-Grid architecture with the still-evolving Grid approach.

Figure 1.20 A Grid for aerospace engineering showing linkage of geographically separated subsystems needed by an aircraft.

Figure 1.21 A possible Grid for satellite operation showing both spacecraft operation and data analysis. The system is built from Web services (WS), and we show how data analysis and simulation services are composed from smaller WSs.

The e-Science concept illustrates the changes that information technology is bringing to the methodology of scientific research [114]. e-Science is a relatively new term that has become particularly popular after the launch of the major United Kingdom initiative described in Section 1.4.3. It captures a new approach to science involving distributed global collaborations enabled by the Internet and using very large data collections, terascale computing resources and high-performance visualizations. e-Science is about global collaboration in key areas of science, and the next generation of infrastructure, namely the Grid, that will enable it. Figure 1.28 summarizes the e-Scientific method. Simplistically, we can characterize the last decade as focusing on simulation and its integration with science and engineering: this is computational science. e-Science builds on this, adding data from all sources, with the information technology needed to analyze the data and assimilate it into the simulations.

Figure 1.28 Computational science and information technology merge in e-Science.

An example of a data-oriented application is Distributed Aircraft Maintenance Environment (DAME) [111], illustrated in Figure 1.23. DAME is an industrial application being developed in the United Kingdom in which Grid technology is used to handle the gigabytes of in-flight data gathered by operational aircraft engines and to integrate maintenance, manufacturer and analysis centers.

Figure 1.23 DAME Grid to manage data from aircraft engine sensors.

In Europe, there are also interesting Grid engineering applications being investigated. For example, the UK Grid Enabled Optimization and Design Search for Engineering (GEODISE) project [110] is looking at providing an engineering design knowledge repository for design in the aerospace area. Rolls-Royce and BAE Systems are industrial collaborators. Figure 1.22 shows the GEODISE engineering design Grid, which will address in particular the "repeat engagement" challenge, in which one wishes to build a semantic Grid (Chapter 17) to capture the knowledge of experienced designers. This is of course a research challenge, and its success would open up many similar applications.

Figure 1.22 GEODISE aircraft engineering design Grid.
