
originalarbeiten

Elektrotechnik & Informationstechnik (2006) 123/6: 251–258. DOI 10.1007/s00502-006-0344-0

The Grid: vision, technology development and applications


P. Brezany, A. Woehrer, A. M. Tjoa
Grid computing is a set of concepts for a computing infrastructure that can utilize distributed computational, storage, network, and other resources to solve problems as easily as plugging into the electric power grid. In this paper, we describe the evolution of Grid systems and the supporting development tools, and we characterize the main Grid application areas. In particular, we discuss our contributions to the solution of Grid research challenges. First, we describe the functionality and strengths of an infrastructure supporting the development of knowledge discovery applications, called the GridMiner, and then our proposal for a future interconnection environment, called the Wisdom Grid, which includes the GridMiner functionality.

Keywords: grid computing; data mining; workflow; virtualization; OLAP; Online Analytical Processing

The Grid: vision, technological development and applications. Grid computing is a concept for an information-technology infrastructure that uses distributed computing power, storage, network capacity and other resources to solve complex problems in the simplest possible way, similar to tapping into the electric power grid. In this article the authors describe the evolution of Grid systems together with their development tools and their main application areas. In particular, they discuss their own contributions to the stated challenges. First, the functionality and strengths of a framework for the development of knowledge discovery applications, called GridMiner, are described. Second, a system for future developments building on it, the Wisdom Grid, is presented.

Keywords: distributed computing; knowledge discovery; workflow; virtualization; OLAP

Received April 10, 2006, accepted April 18, 2006. © Springer-Verlag 2006

1. Introduction

The Grid, built on the Internet and the World Wide Web, is a novel 21st century computing, data, information, and knowledge management infrastructure, which is transforming science, industry, business, health, and society. The Grid concepts and technologies were first expressed and defined by Foster and Kesselman in 1998, in their book "The Grid: Blueprint for a New Computing Infrastructure" (Foster, Kesselman, 1998). This work reflected the earlier roots of the Grid, that of interconnecting high-performance facilities at various US laboratories and universities. In 2001, Foster, Kesselman and Tuecke refined the definition of a Grid to "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" (Foster, Kesselman, Tuecke, 2001). In this context, resources can represent PCs, workstations and supercomputers, storage systems, data sources, databases, libraries, computational kernels, special-purpose scientific instruments, etc. This definition is the one most commonly used today to abstractly define a Grid. Generally, it can be said that the Grid has evolved from a carefully configured infrastructure, supporting a limited number of grand challenge applications executing on high-performance hardware between a number of supercomputing centers, to what we are aiming at today, which can be seen as a seamless and dynamic virtual environment. During this evolution, many software development toolkits, frameworks, and programming environments supporting the realization of Grid applications were developed. Moreover, significant attention has been paid to data-oriented applications. Just as current PCs and notebooks are labeled "intel inside", we can add the attribute "large data sets inside" to a large group of modern Grid applications. As described in (Hey, Trefethen, 2003), data is emerging as the killer application of the Grid.

In this paper, we describe the evolution of Grid systems in Sect. 2. Then we discuss software development kits supporting Grid application realization (Sect. 3). One of them, the GridMiner system developed by our research group, which supports the development of Grid analytics applications, is described in Sect. 4. Nowadays, it is impossible to list and describe all past and current Grid applications; therefore, Sect. 5 only characterizes the current trends in Grid application development. It is assumed that future Grid developments will be strongly based on the achievements in Artificial Intelligence and Semantic Web technologies; our response to this vision, a Wisdom Grid proposal, is discussed in Sect. 6. We briefly conclude in Sect. 7.

2. The evolutionary phases of the Grid

Since their birth, Grid technologies have traversed different phases or generations. According to the nature of the distributed resources it manages, the Grid has been evolving from the Computational Grid (Foster, Kesselman, Tuecke, 2001) (concerning, e.g., job scheduling,
Brezany, Peter, Ao. Univ.-Prof. Dr., Woehrer, Alexander, Mag., Institute of Scientific Computing, University of Vienna, Nordbergstraße 15/C, 1090 Vienna, Austria (E-Mail: {brezany, woehrer}@par.univie.ac.at); Tjoa, A Min, Univ.-Prof. Dr., Institute of Software Technology and Interactive Systems, Vienna University of Technology, Favoritenstraße 9, 1040 Vienna, Austria (E-Mail: tjoa@ifs.tuwien.ac.at)

The work described in this paper is being carried out as part of the research project "Advanced Analysis on Computational Grids" supported by the Austrian Research Foundation, and was also funded by the project "Austrian Grid" supported by the Austrian Federal Ministry for Education, Science and Culture.

Juni 2006 | 123. Jahrgang

heft 6.2006

251


system information services, life cycle management) to the Data Grid (Chervenak et al., 2001) (concerning, e.g., distributed data access, metadata management, data replication), and recently to semantic and knowledge-oriented Grids; these terms denote several development directions, e.g., the Data Mining Grid (Discovery Net, http://www.discovery-on-the.net/; the DataMiningGrid project, http://www.datamininggrid.org/; Cannataro, Talia, 2003; Tjoa et al., 2005) (concerning discovery of descriptive and predictive knowledge patterns in Grid data sets), the Semantic Grid (Goble, De Roure, http://www.semanticgrid.org) (it attempts to incorporate the advantages of the Grid, the Semantic Web, and Web Services to extend the Grid with semantics and enhance the Semantic Web's computing power), and the Knowledge Grid (Zhuge, 2004) (a vision of an intelligent and sustainable interconnection environment that enables people and machines to effectively capture, publish, share and manage knowledge resources). In our visionary paper (Brezany et al., 2004), we proposed the architecture of a future Grid infrastructure called the Wisdom Grid, which integrates all the Grid technologies mentioned above into an advanced complex problem-solving environment. These phases are discussed in more detail below².

2.1 The Computational Grid

As already mentioned in the introduction, the early Grid efforts (the early to mid 1990s) started as projects to link supercomputing sites; at this time this approach was known as metacomputing (Smarr, Catlett, 1992). The objective was to provide computational resources to a range of high-performance applications. Essentially all major Grid projects are currently built on protocols and services provided by the Globus Toolkit (http://www.globus.org), which enables applications to handle distributed heterogeneous computing resources as a single virtual machine. It provides the interoperability that is essential to achieve large-scale computation.
The evolution of this toolkit is discussed in Sect. 3. The Globus internal structure supports a layered Grid architecture, which is depicted in Fig. 1. This architecture reflects the very early visions of the Grid expressed by (Foster, Kesselman, 1998). The Grid Fabric layer provides the resources to which shared access is mediated by Grid protocols. The Connectivity layer defines the core communication, authentication and authorization protocols required for Grid-specific transactions. The Resource layer defines protocols for the secure negotiation, initiation, monitoring, control, accounting, and payment of sharing operations on individual resources. The Collective layer contains protocols and services that are not associated with any one specific resource but rather are global in nature and capture interactions across collections of resources. Applications are constructed in terms of, and by calling upon, services defined at any layer.

Fig. 1. The layered Grid architecture (Foster, Kesselman, Tuecke, 2001)

2.2 The Data Grid

Whereas a Computational Grid can be considered as a natural extension of a cluster computer system, where large computing tasks are performed at distributed computing resources, a Data Grid, which is an extension of a Computational Grid, deals with the controlled sharing and efficient management, placement and replication of large amounts of data. Figure 2 outlines the main features of the Data Grid architecture presented in (Chervenak et al., 2001). Among the core Data Grid services, data access and metadata services are viewed as fundamental. The data access service (not explicitly expressed in the figure) provides mechanisms for accessing, managing, and initiating transfers of data stored in storage systems. The metadata access service provides mechanisms for accessing and managing information about data stored in storage systems. A potentially unlimited number of components can exist in the upper layer of the Data Grid architecture. There are two representative components: replica management and replica selection. The role of a Replica Manager is to create (or delete) copies of file instances, or replicas, within specified storage systems. Often, a replica is created because the new storage location offers better performance or availability for accesses to or from a particular location. Replica selection is the process of choosing a replica that will provide an application with the data access characteristics that optimize desired performance criteria. On top of the Computational and Data Grids, it is possible to build information- and knowledge-oriented Grids and other application Grids, as discussed in the subsequent sections.

2.3 Information- and knowledge-oriented Grids

To convert the huge amounts of low-level, heterogeneous Data Grid data into information and powerful knowledge, additional application Grid layers have to be built on top of the Data Grid. In Fig. 3, we provide a layered view of such a system, as proposed by K. G. Jeffery (Jeffery, 2001). The role of the Information Grid is to integrate heterogeneous information into a homogeneous presentation for the Knowledge Grid, whose task is the extraction of knowledge from the information, which is presented to the human user or another application by an appropriate interface; as we will see later, in recent developments, the Knowledge Grid can involve more advanced services associated with knowledge distribution and management. Components within each layer share common characteristics, but can build on capabilities and behaviors provided by any lower layer.

2.3.1 Grid data integration

Modern science, business, industry, and society increasingly rely on global collaborations, which are very often based on large-scale linking of databases that were not expected to be used together when they were originally developed. We can make decisions and discoveries with data collected within our own business or research, but we improve decisions and increase the chance and scope of discoveries when we combine information from multiple sources. Correlations and patterns in the combined data can then support new hypotheses, which can be tested and turned into useful knowledge. This can be a crucial factor in generating revenues, cutting costs, achieving scientific discoveries, optimally treating patients, etc. To harness this unprecedented wealth of data for the advantage of society, a radically new technology for data integration is strongly needed. These challenges drive research efforts in many Grid projects. The aim is on-demand data integration, which ensures always up-to-date data and makes a costly, centralized and inflexible data warehouse obsolete. Within the distributed database community, database integration approaches traditionally focus on structural heterogeneity (at the structural level of the schema): different information systems store their data in different structures. However, in many applications,
² De Roure and Baker (De Roure et al., http://citeseer.nj.nec.com/535794.html) present the evolution of the Grid from another point of view.



Fig. 2. Major components and structure of the Data Grid architecture
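To make the replica selection idea from Sect. 2.2 concrete, the following sketch picks, among the known replicas of a file, the copy that minimizes a simple estimated transfer time. The cost model and all names (`Replica`, `select_replica`, the latency/bandwidth figures) are our own illustrative assumptions, not part of any Data Grid implementation:

```python
from dataclasses import dataclass

@dataclass
class Replica:
    location: str           # storage system hosting this copy
    latency_ms: float       # round-trip latency from the client
    bandwidth_mbps: float   # sustained transfer rate to the client

def estimated_transfer_ms(r: Replica, size_mb: float) -> float:
    # crude cost model: startup latency plus transfer time (size / bandwidth)
    return r.latency_ms + (size_mb * 8.0 / r.bandwidth_mbps) * 1000.0

def select_replica(replicas: list[Replica], size_mb: float) -> Replica:
    # replica selection: choose the copy minimizing the estimated access cost
    return min(replicas, key=lambda r: estimated_transfer_ms(r, size_mb))
```

Note how the preferred replica depends on the request: for a small file the low-latency copy wins, while for a large file the high-bandwidth copy does, which is exactly why selection must happen per access rather than once and for all.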

Fig. 3. The Information and Knowledge Grids

there is additionally a strong demand to solve problems of semantic heterogeneity: information is integrated across sources that have differing terminologies or ontologies. There had already been significant research effort addressing semantic data integration before the advent of the Grid. The typical approach is to build a domain model of the application domain, establishing a vocabulary (ontology) for describing data sets in the domain. Using this language, each available information source is described. Queries to the integration system are posed using the terms from the domain model, and reformulation operators are employed to select an appropriate set of information sources and to determine how to integrate the available information to satisfy a query. Semantic Web languages and special knowledge representation languages (especially in pre-Semantic Web era projects) have been used for building the ontologies, the data source mappings, and the query specifications. The Grid and Grid applications pose, on the one hand, new requirements on semantic data integration and, on the other hand, offer new opportunities to Semantic Web technologies. We believe that there are many challenges of the Grid from the perspective of Semantic Web researchers, and vice versa, with respect to this application domain. For example, one of the pilot applications of our Grid research project addresses the integration of data for the Ecological Datagrid; there are many large data sources provided by 27 organizations spread across Europe, and in the future, new partners bringing new data sets will be included in this long-term project. The data sources include data from different interest domains: biodiversity, waste, air, water, soil, emissions, forests, etc. The data sets can be dynamically included (published) into collaborations (virtual organizations) or withdrawn from them (e.g. for privacy or security reasons).
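The query-reformulation step described above can be sketched in a few lines: each source describes itself by mapping shared ontology terms to its local schema, and a query posed in domain-model terms is rewritten for every source that covers it. All source and attribute names below are hypothetical, chosen only to echo the ecological domains mentioned in the text:

```python
# toy domain model: each source maps shared ontology terms to local schema names
SOURCE_MAPPINGS = {
    "water_db": {"site": "station_id", "nitrate": "no3_mg_l"},
    "soil_db":  {"site": "plot_code", "ph": "ph_value"},
}

def reformulate(query_terms: list[str]) -> dict[str, list[str]]:
    """Select the sources that cover all query terms and rewrite the terms
    into each selected source's local vocabulary."""
    plans = {}
    for source, mapping in SOURCE_MAPPINGS.items():
        if all(t in mapping for t in query_terms):
            plans[source] = [mapping[t] for t in query_terms]
    return plans
```

A real integration system would additionally handle partial coverage (joining sources that each answer part of a query) and ontology alignment between sources, which this sketch deliberately omits.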
A crucial issue is an effective semantic integration of these data sources. Besides the factors mentioned above, there are other crucial requirements concerning the dynamics and adaptability of the semantic integration concepts; within the global virtual organization, data sources will be dynamically integrated into different problem-oriented virtual sub-organizations performing different data exploration processes, for example, geostatistics, flow analysis, building prediction models, etc. There is a strong need to address the performance aspects of the associated integration processes, because they significantly influence the whole turnaround time of the system and the response times of individual data exploration tasks.

2.3.2 Data mining on the Grid

The effective and efficient management and use of increasing amounts of stored data, and in particular the transformation of these data into information and knowledge, is considered a key requirement in modern information systems. Data mining (also



known as knowledge discovery in databases, KDD) is the technology addressing this information need. However, this field has mainly been developed for largely homogeneous and localized computing environments. These assumptions are increasingly not met in modern scientific and industrial complex-problem-solving environments. Currently there exists no coherent framework for developing and deploying data mining applications on the Grid. Several projects, e.g. DataMiningGrid (http://www.datamininggrid.org), Discovery Net (Curcin et al., 2002) and GridMiner (http://www.gridminer.org), address this gap by developing generic and sector-independent data mining tools and Grid interfaces, allowing data mining tools to operate in a distributed Grid computing environment. Our GridMiner project is described in more detail in Sect. 4.

2.3.3 The Semantic Grid

The Semantic Grid (De Roure, Jennings, Shadbolt, 2003) can be considered as the intersection of Semantic Web, Grid and software agent research. It supports the integration of scientific data and the automatic execution of computations, providing important functionality at the Grid level (semantics in the Grid) and at the scientific applications level (semantics on the Grid). Information, computing resources and services are described in standard ways that can be processed by a computer. This makes it easier for resources to be discovered and joined up automatically, which helps bring resources together to create virtual organizations.

3. Tools support for Grid application development

A Grid infrastructure is a technology that we can take for granted when developing applications on top of it. Tools make use of the services of this infrastructure. They are concerned with resource discovery, data access and management, scheduling of computation, security, and so forth. In the past years several Grid tools and development frameworks were made available. Some of them are discussed below.
Most Grid applications are based on the Globus Toolkit (http://www.globus.org). It is a community-based, open-architecture, open-source set of services and software libraries that support Grids and Grid applications. Its first versions, Globus 1 and 2, provided library-style APIs. Later the Globus research program focused on architecture and middleware development to contribute to the emerging Open Grid Services Architecture (OGSA). This shift happened in response to the increasing popularity of service-oriented architectures in general and Web services in particular. This development also started to foster synergies between the Web service and Grid computing communities. The Globus Toolkit 3 realized these ideas. In the first years, research and development activities in the Grid community generally focused on applications where data was stored in files. However, it was recognized that if the Grid was to support a wider range of applications, both scientific and commercial, then database integration with the Grid would become important. Therefore, a specification for a collection of Grid database services was elaborated. Several reference implementations of the service interfaces, named OGSA-DAI (Atkinson, Baxter, Hong, 2002), have already been provided. They allow access to relational databases, XML databases and comma-separated value files. In January 2004 the new Web Services Resource Framework (WSRF) (Czajkowski et al., 2004) was presented, which represents the latest step in harmonizing Grid and Web services and an important contribution to converging Grid and Web technologies; it means that the Grid and Web communities can move forward on a common base. The concepts of WSRF are implemented by the release of Globus Toolkit 4. The OGSA-DAI group also ported its Grid Data Services to WSRF.
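The central WSRF idea, a stateless service front-end that addresses stateful server-side resources via an endpoint reference, can be illustrated with a small sketch. This is not the OGSA-DAI or Globus API; the classes, method names and the use of a plain dictionary as "database" are purely illustrative assumptions:

```python
import uuid

class DataServiceResource:
    """A stateful WS-Resource: the state lives server-side, addressed by a key."""
    def __init__(self, db: dict):
        self.db = db
        self.properties = {"schema": list(db.keys())}  # queryable resource properties

class DataService:
    """Stateless service front-end in the WSRF style: every call carries an
    endpoint reference (here simply a resource key) that selects the state."""
    def __init__(self):
        self._resources = {}

    def create_resource(self, db: dict) -> str:
        key = str(uuid.uuid4())
        self._resources[key] = DataServiceResource(db)
        return key  # the endpoint reference handed back to the client

    def get_property(self, key: str, name: str):
        return self._resources[key].properties[name]

    def query(self, key: str, table: str):
        return self._resources[key].db[table]
```

The design point this mirrors is the one the text describes: the service itself remains stateless and shareable, while per-client or per-dataset state is factored into addressable resources, which is what allowed Grid services to converge with plain Web services.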

Another relevant development is the German UNICORE project (Erwin, Snelling, 2001), which developed software for seamless, secure and intuitive access to distributed HPC resources before the Grid was invented. Now, the development is fully focused on providing support for Grid-based high-performance scientific computing. In Grid tools research, the term Legion (Grimshaw et al., 2003) denotes both the academic project at the University of Virginia and the commercial product, Avaki, distributed by AVAKI Corp. Legion helps organizations create a Computational Grid, allowing processing power to be shared, as well as a Data Grid, a virtual single set of files that can be accessed without regard to location or platform. gLite (http://www.glite.org) is the next-generation middleware for Grid computing. Born from the collaborative efforts of more than 80 people in 12 centers as part of the EGEE project (http://egeeintranet.web.cern.ch), gLite provides a framework for building Grid applications tapping into the power of distributed computing and storage resources across the Internet. As pointed out in (Hey, Trefethen, 2005), in many real-world applications, researchers are now finding themselves faced with an increasingly difficult burden of both managing and storing vast amounts of data as well as analyzing, combining and mining the data to extract useful information and knowledge. Often this can involve automating the task of annotating the data with relevant metadata as well as constructing complex search engines and workflows that capture complex usage patterns of distributed data and compute resources. Most of these problems, and the tools and techniques to tackle them, are similar across many different types of application. It makes no sense for each community to develop these basic tools in isolation. The GridMiner framework developed by our research group contributes to the solution of these problems. It is introduced in the following section.

4. The GridMiner

The GridMiner project (http://www.gridminer.org) aims to cover all aspects of the knowledge discovery process on the Grid, as illustrated in Fig. 4 on the left, and to integrate them into an advanced service-oriented Grid application; its architecture and its possible association with concrete Grid resource types is shown in Fig. 4 on the right. The innovative architecture provides: (1) a robust and reliable high-performance data mining and OLAP environment; (2) seamless access to intermediate data and results of the discovery process for further reuse in a standardized way; (3) a persistent workspace for continuous and evolving data mining tasks, supported via a flexible GUI; and (4) a framework to include one's own specialized data mining tasks and services in a KDD process. We will describe the various components of the GridMiner supporting the scientist along a typical use case found in many e-science applications. This scenario is taken from the ecological domain, one of our pilot applications in the Austrian Grid project. Imagine the following traditional steps taken by an ecological scientist to achieve his research goal, described in Fig. 5. First, the data sources that are likely to contribute and be needed have to be found, accessed and prepared for data mining. Some species data is only accessible via FTP in a proprietary file format; the needed weather data is distributed over two publicly available databases. The ecologist uses hand-written scripts and stand-alone tools installed on his workstation to bring the data into a form suitable for his data mining software. Then he applies standard data mining code as well as some domain-specific method to the data. This is likely to happen locally, which takes a lot of time to complete and hides the used workflow and parameters from later usage and refinement by other members of his research group. For evaluation and visualization he is restricted to the offerings of his (proprietary)



Fig. 4. The knowledge discovery process (left) and the GridMiner solution to it (right)

Fig. 5. Use-case of a data mining task done with traditional tools

data mining tool. This kind of data mining process is work-intensive, not well documented and not suited for collaborative work. It does not support the sharing and reuse of knowledge and its corresponding workflow, and the local processing restricts the processable data volume and therefore the possible knowledge findings. The local efforts to automate parts of this process are hard to migrate and reuse for other scientists. This simple use-case scenario already shows the need for a more integrated approach towards a data mining infrastructure, one that supports the user in the various phases and allows him to focus on his specific goal. Distributing the workload as much as possible (technically and organizationally), as well as distributing the already found (partial) solutions, are very important aspects of this assistance. Let us now have a look

at a solution to the task described above using a GridMiner-supported approach. To have on-demand access to the required data sources in a transparent way, with all data heterogeneities handled, the scientist (or a project administrator) defines a virtual data source via an administrative GUI supporting this task, as shown in Fig. 6. Access to the public databases is likely to be needed by other research colleagues as well, so once configured and set up, they can be used in a virtual data source by simply dragging the representing icons onto the workspace and defining how they should be combined, e.g. specifying the join attributes. For transforming and manipulating the data in any way one might need, user-defined functions can be integrated into this process on-the-fly. Supported data sources include native XML databases, comma-separated value files and relational databases, as well as the de facto standard service for data access on the Grid called OGSA-DAI (Open Grid Services Architecture for Data Access and Integration; http://www.ogsadai.org).

Fig. 6. The preparation of a virtual data source within the administrative GUI

The resulting virtual relational schema can be further annotated to ease foreign usage and saved into the project-wide workspace provided by the GridMiner. The fully specified new virtual data source then gets deployed at one of the available data services of the working group. One data service can accommodate multiple virtual data sources. Figure 6 shows the administrative GUI for combining two data sources via a union operation into a virtual one and deploying it on a specific data service. This approach can be seen as a specify-once, use-often principle, saving time by reusing previous efforts, in this example for data access and integration. The GridMiner infrastructure supports this for other tasks as well, e.g. workflow specifications. After deployment, the virtual data source can be used as if it were a relational database, including the possibility to query it against the defined schema with well-established and very common SQL statements. Additionally, the data service supports data preprocessing functionality. The user defines the SQL query as well as the preprocessing tasks to be applied to the resulting data, e.g. replacing missing values by the mean of a column. This define-local-and-process-global approach is another central feature of the GridMiner infrastructure. The distributed services are instructed as if everything were available locally, but the actual work is delegated back to them, e.g. you get the data you asked for, preprocessed in the way you like. Now that we have solved the data access problems, let us take a look at the data mining and other data analysis parts, and at the further assistance given by the GridMiner infrastructure. We support data understanding by allowing various visualizations of a data set, e.g.
histograms (and other data statistics), as used in Fig. 7. The researcher will have to get familiar with a data set in order to make better decisions about which data mining technique to choose and how to parameterize it.

Fig. 7. The preparation of a user workflow within the GridMiner GUI

In our example he needs a standardized data mining technique followed by an application- or domain-specific method. For the first kind, our infrastructure gives great support, having parallel and/or distributed versions of the techniques shipped with it. We are, to the best of our knowledge, the only project investigating and implementing traditional data mining methods, e.g. classification, clustering, sequence analysis, etc., using high-performance technologies and tailoring them for the Grid. An especially interesting development is their combination with our distributed OLAP engine, allowing the application of different data mining techniques on top of it, e.g. association rules. These techniques are called OnLine Analytical Mining (OLAM, see Fig. 4). But the latter kind, the domain-specific method, is also easy to achieve within our framework, and the effort to integrate it occurs just once for each needed functionality. The domain-specific method has to be wrapped as a service, for which implementation templates are provided, and deployed at some host. Then the details of this new service have to be specified, like its location and accepted input/output parameters, and stored in our knowledge base. The information from the knowledge base is used to set up the GUI; in our case a new icon for the new domain-specific data mining service would become available. In order to use the new service in a workflow, the GridMiner system has to know how to handle and communicate with it. For this purpose each service has a Web application associated with it, for the default ones a Java Server Page, which guides the user through its usage. This configuration component is used each time a task has to be configured during a workflow, e.g. for the decision tree data mining method the target attribute has to be specified.
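The virtual data source and its attached preprocessing, described above, have a compact relational analogue: a view that unions two physical sources, queried with an imputation expression that replaces missing values by the column mean. The sketch below uses SQLite as a stand-in for the distributed data services; the table and column names are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE weather_a (station TEXT, temp REAL);
    CREATE TABLE weather_b (station TEXT, temp REAL);
    INSERT INTO weather_a VALUES ('w1', 10.0), ('w2', NULL);
    INSERT INTO weather_b VALUES ('w3', 14.0);
    -- the 'virtual data source': a union view over both physical sources
    CREATE VIEW weather AS
        SELECT * FROM weather_a UNION ALL SELECT * FROM weather_b;
""")

def query_with_mean_imputation(con, table, column):
    # preprocessing attached to the query: replace missing values by the
    # column mean before handing the data to the mining step
    sql = f"""SELECT station,
                     COALESCE({column}, (SELECT AVG({column}) FROM {table}))
              FROM {table}"""
    return con.execute(sql).fetchall()
```

The client formulates everything against the single virtual schema (define local), while the work of combining the sources and filling the gaps happens where the data lives (process global), which is the principle the text describes.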
By this process, nearly any needed functionality, not only domain-specific data mining methods, can be seamlessly integrated into our framework without losing the conveniences of the GridMiner infrastructure. Having defined the workflow via drag and drop, connected the various parts as needed and configured them appropriately, the scientist can start the execution of the work by simply pressing the play button. The history tab in the bottom frame of the GUI reports all applied steps within the GridMiner application and their success, failure or other status messages. The progress can also be followed visually, because specially tagged items of the visualized workflow indicate which parts have already finished successfully; their intermediate results can be viewed and examined while the rest of the workflow is still running. An important feature is the possibility to disconnect the user GUI and come back later to check for results. This is made possible by distributing the components over three layers, as already shown in Fig. 4 on the right. The GUI is the visualization layer, allowing the user and administrator to interact conveniently with the GridMiner infrastructure; it is based on Java Web Start. Not much processing power is needed here; everything is delegated to the Web and Grid layers. This thin-client approach is also suitable for mobile device support. The Web layer acts as glue between the GUI and the Grid. A workflow service acts on behalf of the user, processing his requests and collecting intermediate results. The knowledge base is the permanent brain of the GridMiner infrastructure, storing the information about the pre-configured services and the available data sets for various projects. Research colleagues can load previous work done by other project members from the knowledge base. This offers the possibility to inspect the used parameters and methods or even refine the workflow.
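The execution model just described, running a workflow step by step, recording per-task status, and keeping finished intermediate results available while the rest still runs, can be sketched as a minimal DAG executor. This is our own illustrative simplification, not the GridMiner workflow service:

```python
def run_workflow(tasks, deps):
    """Execute a small DAG of tasks.  `tasks` maps a name to a function that
    receives the dict of already-computed results; `deps` maps a name to the
    names it depends on.  Finished results stay accessible, so a client can
    inspect intermediate output while the rest of the workflow is running."""
    results, status = {}, {}
    pending = set(tasks)
    while pending:
        # a task is ready once all of its dependencies have produced a result
        ready = [t for t in pending if all(d in results for d in deps.get(t, []))]
        if not ready:
            raise ValueError("cycle or unsatisfiable dependency in workflow")
        for t in ready:
            try:
                results[t] = tasks[t](results)
                status[t] = "success"          # mirrored by the GUI history tab
            except Exception as exc:
                results[t] = None
                status[t] = f"failure: {exc}"
            pending.discard(t)
    return results, status
```

In the real system the per-task `status` entries are what the history tab reports and what the visual tagging of workflow items is driven by; disconnecting the GUI is harmless because `results` and `status` live with the workflow service, not with the client.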
The Grid layer supplies virtualized processing power, storage and network capacity in a secure and manageable way. The user experiences, for instance, a decision tree service while remaining shielded from the fact that it is distributed and parallelized over heterogeneous computing nodes. Together, these three layers enable a simple and intuitive way of working persistently on one's research targets in complex workflows, and of achieving results in a shorter time.
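The disconnect-and-return behaviour described above hinges on the workflow service in the Web layer keeping per-task status and results on behalf of the user. A minimal sketch of that idea follows; the class and method names are illustrative, not GridMiner's own interfaces:

```python
class WorkflowService:
    """Executes a sequence of tasks and records their status, so a
    disconnected client can reconnect later and poll for results."""

    def __init__(self, tasks):
        # tasks: list of (name, callable) pairs executed in order
        self.tasks = tasks
        self.status = {name: "pending" for name, _ in tasks}
        self.results = {}

    def run(self):
        for name, fn in self.tasks:
            self.status[name] = "running"
            try:
                self.results[name] = fn()
                self.status[name] = "finished"
            except Exception as exc:
                self.status[name] = f"failed: {exc}"
                break   # downstream tasks stay "pending"

    def poll(self, name):
        # What a reconnecting GUI would call to inspect progress
        return self.status[name], self.results.get(name)

svc = WorkflowService([
    ("preprocess", lambda: "cleaned data"),
    ("mine", lambda: "decision tree model"),
])
svc.run()
print(svc.poll("mine"))   # -> ('finished', 'decision tree model')
```

Because the status and intermediate results live in the service rather than in the GUI, the thin client can drop the connection at any point without losing work in progress.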


The knowledge life cycle does not end once a model has been discovered. Our scientist from the initial use case, working with traditional tools and hand-crafted solutions, runs into various issues: the documentation of the underlying data and of the parameters used by the data mining tasks might be hard to extract. In the GridMiner system, the whole workflow, including its intermediate results, can be stored for later use. This makes it possible to branch off in a different direction from any point in the workflow, e.g. to change the size of the evaluation data set or to experiment with the parameters of a data mining method. Repeatability comes automatically, which is especially important when discoveries are to be published and subsequent information requests answered. With regard to information reuse, the knowledge base plays an important role: discovered models can be used as inputs for other data mining tasks, and intermediate results can be cached, seamlessly saving the effort required to gather them again. A valuable form of support, made possible by the input/output descriptions of the workflow components stored in the knowledge base, is the blocking of invalid connections between components, e.g. an attempt to connect a model visualization to a data source instead of to a data mining task.

The GridMiner framework accommodates the changed way in which scientists are doing their research. More data sources need to be combined in a flexible way, often implying huge data sizes and preprocessing demands that cannot be managed in a centralized way. The reuse of knowledge, and the documentation of how it has been obtained, are becoming more and more important. We believe in the importance of Isaac Newton's famous quote: "If I have seen further it is by standing on the shoulders of giants." For this, one has to be able to focus on one's scientific work and collaborate with colleagues at will, sharing ideas, solutions and the data sets involved.
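The blocking of invalid connections mentioned above can be sketched as a simple type check over the stored input/output descriptions. The component names and type labels below are hypothetical examples, not the actual GridMiner vocabulary:

```python
# Input/output descriptions as they might be stored in the knowledge base.
COMPONENT_IO = {
    "data-source":      {"in": set(),            "out": {"data-set"}},
    "decision-tree":    {"in": {"data-set"},     "out": {"model"}},
    "model-visualizer": {"in": {"model"},        "out": set()},
}

def can_connect(src: str, dst: str) -> bool:
    """A connection is allowed only if some output type of src
    matches an accepted input type of dst."""
    return bool(COMPONENT_IO[src]["out"] & COMPONENT_IO[dst]["in"])

print(can_connect("data-source", "decision-tree"))     # -> True
print(can_connect("data-source", "model-visualizer"))  # -> False (blocked in the GUI)
```

In the GUI, a check of this kind is enough to grey out or reject an edge while the workflow is being drawn, before any Grid resource is touched.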
Our framework delivers a standards-based system able to export and import data and knowledge from other vendors. It allows the integration of one's own required functionality without losing the conveniences of the predefined components available in GridMiner. The online demo available at http://www.gridminer.org gives an impression of the overall system functionality, together with further information about the various components developed.

5. Example Grid applications

The Grid serves as an enabling technology for a broad set of applications in science, business, entertainment, health and other areas. However, the community faces a problem common to the development of new technologies: applications are needed to drive the research and development of the new technologies, but applications are difficult to develop in the absence of stable and mature technologies (Berman, Hey, Fox, 2003). The Grid can be said to be delivering in a scientific context, because it opens up to scientific scrutiny challenging problems that otherwise would have remained out of reach; here we observe an application push. The same is not true in the business domain, mainly because the Grid does not take into account typical management issues faced by many industries (Parsons, 2005). In this section, we discuss some of the successful Grid applications and application middleware efforts to date.

5.1 Bioinformatics

In this area, a large number of specialized databases have emerged for protein structures, genome sequences and life science literature, and they increase rapidly in size and complexity. Bioinformaticians need to interconnect them in order to mine for meaningful information and knowledge. An open middleware for data-oriented in silico experiments in biology is the myGrid project (Goble, Wroe, Stevens, 2003). It has been applied in various projects to study the genetic basis of autoimmune diseases, to find and classify proteins secreted by the anthrax bacterium, and to calculate properties of chemical compounds.

The BIRN project (Ellisman, Peltier, 2004) is pioneering the use of the Grid for medical research and patient care. The driving force behind BIRN is neuroimaging, one of the most rapidly advancing fields in biological science. The integration of these advances allows the investigation of previously intractable problems, e.g. cross-correlating functional and structural brain data to develop a more complete understanding of brain function.

5.2 Astronomy

There is a vast amount of astronomical data available, much of it well documented and, in comparison with other fields of science, subject to few intellectual property restrictions. Traditionally, astronomers focused on a particular wavelength range. Having data from other wavelengths available and integratable will broaden the horizon for new findings. A step towards supporting this was the establishment of national and international virtual observatories. A large part of the work effort is devoted to enabling the federation of much of the digital astronomical data, including the computing power required for complex image processing and spatial queries; as an example, visit the Virtual Sky project (http://www.virtualsky.org). Another part concerns the standardization and translation of data resources built by different people in different ways; a first result is the VOTable specification (http://us-vo.org/voTable).

6. Future trends: towards the Wisdom Grid

Web Intelligence (WI) is a new direction for scientific research and development that explores the fundamental roles as well as the practical impacts of Artificial Intelligence (AI) and advanced Information Technology (IT) on the next generation of Web-empowered products, systems, services and activities (Zhong, Liu, Yao, 2003). WI aims to develop the Wisdom Web in order to help people achieve better ways of living, working, learning etc., and thus to improve the quality of life.
There are several research areas related to WI topics, for instance Web agents, Web farming, Web mining, Web-based applications, Web information management and Web human-media engineering. Our contribution to this research area is to integrate knowledge discovery and knowledge management into an autonomic system that can give strong support to other intelligent entities in their needs for knowledge and for appropriate ways (mechanisms) of applying knowledge in practice. We call such an infrastructure the Wisdom Grid and see its place in the space of WI-related topics, where it can support them by managing the whole life cycle of knowledge, from its discovery to its reuse and practical application. Figure 8 pictures a layered architecture of the Wisdom Grid infrastructure and depicts which actor is associated with each layer. The Knowledge Consumer is an actor asking for knowledge. It could be an agent, a service or a graphical user interface able to construct questions (e.g. in the FIPA ACL/RDF (http://www.fipa.org) message format) in a way the Intelligent Interface understands. This actor initializes the whole process of knowledge discovery and receives the final results. The knowledge search process is organized by the Knowledge Management Infrastructure: either the knowledge is available in the knowledge base and can be retrieved immediately and passed to the Intelligent Interface, or it is not available and has to be searched for by the Data Mining and OLAP Infrastructure (realized by the GridMiner) in the databases attached to the Grid; the functionality of this infrastructure is based on the Generic Grid Services layer. The Domain Application Expert is responsible for building and managing the appropriate ontologies, the Data Mining and OLAP Expert for data preparation for data

Based on the definitions provided by Webster's Dictionary and the Oxford Encyclopedic Dictionary, we understand wisdom as knowledge and experience, and the capacity to make due use of them.

mining and OLAP, the selection of appropriate exploration methods and their parameters; the Service Provider Administrator configures the Generic Grid Services (e.g. the Globus toolkit) and optimizes the mapping of high-level Wisdom Grid services onto the available Grid resources. The overall Wisdom Grid architecture and the participation of its components in the knowledge discovery workflow are described in our paper (Brezany et al., 2004).

Fig. 8. The Wisdom Grid

7. Conclusions

First we introduced the Grid idea and described the main phases of Grid evolution. The presentation of our own research results comprises the kernel of the paper: GridMiner, a research infrastructure for e-science analytics, includes a data integration subsystem, high-performance data mining and OLAP services, as well as advanced interactive workflow management concepts. A research exhibition of a running GridMiner prototype was provided at the SC2004 and SC2005 events. We are now working on the specification and development of a follow-up Grid system called the Wisdom Grid.

Abbreviations

OLAP, Online Analytical Processing; KDD, Knowledge Discovery in Databases; OGSA, Open Grid Services Architecture; OGSA-DAI, Open Grid Services Architecture for Data Access and Integration; API, Application Program Interface; WSRF, Web Services Resource Framework; XML, Extensible Markup Language; HPC, High Performance Computing; EGEE, Enabling Grids for e-Science; GUI, Graphical User Interface; FTP, File Transfer Protocol; OLAM, On-Line Analytical Mining; SQL, Structured Query Language; BIRN, Biomedical Informatics Research Network; WI, Web Intelligence; AI, Artificial Intelligence; FIPA, Foundation for Intelligent Physical Agents; RDF, Resource Description Framework.

References

Atkinson, M., Baxter, R., Hong, N. C. (2002): Grid data access and integration in OGSA. http://www.cs.man.ac.uk/grid-db/papers/OGSA-DAI-spec-1.2.pdf.
Berman, F., Hey, A. J. G., Fox, G. (2003): The Grid: past, present, future. In: Berman, F., Hey, A. J. G., Fox, G. (eds.): Grid computing: making the global infrastructure a reality: 9-50. John Wiley & Sons.
Brezany, P., Goscinski, A., Janciak, I., Tjoa, A. M. (2004): The development of a wisdom autonomic grid. In: Workshop on Knowledge Grid and Grid Intelligence 2004, held in conjunction with the 2004 IEEE/WIC Int. Conf. on Web Intelligence/Intelligent Agent Technology, Beijing, China, September 20, 2004.
Cannataro, M., Talia, D. (2003): Knowledge Grid: an architecture for distributed knowledge discovery. Communications of the ACM, January 2003.
Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., Tuecke, S. (2001): The Data Grid: towards an architecture for the distributed management and analysis of large scientific datasets. Journal of Network and Computer Applications.
Curcin, V., Ghanem, M., Guo, Y., Kohler, M., Rowe, A., Syed, J., Wendel, P. (2002): Discovery Net: towards a grid of knowledge discovery. In: 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, July 2002.
Czajkowski, K., Ferguson, D. F., Foster, I., Frey, J., Graham, S., et al. (2004): The WS-Resource Framework. http://www-106.ibm.com/developerworks/library/ws-resource/ws-wsrf.pdf.
De Roure, D., Baker, A. M., Jennings, N. R., Shadbolt, N. R. (2003): The evolution of the Grid. http://citeseer.nj.nec.com/535794.html.
De Roure, D., Jennings, N. R., Shadbolt, N. R. (2003): The Semantic Grid: a future e-science infrastructure. In: Concurrency and Computation: Practice and Experience.
Ellisman, M., Peltier, S. (2004): Medical data federation: the Biomedical Informatics Research Network. In: The Grid 2: blueprint for a new computing infrastructure: 109-120.
Erwin, D. W., Snelling, D. F. (2001): UNICORE: a Grid computing environment. Lecture Notes in Computer Science, 2150.
Foster, I., Kesselman, C. (eds.) (1998): The Grid: blueprint for a new computing infrastructure. Morgan Kaufmann.
Foster, I., Kesselman, C., Tuecke, S. (2001): The anatomy of the Grid: enabling scalable virtual organizations. Int. J. Supercomputer Applications, 15 (3): 200-222.
Goble, C., De Roure, D. (2003): The Semantic Grid: a future e-science infrastructure. www.semanticgrid.org.
Goble, C., Wroe, C., Stevens, R. (2003): The myGrid project: services, architecture and demonstrator. Technical report, EPSRC e-Science Pilot Project myGrid.
Grimshaw, A. S., et al. (2003): From Legion to Avaki: the persistence of vision. In: Berman, F., Hey, A. J. G., Fox, G. (eds.): Grid computing: making the global infrastructure a reality: 265-298. John Wiley & Sons.
Hey, T., Trefethen, A. (2003): The data deluge: an e-science perspective. In: Berman, F., Hey, A. J. G., Fox, G. (eds.): Grid computing: making the global infrastructure a reality: 809-824. John Wiley & Sons.
Hey, T., Trefethen, A. (2005): The e-science challenge: creating a reusable e-infrastructure for collaborative multidisciplinary science. CTWatch Quarterly, 1: 26.
Jeffery, K. G. (2001): GRIDs in ERCIM. ERCIM News, April 2001.
Open Grid Services Architecture for Data Access and Integration (OGSA-DAI). http://www.ogsadai.org.
Parsons, M. (2005): The Next Generation Grid: 21-25.
Smarr, L., Catlett, C. E. (1992): Metacomputing. Communications of the ACM, 35 (6): 44-52.
Tjoa, A. M., Janciak, I., Woehrer, A., Brezany, P. (2005): Providing an integrated framework for knowledge discovery on computational grids. Proc. of I-KNOW'05, 5th Int. Conf. on Knowledge Management: 604-611, Graz, June 29 - July 1, 2005.
Zhong, N., Liu, J., Yao, Y. (eds.) (2003): Web Intelligence. Springer.
Zhuge, H. (2004): The Knowledge Grid. World Scientific Co.
