
Future Generation Computer Systems 24 (2008) 824–832


Improving the performance of Federated Digital Library services


Jernej Trnkoczy, Vlado Stankovski
Faculty of Civil and Geodetic Engineering, University of Ljubljana, Jamova 2, SI-1000 Ljubljana, Slovenia

Corresponding author: V. Stankovski. Tel.: +386 (0)1 4768511, +386 (0)41 200565 (mobile); fax: +386 (0)1 4250681. E-mail: vlado@stankovski.net. URL: http://www.stankovski.net.

doi:10.1016/j.future.2008.04.007

Article history: Received 4 December 2007; Received in revised form 8 April 2008; Accepted 8 April 2008; Available online 18 April 2008

Keywords: OAI-PMH; Grid; Federated Digital Library; Performance

Abstract

The number of Digital Libraries (DLs) accessible over the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has been constantly increasing in the past years. Earlier efforts in the DL area have concentrated on metadata harvesting and the provision of value-added Federated Digital Library (FDL) services to users. FDL services, however, have to meet significant performance and scalability requirements, which is difficult to achieve in centralized metadata harvesting systems. The goal of the present study was to evaluate the benefits of using Web Services Resource Framework (WSRF) compliant grid middleware infrastructure for providing efficient and reliable FDL services. The presented FDL application allows for the parallel harvesting of OAI-PMH compliant DLs. The results show that this approach efficiently solves the performance-related problems, while also contributing to greater flexibility of the system. The quality of service is improved, as metadata can be updated frequently and the system does not exhibit a single point of failure.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

The advancement of World Wide Web (WWW) technologies causes an exponential growth of available, widely distributed digital content. Web search engines, such as Google or Yahoo, and encyclopedias, such as Wikipedia, already point to millions of digital objects and may be considered huge, ubiquitous digital libraries. Digital Library (DL) technologies address the need to manage vast amounts of available digital content (i.e. free-text articles and multimedia) and provide sophisticated content and knowledge services for the users. An important goal in the development of DL technology is to improve the quality, scope and accuracy of existing Web search engines by utilizing structured resource descriptions, i.e. semantically rich metadata. This is possible since metadata are usually freely available. On the other hand, due to restrictive copyrights, the actual digital content is freely available only in a limited number of cases. A practical approach is, therefore, to harvest metadata from a number of geographically distributed DLs at a central location, index these metadata, and let the users search the generated index from a single user interface. The resulting search services are also known as Federated Digital Libraries (FDLs).

When building an FDL, it is important to take into consideration the growing number of available DLs, the necessity to use

advanced, computationally intensive information retrieval algorithms, as well as the growing number of users who need personalized perspectives on the DL content. All these factors are likely to cause scalability and performance problems when building FDLs. For example, Maly et al. [17] used exhaustive harvesting to build and update a large collection of metadata. Their FDL took over four days to complete one cycle of harvesting from over 160 existing DLs and two additional days to index the harvested metadata. They found that the harvesting process was long-running because of the low network bandwidth and the slow response of the contacted DLs.

In this kind of application, the scalability, reliability and performance problems may be alleviated by the use of grid technology [8]. Grid technology is particularly beneficial when computational power and storage must be scaled to meet the demands of complex problem-solving applications. Properties like these make grid technology particularly suitable for the area of DLs, which is demonstrated by a number of on-going research projects, such as GRACE [12], DILIGENT [2], DELOS [6], Digital Library GRID [16] and Cheshire3 [14].

The proposed innovation is to develop a full range of FDL services on top of mainstream, WSRF-standard [4] compliant grid technology. The application will benefit by exploiting available, otherwise idle computational resources on the Internet. We investigated a use case scenario in which:


- the end-user selects a set of distributed DLs;
- metadata are harvested from the selected DLs;
- the harvested metadata are transformed into a proprietary format, which is used by a particular indexing algorithm;
- a central index is computed;
- the index is used by a search service;


- the user can now search for digital objects contained in geographically distributed DLs.

Our goal is to overcome the scalability, reliability and performance problems of today's FDLs by distributing the metadata harvesting, indexing and other computationally intensive and time-consuming tasks to various computational clusters on the Internet. As a starting point, we investigated the possibility of speeding up the process of metadata harvesting by parallelizing it. In this paper, we therefore focus on the key performance parameters related to the metadata harvesting problem. To the best of our knowledge, this is the first study that investigates the performance of metadata harvesting in the context of WSRF-standard compliant grid environments. The only exception to WSRF compliance is the Grid File Transfer Protocol (GridFTP) service [1], which is used although it is not WSRF compliant.

The paper is organized as follows. Section 2 presents the state of the art in the area of grid computing for DL applications. Section 3 describes the methodology and the middleware technologies that were used to build the experimental grid environment. Section 4 describes the actual grid test bed used in our experiments and the developed grid-enabled FDL application. The evaluation of the system performance is presented in Section 5, and finally, Section 6 discusses the results obtained and presents the conclusions.

2. State-of-the-art overview

With the rapid evolution of grid technologies and the benefits they offer, several projects combining DL and grid technologies have recently emerged. Here is a brief overview of these projects.

The key goal of the Digital Library Grid [16] project is similar to that of existing Web search engines such as Google, i.e. to harvest all of the existing content repositories in the world. For this purpose, grid technology is used to distribute the cost of high-latency harvesting and indexing tasks to grid nodes, and leave only the cost of maintaining the federated search service to a service provider. Their grid-based architecture, similar to ours, enables parallel harvesting over the OAI-PMH protocol and supports dynamic allocation of harvesting nodes, scheduling of harvesting tasks to maximize performance, and uniform load distribution for the indexing node. However, the Digital Library Grid architecture is tuned only for distributed harvesting and indexing, and it was implemented with version 3 of the Globus Toolkit, which is not WSRF-compliant. Their system scales up the harvesting task, but it does not provide for larger-scale virtualization and personalization of the services.

In the Cheshire3 [14] project, a low-level architecture has been defined that permits DL operations to be distributed over many nodes on a network, vastly increasing the throughput of data for computationally and storage-intensive processes. The implementation uses distributed indexing and search processes over a cluster of high-performance machines to achieve high-speed indexing. Their implementation is not based on standard grid middleware and protocols, such as WSRF, and it uses a proprietary grid solution.

Grid-IR [18] is an initiative to realize Information Retrieval (IR) on the Open Grid Services Architecture (OGSA) platform. It aims to move existing IR standards (such as Z39.50) to the Web service platform. The Grid-IR approach differs from ours in the sense that their architecture is purely service-based, meaning that every entity in the system is implemented as a service (e.g. metadata service, collection management service, indexing service, searching service, query processing service). Furthermore, the Grid-IR project builds on the distributed model of DL federation, while our approach builds on the harvesting model. The Grid-IR initiative is currently a proposed working group of the Open Grid Forum.

The DILIGENT project [2] aims to build a test bed that integrates grid and Digital Library technologies. Their developments are based on the achievements of the European Enabling Grids for E-science (EGEE) project. The EGEE infrastructure already provides some of the functionality required for DILIGENT (e.g. the dynamic allocation of resources, support for cross-organizational resource sharing, and a security infrastructure). To effectively support DLs, additional services are currently being developed, such as support for redundant storage and automatic data distribution, a metadata broker, metadata and content management, advanced resource brokers, approaches for ensuring content security in distributed environments, and the management of content and community workflows, in addition to services that support the creation and management of Virtual DLs.

The GRACE project [12] addresses situations where no centralized index is available. It proposes the development of a distributed search and categorization engine that enables just-in-time, flexible allocation of data and computational resources. GRACE adopts the grid middleware developed by the Large Hadron Collider (LHC) Computing Grid (LCG). In this project, grid technology is used to meet the computational demands of natural language processing methods, mainly text normalization and categorization for indexing purposes. This is accomplished by distributing the computationally intensive part on a grid, which involves secure and dynamic sharing of computational and storage resources.

3. Methodology and grid middleware technologies

This section focuses on two fundamental components of the proposed FDL system: the grid middleware services and tools used to build the FDL application, and the OAI-PMH protocol by which metadata records are harvested from distributed DLs. The grid test bed used in this study is based on state-of-the-art DataMiningGrid [23], Globus Toolkit [9] and Condor [22] middleware technologies, which are described in the following sections.

3.1. Grid middleware services

One of the most important grid-related standards developed in recent years is the Web Services Resource Framework (WSRF), a specification promoted by the Organization for the Advancement of Structured Information Standards (OASIS). WSRF provides a generic, open framework for modeling and accessing stateful resources using Web services, a functionality that is typically needed in today's grid computing infrastructures. Web services, as currently specified by the World Wide Web Consortium (W3C), are usually stateless, i.e. there is no standard way for a Web service to keep its state from one invocation to another. Grid applications, however, do generally require statefulness, and the WSRF specification defines a standard way of making Web services stateful. The latest WSRF specification, version 1.2, was approved as an OASIS Standard in 2006, the highest level of ratification.

Various grid middleware solutions exist and continue to be developed. One of the first grid middleware toolkits implementing the WSRF v. 1.2 specification [4] is the Globus Toolkit 4 (GT4) [10]. GT4 provides a range of grid services that can be directly used to build a distributed grid environment, including data management, job execution management, and community authorization services. All these services can be used to build custom grid applications and are elaborated in detail elsewhere [1,7,11,20]. Besides these ready-to-use services, GT4 provides an Application Programming Interface (API) that allows for the development of proprietary WSRF-compliant services. For these reasons, GT4 was selected for use in this study.
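The notion of a stateful WS-Resource can be pictured with a small sketch. The following Python fragment is only a conceptual illustration of the WSRF factory/instance pattern (a client receives a resource key from a factory operation and presents it on later invocations, so that state survives between calls); it does not use the GT4 API, and all names in it are hypothetical.

```python
# Conceptual sketch of the WSRF pattern: state is kept in named
# resources, and a client addresses a resource by the key it received
# from a factory operation. Names are hypothetical, not the GT4 API.
import uuid

class HarvestResourceHome:
    """Holds the state of all harvest 'resources' between invocations."""
    def __init__(self):
        self._resources = {}

    def create(self, dl_url):
        # Factory operation: returns a reference (here, a resource key).
        key = str(uuid.uuid4())
        self._resources[key] = {"dl_url": dl_url, "records": 0}
        return key

    def add_records(self, key, n):
        # A "stateful" operation: its effect depends on earlier calls.
        self._resources[key]["records"] += n

    def get_status(self, key):
        return dict(self._resources[key])

home = HarvestResourceHome()
key = home.create("http://example.org/oai")  # placeholder DL
home.add_records(key, 100)
home.add_records(key, 250)
print(home.get_status(key))  # state has survived across invocations
```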


The following is a short review of the relevant ready-to-use GT4 services. The Web Service Grid Resource Allocation and Management (WS-GRAM) service provides all the basic mechanisms required for execution management, i.e. the initiation, monitoring, management, scheduling, and coordination of remote computations.

GT4 also provides a number of services for data management. The GridFTP and Reliable File Transfer (RFT) [15] services are particularly useful for the FDL application. These data services are mainly used for the transfer and management of distributed, file-based data, including program executables and their software libraries. GridFTP is used, for example, to transfer executables and required libraries to the selected computational server in the grid.

Information services are used to discover, characterize and monitor resources, services and computations [3]. GT4's Monitoring and Discovery System 4 (MDS4) provides information about the available grid resources and their status. It has the ability to collect and store information from multiple, distributed information sources. This information is used to monitor (e.g. to track usage) and discover (e.g. to assign computing jobs and other tasks) the current state of services and resources in a grid system. The DataMiningGrid high-level services (in particular the Resource Broker and the Information Integrator) use the MDS4 service. In our FDL application, the following GT4 services are extensively used: WS-GRAM, GridFTP, and MDS4.

Scheduling of grid jobs in local computing clusters is achieved by using the Condor [22] middleware. Condor is specialized workload management software for submitting compute-intensive jobs to local computational clusters. In our application, GT4 submits a subset of parallel jobs to appropriate Condor clusters, and it is up to the Condor software to place them into a local queue, choose when and where in the local cluster to run the jobs, monitor the progress of the jobs, and ultimately inform the GT4 services upon their completion.

3.2. DataMiningGrid high-level services

In addition to the core grid services provided by GT4, other high-level, WSRF-compliant, ready-to-use services have recently been developed under the DataMiningGrid project [5]. Here, we provide a brief overview of the Resource Broker and the Information Integrator service, which are used extensively in our personalized FDL application. These services support the parallel execution of a variety of batch-style programs on arbitrary machines in the grid environment.

3.2.1. Resource broker

The Resource Broker service [13] is responsible for the execution of software resources, such as the DL harvesting application, as stand-alone applications anywhere in the grid environment. It provides matching between a request for application execution, which is also called a job in grid terminology, and the available computational and data resources in the grid. It takes as input the computational requirements of the job (Central Processing Unit power, memory, disk space, etc.) and its data requirements (data size, data transfer speed, data location, etc.) and selects the most appropriate execution machine for the particular job. The job is passed on to the WS-GRAM service and executed either on an underlying Condor cluster or by using the GT4 Fork mechanism. The Resource Broker service is capable of delegating jobs to resources spanning multiple administrative domains. The execution machines are selected automatically, so that the inherent complexity of the underlying infrastructure is hidden from the users. The Resource Broker service orchestrates the automatic data and application transfers between the grid nodes, using the GridFTP and RFT components of GT4 for the transfers.
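The matchmaking step can be pictured as a filter-and-rank procedure over the advertised machine properties. The sketch below illustrates the idea only; the field names and the ranking rule are our own assumptions, not the DataMiningGrid Resource Broker implementation.

```python
# Simplified sketch of broker-style matchmaking: keep the machines
# that satisfy the job's requirements, then rank the survivors.
# Fields and ranking rule are illustrative assumptions.

def select_machine(job, machines):
    candidates = [
        m for m in machines
        if m["cpu_mhz"] >= job["min_cpu_mhz"]
        and m["free_disk_mb"] >= job["disk_mb"]
        and m["free_memory_mb"] >= job["memory_mb"]
    ]
    if not candidates:
        return None  # the job stays queued until resources free up
    # Prefer the machine with the most free memory (one possible rank).
    return max(candidates, key=lambda m: m["free_memory_mb"])

job = {"min_cpu_mhz": 1000, "disk_mb": 500, "memory_mb": 256}
machines = [
    {"name": "cluster-a", "cpu_mhz": 2400, "free_disk_mb": 8000, "free_memory_mb": 1024},
    {"name": "cluster-b", "cpu_mhz": 900, "free_disk_mb": 9000, "free_memory_mb": 2048},
]
print(select_machine(job, machines)["name"])  # -> cluster-a
```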

The DataMiningGrid Resource Broker can execute multi-jobs. Multi-jobs are collections of single jobs that are bound for parallel execution. In the DataMiningGrid, a multi-job usually consists of a single application, which is instantiated with different input parameters and/or input data sets. In the case of our FDL application, a multi-job is formed by instantiating the DL-Harvester application (see Section 3.3) several times, each time with a different DL to be harvested. The individual jobs are then executed in parallel on various computational servers in the grid environment. Each job, therefore, represents the harvesting of one DL, while a multi-job represents the harvesting of several DLs in parallel.

3.2.2. Information integrator service

The Resource Broker makes extensive use of the Information Integrator service, which is also provided by the DataMiningGrid and operates in connection with the MDS4 service provided by GT4. The Information Integrator service is designed to feed into other grid components and services, including services for discovery, replication, scheduling, troubleshooting, application adaptation, and so on. Its key role is to create and maintain a register of grid-enabled applications. It facilitates the discovery of grid-enabled applications on the grid and their later use through the Resource Broker service.

3.3. The OAI-PMH protocol and the DL-Harvester application

Metadata contained in DLs can be accessed over various protocols, such as Z39.50 [24] or the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [19]. OAI-PMH is a simple protocol that allows data providers to expose their metadata for harvesting. It is specified by the Open Archives Initiative (OAI), which develops and promotes interoperability standards to facilitate the efficient dissemination of metadata on the WWW; OAI-PMH is the technological framework for this purpose. The protocol is independent of both the type of content offered (e.g. free-text articles, multimedia) and the economic mechanisms surrounding that content, and it promises to have a big impact on opening up access to a wide range of digital materials. Currently (Feb. 2008), there are 771 OAI-PMH compliant repositories listed on the OAI web page [26], 1075 on the OpenDOAR directory of academic open access repositories [27] and 1010 on the Registry of Open Access Repositories (ROAR) [28] web portal. With the growing acceptance of the OAI initiative, the number of OAI-compliant repositories is rapidly increasing.

The OAI-PMH protocol supports metadata dissemination and harvesting in different metadata formats. The requested metadata records are returned as well-formed Extensible Markup Language (XML) instance documents that are valid according to a prescribed XML schema. The characters are encoded in the 8-bit UCS/Unicode Transformation Format (UTF-8), and the Hypertext Transfer Protocol (HTTP) is used for transport. As a minimum standard for interoperability, OAI-PMH compliant DLs must be able to disseminate metadata in the Dublin Core (DC) format [25]. DC defines fifteen metadata elements for simple content description and discovery, such as Title, Creator, Subject, Abstract, Publisher, etc. These kinds of metadata were used for the present study.

The OAI-PMH protocol supports both full harvesting and selective harvesting of DLs. Metadata harvesting is achieved through the ListRecords HTTP request. When this request is issued to a repository, it returns a complete list of the metadata records contained in that repository. If the repository is big, the list of metadata records may be too large, so several HTTP requests and responses are needed in order to achieve full harvesting. In this case:


(1) The repository replies to the ListRecords request with an incomplete list and a resumption token. The number of metadata records included in the returned incomplete list is not defined by the protocol itself, so it varies depending on the repository implementation.
(2) In order to assemble a complete list, the harvester needs to issue additional requests, using the resumption tokens as arguments, until the last record list with an empty resumption token is received.
(3) A complete list of records is then formed by concatenating the separate lists collected from the sequence of requests.

The OAI-PMH protocol also provides specifications for selective harvesting, which makes it possible to limit harvesting requests to portions of the available metadata in a repository. Two types of harvesting criteria may be combined in an OAI-PMH request: (1) datestamps, to harvest only those records that have been created, deleted or modified within a specified date range, and (2) set membership, to harvest only records that belong to a certain category defined by the library.

For the purpose of this study, we developed and grid-enabled a DL-Harvester application, which harvests a selected DL over the OAI-PMH protocol. The DL-Harvester is a batch-style harvester application, which takes as input the Uniform Resource Identifier of the DL to be harvested, together with additional input parameters that allow for selective metadata harvesting. The DL-Harvester application supports the control flow of the OAI-PMH protocol by handling resumption tokens and concatenating response results automatically. The DL-Harvester application can therefore easily be configured to perform either full harvesting or selective harvesting.
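To make this control flow concrete, the following minimal Python sketch performs a harvest over OAI-PMH, looping on resumption tokens; the optional from/until arguments defined by the protocol enable datestamp-based selective harvesting. The repository URL is a placeholder, and a production harvester would additionally need retries and handling of the UTF-8/XML errors discussed in Section 5.

```python
# Minimal sketch of OAI-PMH harvesting (ListRecords + resumption tokens).
# Placeholder URL; real harvesters need retries and XML error handling.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, metadata_prefix="oai_dc", from_date=None, until_date=None):
    """Yield all <record> elements; from/until enable selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date      # e.g. "2008-01-01"
    if until_date:
        params["until"] = until_date
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            root = ET.parse(response).getroot()
        chunk = root.find(OAI + "ListRecords")
        if chunk is None:
            break  # OAI error response (e.g. noRecordsMatch)
        for record in chunk.findall(OAI + "record"):
            yield record
        token = chunk.find(OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # an empty or absent token marks the last chunk
        # Per the protocol, a resumed request carries only the token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

records = list(harvest("http://example.org/oai"))  # placeholder repository
print(len(records), "metadata records harvested")
```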

Nevertheless, in our use-case scenario each user is allowed to select his own set of DLs to be harvested. The user-selected libraries are harvested on the fly; hence, full harvesting has to be performed (see [23] for details). It should also be noted that selective harvesting is often impossible because (1) support for deleted records is inconsistently implemented in existing DLs, and (2) the instability of DL servers frequently causes problems in determining the datestamps needed to re-synchronize the harvested metadata with the remote DL. Therefore, in practice, the only reliable way to ensure that the aggregated metadata are up to date is to perform a new full harvesting cycle (see [29] for details). For the reasons listed above, and because full harvesting is the most time-consuming case, an experimental setup was designed in which only full harvesting was performed, rather than selective (incremental) harvesting of DLs.

4. Experimental setting

4.1. Resources and test bed

For the purpose of the FDL application, we used a grid test bed developed by the DataMiningGrid Consortium. The test bed spans three countries: the United Kingdom, Germany and Slovenia. The part of the test bed used in the present study is depicted in Fig. 1.

Fig. 1. The grid-enabled DL-Harvester application in the DataMiningGrid test bed.

It consists of 4 front-end servers with GT4 installations and local computational clusters with a varying number of computational machines (from 20 to 80). Condor is used as the local scheduler that controls the local computational clusters. All four GT4 servers run core GT4 and high-level


DataMiningGrid services to support the execution of different grid-enabled applications in the test bed. The DataMiningGrid test bed provides a number of capabilities, the most important being the following:

- The ability to execute a variety of batch-style applications, including the DL-Harvester application, at any appropriate computational server in the grid. Over 25 grid-enabled applications are currently stored in executable repositories on various grid servers. Several of these applications may be used for designing sophisticated DL services. For example, along with the DL-Harvester, a computationally intensive distributed indexing algorithm was also grid-enabled, and the end-users may at any time decide to run it on the corpus of harvested metadata.
- Meta-scheduling, i.e. the dynamic and automatic allocation of optimal computational servers in the grid environment, which is achieved through the use of the DataMiningGrid Resource Broker, the DataMiningGrid Information Integrator service and MDS4.
- Application and data movement across different administrative domains, which is achieved through the use of the GridFTP and RFT services.

In addition to these, the DataMiningGrid test bed has a number of other capabilities, such as a Grid Security Infrastructure, which are extensively described elsewhere [21].

4.2. Grid-enabling the DL-Harvester application

In order to grid-enable the DL-Harvester application, we followed a very simple two-step procedure. In the first step, the actual DL-Harvester executable is uploaded to one of the grid servers. In the second step, an XML document that describes the DL-Harvester application is prepared and registered with the DataMiningGrid Information Integrator service, which passes the XML document to the associated MDS4 service. From this point forward, the DL-Harvester application is ready to be used in the grid environment. The application and all its properties may later be easily found in the grid by searching MDS4; this information is also used by other grid services, such as the Resource Broker. The XML document that describes the DL-Harvester application is, in fact, an instance of a generic Application Description Schema (an ADS instance), which was developed recently by the DataMiningGrid project.

The ADS provides properties for describing applications in a uniform way so that they can later be executed in a grid environment; it is described in detail elsewhere [21]. The ADS instance contains valuable information about the application domain, the properties of the executable, a description of its input parameters and data, its processing, storage and memory requirements, and the exact storage location of the DL-Harvester executable in the grid environment.

The FDL application is implemented as a client to the Resource Broker service. The client first composes a multi-job to be executed on the grid. This is done by filling in additional properties in the DL-Harvester's ADS instance: for example, the URLs of all the DLs to be harvested are included in the ADS instance, the storage location where the harvested records will be concatenated is specified, and so on. The result is a fully populated XML instance, which represents the description of a multi-job to be executed on the grid. The client then issues this multi-job to the Resource Broker service (Step 1 in Fig. 1). Once it receives a multi-job, the Resource Broker service selects appropriate computational servers in the test bed. The GridFTP service is then called to transfer copies of the DL-Harvester executable to all of the selected grid servers (Step 2). After the DL-Harvester transfer is complete, the Resource Broker submits the harvesting jobs to the WS-GRAM services (Step 3). While the jobs are executed, the Resource Broker keeps a record of their execution (e.g. time of submission, owner, and status) (Step 4). After execution is completed, the Resource Broker transfers the harvested metadata records and log files from the computational servers to a dedicated Storage Server by using GridFTP (Step 5). The aggregated metadata can now be further processed or used; for example, it would be possible to run an indexing application, again using grid nodes to reduce the processing time. As the last step, the Resource Broker service cleans up all of the temporarily generated files on the computational servers. This process completes the multi-job.
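Outside the middleware, the multi-job pattern itself is easy to picture: one harvester instance per DL, run in parallel, with the outputs merged at a single location. The sketch below uses a local thread pool purely to illustrate this pattern; in the actual system the fan-out, transfers and cleanup are performed by the Resource Broker, WS-GRAM and GridFTP, and the per-DL harvesting would be done by the DL-Harvester of Section 3.3. The URLs are placeholders.

```python
# Illustration of the multi-job pattern with a local thread pool: one
# harvesting job per DL, executed in parallel, outputs concatenated.
# In the real system this fan-out is done by the grid middleware.
from concurrent.futures import ThreadPoolExecutor

def harvest_one(dl_url):
    # Stand-in for one DL-Harvester job (see Section 3.3); it would
    # return the list of metadata records harvested from dl_url.
    return [f"record-from-{dl_url}"]

dl_urls = [  # the user-selected set of DLs (placeholders)
    "http://example.org/oai/a",
    "http://example.org/oai/b",
    "http://example.org/oai/c",
]

with ThreadPoolExecutor(max_workers=len(dl_urls)) as pool:
    results = list(pool.map(harvest_one, dl_urls))  # one "job" per DL

merged = [r for records in results for r in records]  # stage-out/merge
print(len(merged), "records aggregated from", len(dl_urls), "DLs")
```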

4.3. A possible execution scenario and performance measures

Fig. 2 depicts a possible execution scenario for a harvesting multi-job with 5 DLs, while Table 1 defines a number of performance measures, which are used in this study.

Fig. 2. Harvesting multi-job for five Digital Libraries.

Table 1
Definition of performance measures

n                               Number of jobs in a multi-job
A                               Stage-in time
B = B1 + B2                     Additional time due to suboptimal scheduling and grid synchronization overhead
C1, C2, ..., Cn                 Run times of the individual instances of the DL-Harvester application
Cmax = max{C1, C2, ..., Cn}     Run time of the longest-lasting instance of the DL-Harvester application
D                               Stage-out time
E = A + B1 + Cmax + B2 + D      Multi-job run time
F = C1 + C2 + ... + Cn          Sequential multi-job run time
G = F / n                       Average run time of a sequential job
T = F / E                       Actual speed-up
Ttheory = F / Cmax              Maximum theoretically achievable speed-up

The stage-in time A is the time from the moment of submission of the multi-job to the Resource Broker until the first job starts to run. This time includes all the processing time needed to determine the execution machines and to transfer the DL-Harvester application to these machines. The time B represents the additional execution time due to the grid synchronization overhead, compared to an ideal case in which all individual jobs execute within the time frame of the longest job. Each job's overhead includes the Condor overhead time (the time for job scheduling at the local computational cluster). The time Cmax represents the duration of the longest job in the multi-job, which usually corresponds to the largest DL within the set of DLs. The stage-out time D is the time from the end of the last job in the multi-job until the multi-job is completed, i.e. until the results are made available; D is largely the time needed to transfer the harvested metadata records to the specified location where all of the results are merged (e.g. for subsequent indexing purposes). E represents the overall time from multi-job submission until multi-job completion. The theoretical speed-up Ttheory is computed as the sum of all harvesting run times (as if they were executed in sequence) divided by the longest job run time, and represents the theoretical maximum achievable speed-up (i.e. the speed-up that would be obtained if all libraries were harvested in parallel and the processing and data transfer overheads were zero).

5. Performance measurements

The performance measurements had two main goals: (1) to identify the speed-up factors with a growing number of DLs harvested in parallel, and (2) to assess the overhead introduced by the use of grid technology.

As a first step in this study, we conducted a detailed analysis of the available OAI-PMH compliant DLs. Although several thousand OAI-PMH enabled DLs exist, only a limited number of these comply precisely with the standard and operate reliably without human intervention. The most common problem encountered is related to UTF-8 errors, which result in non-valid XML documents being returned by DLs. Other problems include improper date stamping, bad resumption tokens, etc. A report on problems with harvesting OAI-PMH repositories can be found in [29]. Due to these problems, only 56 reliable OAI-PMH compliant DLs were identified and used for the study.

Traditionally, speed-up is measured by varying the number of jobs, which must be of the same size (i.e. their execution time is the same on the same computing node). This, however, was not possible to achieve in our scenario. A number of factors may significantly influence the harvesting time, e.g. the number of records harvested, the network bandwidth, and/or the number of users that simultaneously harvest the DL. Adding a long-lasting harvesting job to a multi-job of several short-lasting harvesting jobs would significantly influence the speed-up measurements, so we tried to avoid such a situation. This implied that we had to categorize the harvesting jobs into sets, with the jobs belonging to one set being at least comparable in size.

Fig. 3. Dependence of the harvesting time on the number of harvested records.

The 56 selected DLs varied significantly with respect to the number of metadata records they contained: the smallest DL stored only 223 metadata records, while the largest stored 317884. It was confirmed experimentally that the harvesting time of a DL depends largely on the number of metadata records it contains (see Fig. 3). These results were obtained by taking into account 868 executions of the DL-Harvester; the Pearson R² value is 0.8547.

In our use case scenario, each job represents the full harvesting of one DL. The 56 DLs were divided into three groups according to the number of records they contained: 32 small (S), 19 medium-sized (M) and 5 large (L) DLs. In total, 11 experiments were scheduled (see Table 2). The jobs within a multi-job were comparable in size, and consequently it was possible to form multi-jobs of various sizes (large, medium, small). This experimental setting made it possible to investigate the influence of the size of the harvesting multi-job on the speed-up and grid overhead measures.

The experiments were executed in a real-world grid test bed that was used by several other users at the same time. Therefore, special care was taken not to execute the multi-jobs in a heavily over-loaded grid environment, which could influence the speed-up measurements. The maximum number of DLs that could be harvested in parallel on the grid was 32 (in experiment S-32), so we made sure that the number of unoccupied computational machines in the test bed was always higher than 32 at the time of execution.

In total, 95 multi-jobs were run. Each multi-job run time was compared with the time needed to run the jobs sequentially, and the speed-up value was calculated. Table 3 shows the average values of the measured parameters; each experiment (i.e. multi-job) was repeated 10 times in the case of the small and medium-sized DLs and 5 times in the case of the large DLs. In the case of the large libraries, experiment L-5 took almost 7 h on average, more precisely 24953 s.

Table 2
Experimental set-up

Experiment   No. of jobs in a multi-job (n)   No. of experiment repetitions (k)   Average no. of metadata records   DL size
S-1           1                                10                                   971                              Small
S-5           5                                10                                   901                              Small
S-10         10                                10                                   949                              Small
S-32         32                                10                                   952                              Small
M-1           1                                10                                  5241                              Medium
M-5           5                                10                                  7636                              Medium
M-10         10                                10                                  7566                              Medium
M-19         19                                10                                  8327                              Medium
L-1           1                                 5                                 31705                              Large
L-3           3                                 5                                140329                              Large
L-5           5                                 5                                154118                              Large

Table 3
Speed-up measurement results (times in seconds)

Experiment      A      B      Cmax       D         E         F      T    Ttheory
S-1          48.8   13.3      98.2    34.1     194.4      98.2    0.5    1.0
S-5          57.6   29.6     240.6    39.7     367.5     526.8    1.5    2.7
S-10         63.7   39.9     160.0    49.0     312.6     680.2    2.3    5.1
S-32         77.0   88.9     215.0    75.9     456.8    1980.8    4.3    9.6
M-1          53.7   15.4     625.1    35.6     729.8     625.1    0.9    1.0
M-5          66.4   43.8     646.0    54.4     810.6    2396.1    3.0    3.7
M-10         66.5   34.3     751.7    77.6     930.1    4401.6    4.8    5.9
M-19         74.0   55.4     937.3   125.0    1191.7    8370.1    7.0    8.9
L-1          65.8   12.4    3056.0    52.6    3186.8    3056.0    1.0    1.0
L-3          53.6   33.8   12535.0   195.8   12818.2   31646.6    2.5    2.5
L-5          66.2   52.4   24492.8   342.4   24953.8   59534.4    2.4    2.4
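The relationships defined in Table 1 can be checked directly against these rows. For instance, for experiment M-19 the measured values satisfy E = A + B + Cmax + D, T = F/E and Ttheory = F/Cmax to within rounding, as the following snippet (with the values copied from Table 3) confirms.

```python
# Check the Table 1 relationships against the M-19 row of Table 3
# (all values in seconds, copied from the table).
A, B, Cmax, D, F = 74.0, 55.4, 937.3, 125.0, 8370.1

E = A + B + Cmax + D          # multi-job run time
print(round(E, 1))            # -> 1191.7, matching Table 3
print(round(F / E, 1))        # actual speed-up T       -> 7.0
print(round(F / Cmax, 1))     # theoretical maximum     -> 8.9
```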

Fig. 4. Evaluation of the grid overhead with small, medium and large DLs.

In Fig. 4 it is possible to visually compare the grid-synchronization overhead (in percentages relative to the total run time of the multi-job), which increases with the number of jobs within a multi-job. Grid synchronization overhead occurs because of the suboptimal scheduling policy of the Resource Broker: under optimal conditions, all jobs would complete within the time interval of the longest job in a multi-job, but this was not the case in our experiments, as can be seen in Fig. 2. Also, as expected, the total grid overhead (including the stage-in, synchronization and stage-out overheads) is much smaller in the experiments conducted with medium and large DLs, while in the case of small DLs the total grid overhead is very large. For example, in experiment S-32, the total grid overhead represents approximately 80% of the total run time of the multi-job. Finally, Fig. 5 shows that the speed-up increases linearly with the growing number of harvested DLs. This increase is faster in

the case of the medium- and large-size DLs. The linear approximation formulae are significant in the case of the small and medium-size DLs and are as follows:

- for small DLs: T = 0.1142 n + 0.7704 (Pearson R² = 0.9617);
- for medium DLs: T = 0.3336 n + 0.9839 (Pearson R² = 0.9716);
- for large DLs: T = 0.3587 n + 0.8663 (Pearson R² = 0.7068).
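As a quick sanity check, the fitted lines for the small and medium groups can be evaluated at the largest measured multi-job sizes and compared with the measured speed-ups in Table 3 (the coefficients below are taken verbatim from the formulae above).

```python
# Predicted vs measured speed-up, using the fitted lines above.
fits = {"small": (0.1142, 0.7704), "medium": (0.3336, 0.9839)}

def predict(group, n):
    slope, intercept = fits[group]
    return slope * n + intercept

print(round(predict("small", 32), 1))   # -> 4.4 (measured: 4.3, Table 3)
print(round(predict("medium", 19), 1))  # -> 7.3 (measured: 7.0, Table 3)
```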
Fig. 5. Speed-up measurements for small, medium and large DLs.

6. Discussion and conclusions

In this paper, we presented a grid-based application that addresses the performance, scalability and reliability requirements of existing Federated Digital Library solutions. The provision of new, sophisticated, reliable, personalized FDL solutions requires an infrastructure and services capable of solving complex problems. This kind of computational, data and informational complexity


cannot be adequately addressed by pure Web service technology, as demonstrated by the number of related research projects combining DL and grid technology (see Section 2 for more details). To the best of our knowledge, these projects have not yet published results against which we could compare ours. Our results show that open, standard interfaces and WSRF-compliant services for grid computing may be used to address the investigated problems. At a technical level, we have achieved the execution of the DL-Harvester application in a geographically distributed environment, without prior installation, and have exploited the redundancy of computational servers in the grid environment in order to achieve application speed-up. Our DL-Harvester application is capable of performing selective harvesting according to dates; however, this feature was not used in the experiments. The obtained results are promising and indicate that a system like the one presented in this study may be useful for developing a production-level FDL.

The digital libraries differed largely in terms of their performance. Some of the parameters that influence the performance are the number of metadata records in the DL, the different DL software implementations, the network bandwidth, the number of concurrent DL users, etc. For these reasons, the DL performance may vary on an hourly basis. We were mostly interested in the variation of the speed-up and the related grid overhead with a growing number of libraries harvested in parallel. These variations were observed separately for the harvesting of libraries containing small, medium and large numbers of metadata records. The grid overhead was evaluated (1) by comparing the differences of the stage-in, grid-synchronization and stage-out times in the various experiments (the grid-synchronization overhead rate was lower in the case of the medium- and large-size DLs), and (2) by comparing the actually obtained speed-up results with the theoretical maximum achievable speed-up (the difference was small in the case of the medium-size libraries, and minimal in the case of the largest libraries; see Table 3). This implies that the use of grid technology is especially beneficial when the individual jobs of a grid multi-job harvest large numbers of metadata records; in this case the relative total grid overhead remains low.

The measurements show that the speed-up increases approximately linearly with a growing number of harvested DLs. This increase is faster if the grid jobs are large, since in this case the relative grid overhead is small. Another observation is that the difference between the theoretical and the actual speed-up increases

with the increasing number of jobs within a multi-job. The reason for this is the sub-optimal scheduling policy used by the Resource Broker (see the example of 5 DLs presented in Fig. 2). The greater the number of jobs that are part of the multi-job, the higher the possibility that some of these jobs will be sub-optimally scheduled and, consequently, the longer it will take to execute the multi-job. This could be improved, for example, by applying an advanced scheduling system using adaptive scheduling algorithms, such as the one described in [30].

Based on the study, it is possible to conclude that:

- Harvesting small digital libraries (from 200 to 2000 metadata records, i.e. harvesting times on the order of hundreds of seconds) is not a problem suitable for global grids;
- The use of grid environments is beneficial with larger libraries (2000 and more records and harvesting times of more than 10 min); the larger the DLs, the greater the benefits of using grid technology;
- In order to achieve good speed-up, the jobs within a multi-job should be comparable in size. In the case of the DL-Harvester, this can be achieved by partitioning the harvesting task of a large DL into several jobs by using selective harvesting;
- The grid overhead increases with the number of jobs. This may be reduced by improving the global scheduling policy used by the Resource Broker.

The implemented system prototype, which is based on the latest DataMiningGrid (released in March 2007) and GT4 technologies, is generic, as it also allows for the inclusion of arbitrary harvesting, indexing, ontology-learning and other applications in the grid environment. This, in turn, will allow service providers to build innovative, scalable, high-performance FDL applications that were impossible to imagine in the past. As the next research step, we are planning to set up a complete FDL service, in which the harvesting process will be followed by indexing and search phases.

We have shown that grid technology may be beneficial in the case of FDL applications, especially with the growing size and rapidly increasing number of such repositories. Improvements are still needed in the resource scheduling policies, which may significantly reduce the grid synchronization overhead. We identified two main reasons why the use of grid technology can improve future FDL services. The first is the use of distributed computational resources to speed up the processes of harvesting, indexing and other computationally intensive tasks, which allows for frequent harvesting and indexing and therefore keeps the FDL up to date. The second is the improved system reliability, since the grid system has no single point of failure.

Acknowledgement

This work has been conducted under the DataMiningGrid project (Data Mining Tools and Services for Grid Computing Environments), research grant EU IST-2004-004475.

References

[1] G. Aloisio, M. Cafaro, I. Epicoco, Early experiences with the GridFTP protocol using the GRB-GSIFTP library, Future Generation Computer Systems 18 (8) (2002) 1053–1059.
[2] D. Castelli, Digital libraries of the future, and the role of libraries, Library Hi Tech 24 (4) (2006) 496–503.
[3] K. Czajkowski, C. Kesselman, S. Fitzgerald, I. Foster, Grid information services for distributed resource sharing, in: Proc. 10th IEEE International Symposium on High-Performance Distributed Computing, 2001, p. 181.
[4] K. Czajkowski, D. Ferguson, I. Foster, J. Frey, S. Graham, D. Snelling, S. Tuecke, From open grid services infrastructure to web services resource framework: Refactoring and evolution, Retrieved April 07, 2008 from http://www.globus.org/wsrf/specs/ogsi_to_wsrf_1.0.pdf.
[5] DataMiningGrid (Data Mining Tools and Services for Grid Computing Environments) project, Retrieved April 07, 2008 from http://www.datamininggrid.org.
[6] DELOS (Digital Library Architectures: Peer-to-peer, grid, and service-orientation) network of excellence on digital libraries, Retrieved April 07, 2008 from http://www.delos.info/.
[7] M. Feller, I. Foster, S. Martin, GT4 GRAM: A functionality and performance study, Retrieved April 07, 2008 from http://www.globus.org/alliance/publications/papers/TG07-GRAM-comparison-final.pdf.
[8] I. Foster, C. Kesselman, The Grid 2: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, CA, USA, 2004.
[9] I. Foster, C. Kesselman, The Globus project: A status report, Future Generation Computer Systems 15 (5–6) (1999) 607–621.
[10] I. Foster, Globus Toolkit version 4: Software for service-oriented systems, in: IFIP Intl. Conf. on Network and Parallel Computing, in: Lecture Notes in Computer Science, vol. 3779, Springer, 2005, pp. 2–13.
[11] I. Foster, C. Kesselman, J. Nick, S. Tuecke, The physiology of the Grid: An open grid services architecture for distributed systems integration, Retrieved April 07, 2008 from http://www.globus.org/alliance/publications/papers/ogsa.pdf.
[12] G. Haya, F. Scholze, J. Vigen, Developing a grid-based search and categorization tool, High Energy Physics Libraries Webzine, Issue 8, October 2003.
[13] V. Kravtsov, T. Niessen, V. Stankovski, A. Schuster, Service-based resource brokering for grid-based data mining, in: Proc. of the 2006 International Conference on Grid Computing and Applications, 2006, pp. 163–169.
[14] R.R. Larson, R. Sanderson, Grid based digital libraries: Cheshire3 and distributed retrieval, in: Proc. Fifth ACM/IEEE Joint Conf. on Digital Libraries, Denver, CO, USA, 2005, pp. 112–113.
[15] R.K. Madduri, C.S. Hood, W.E. Allcock, Reliable file transfer in grid environments, in: Proceedings of the 27th Annual IEEE Conference on Local Computer Networks, 2002, pp. 737–738.
[16] K. Maly, M. Zubair, V. Chilukamarri, P. Kothari, GRID based federated digital library, in: Proc. of the 2nd Conference on Computing Frontiers, 2005, pp. 97–105.
[17] K. Maly, M. Zubair, X. Li, A high performance implementation of an OAI-based federation service, in: Proceedings of the 11th International Conference on Parallel and Distributed Systems, ICPADS'05, vol. 01, 2005, pp. 769–774.
[18] G.B. Newby, K. Gamiel, N. Nassar, Secure information sharing and information retrieval infrastructure with GridIR, in: First NSF/NIJ Symposium on Intelligence and Security Informatics, ISI, Tucson, AZ, USA, in: Lecture Notes in Computer Science, Springer, Berlin, 2003.
[19] The Open Archives Initiative, The Open Archives Initiative Protocol for Metadata Harvesting, Protocol Version 2.0 of 2002-06-14, Retrieved April 07, 2008 from http://www.openarchives.org/OAI/openarchivesprotocol.html.
[20] J.M. Schopf, L. Pearlman, N. Miller, C. Kesselman, I. Foster, M. D'Arcy, A. Chervenak, Monitoring the grid with the Globus Toolkit MDS4, in: Proc. of SciDAC 2006, Scientific Discovery Through Advanced Computing, 25–29 June 2006, Denver, Colorado, USA, Journal of Physics: Conference Series 46 (2006) 521–526.
[21] V. Stankovski, M. Swain, V. Kravtsov, T. Niessen, D. Wegener, J. Kindermann, W. Dubitzky, Grid-enabling data mining applications with DataMiningGrid: An architectural perspective, Future Generation Computer Systems 24 (4) (2008) 259–279.
[22] D. Thain, T. Tannenbaum, M. Livny, Distributed computing in practice: The Condor experience, Concurrency and Computation: Practice & Experience 17 (2–4) (2005) 323–356.
[23] J. Trnkoczy, Ž. Turk, V. Stankovski, A grid-based architecture for personalized federation of digital libraries, Library Collections, Acquisitions, and Technical Services 30 (3–4) (2006) 139–153.
[24] NISO standard: ANSI/NISO Z39.50, Information Retrieval: Application Service Definition & Protocol Specification, Retrieved April 07, 2008 from http://www.niso.org/kst/reports/standards/.
[25] NISO standard: ANSI/NISO Z39.85, The Dublin Core Metadata Element Set, Retrieved April 07, 2008 from http://www.niso.org/kst/reports/standards/.
[26] OAI Registered Data Providers, Retrieved February 21, 2008 from http://www.openarchives.org/Register/BrowseSites.
[27] OpenDOAR directory of academic open access repositories, Retrieved February 21, 2008 from http://www.opendoar.org/.
[28] Registry of Open Access Repositories (ROAR), Retrieved February 21, 2008 from http://roar.eprints.org/.
[29] C. Lagoze, D. Krafft, T. Cornwell, N. Dushay, D. Eckstrom, J. Saylor, Metadata aggregation and automated digital libraries: A retrospective on the NSDL experience, in: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, Chapel Hill, NC, USA, 2006, pp. 230–239.
[30] Yang Gao, Hongqiang Rong, Joshua Zhexue Huang, Adaptive grid job scheduling with genetic algorithms, Future Generation Computer Systems 21 (1) (2005) 151–161.

Jernej Trnkoczy studied telecommunications and was awarded his engineering degree in 2003 from the Faculty of Electrical Engineering, University of Ljubljana. He is employed as a researcher at the Laboratory for Digital Signal Processing (LDOS) and is also engaged in postgraduate study at the same Faculty. His research interests include distributed computing, grid and P2P technologies, and their applications in information retrieval systems. He has been involved in several grid and P2P related European projects, such as the EU IST DataMiningGrid project.

Vlado Stankovski was awarded his B.Sc. and M.Sc. degrees in computer science from the University of Ljubljana in 1995 and 2000, respectively. He began his career in 1995 as a consultant and later as a project manager with the Fujitsu-ICL Corporation in Prague. From 1998 to 2002 he worked as a researcher at the University Medical Centre in Ljubljana. Since 2003, he has been employed as a researcher at the Department of Civil Informatics at the Faculty of Civil and Geodetic Engineering. Recently, he was the technical manager of the EU IST DataMiningGrid project. He specializes in semantic grid technologies.
