
Semantic Map of Services for Structural Bioinformatics

Pierre Tufféry (1), Zoé Lacroix (2) and Hervé Ménager (2)

(1) Equipe de Bioinformatique Génomique et Moléculaire
Université Denis Diderot (Paris 7)
Tour 53-54, 1er étage, case 7113
2, place Jussieu, 75251 Paris Cedex 05, France
pierre.tuffery@ebgm.jussieu.fr

(2) Scientific Data Management Lab
Arizona State University
Tempe AZ 85287-5706, USA
zoe.lacroix@asu.edu and herve.menager@asu.edu

Abstract: We present a Semantic Map of resources for Structural Bioinformatics applied to
proteins, i.e., various methods to predict and analyze protein structures in silico. Our map
depicts resources on two levels: a logical level, which provides a high-level description of the
scientific concepts using a domain ontology; and a physical level, which describes the actual
resources implementing the connections between these concepts. Scientists can use our system
to express a query that captures their scientific aim, and are guided to identify the resources
best meeting their needs. The map is intended to provide scientists with a tool to register and
share knowledge about the available services in this field. Our approach addresses the problem
of semantic interoperability of scientific resources publicly available on the web.

Keywords: Structural bioinformatics, Ontologies, Semantic interoperability.

1 Introduction

Nowadays, most scientific digital resources, including databases and programs, are accessible through
the Internet. Indeed, in various domains, such as Bioinformatics, it is possible to rely on this medium
to create new experimental protocols in silico. However, the high degree of distribution and hetero-
geneity of these resources [8] raises interoperability challenges on two levels. On a syntactic level, the
same information can be represented and exchanged using many different formats and protocols, de-
pending on the resource. We also have to consider a semantic level, that describes what these different
resources achieve: what kind of scientific task do they process, or what kinds of scientific objects do
they manipulate. This position paper describes the Semantic Map of Services for Structural
Bioinformatics, a graphical interface to a database of resources for in silico protein structure
prediction and analysis. The description of the resources includes a semantic annotation that
characterizes the processed data using concepts from a domain-specific ontology. The aim of
this project is to facilitate the exploration of available resources in this
field, by presenting a conceptual representation of Structural Bioinformatics and a database of tools
and data sources.
Section 2 discusses the integration of resources in Structural Bioinformatics. Section 3 describes
examples of structural bioinformatics resources, explaining their purpose and offering a semantic
description of their usage. Section 4 depicts the general architecture of the system we propose to use in
order to build and publish this Semantic Map. Finally, Section 5 presents the structural bioinformatics
ontology that is used to describe the concepts and to explore our database, and investigates
the representation problems that have to be addressed to provide a usable graphic interface.

2 Integrating Structural Bioinformatics resources

The Structural Bioinformatics field covers the in silico prediction and analysis of biological molecular
structures, with the goal of understanding functional mechanisms at a molecular level. Thanks to the
significant effort of the scientific community, this field has evolved dramatically in recent years, in
particular for proteins. The techniques available to predict and analyze protein structures are
continuously improving, both in their focus and in their performance, while new algorithms are developed.
In such a context, scientists face the increasing difficulty of identifying and accessing existing tools
and of understanding how the results should be interpreted and how they contribute to scientific discovery.
Many recent efforts have addressed the interconnection of services. Projects such as BioMoby [19]
make it possible to type the inputs and outputs of a service according to an ontology of data
types. Applications such as Taverna [17] and GPipe [7] describe ways to obtain operational workflows.
While improving the interoperability of applications, these approaches offer only a partial solution
to the problem. The characterization of services they provide is typically syntactic and fails to
capture their semantics. In particular, they expect scientists to know a priori which resources they
want to use and to specify them, instead of, for instance, offering the resources suitable for each
task so that scientists can select the one that best meets their needs.
Hence, a higher-level description of the services should help scientists navigate the wide range
of services available. Preliminary steps in similar directions have been made by efforts such
as the BioMetaDataBase, a database storing metadata that describes biological resources exploited
by BioNavigation [14], a graphical interface that lets users explore possible paths between these
resources to answer biological queries. From another perspective, ontologies are increasingly used in
Life Sciences to describe knowledge. For instance, the Gene Ontology [?] is an ontology specialized
in the field of Genetics and Molecular Biology, that provides a shared vocabulary, mainly aimed at
genes and gene products annotation. Our project exploits an ontology to describe the resources made
available to the scientists, so that scientists express a query that captures their scientific aim, and are
guided by the system to identify the resources best meeting their needs.
We propose a semantic map of services for structural bioinformatics, applied to pro-
teins. The goal of this map is to provide scientists support for the following:

– A semantic description of each service that captures an abstraction of the service rather than
a low level syntactic description. This description is expressed in terms of an ontology of
the services, linking items of the structural bioinformatics concepts ontology.
– Service identification, detailing for each their purpose, the type of the data on which they are
effective, and the type of result they provide.
– Exploration of available services, navigating through the graph composed of the possible inter-
connections between the services.

This map describes resources on two levels: a logical level that provides a high-level descrip-
tion of the scientific concepts connected by the resources; a physical level, that describes the actual
resources implementing these connections. It is intended to provide scientists with a tool to register and
share knowledge about the available services in this field.

3 Semantic representation

Structural bioinformatics tools are complex resources that describe in silico experiments. The results
of these experiments are used to infer conclusions related to a biological reality. For instance, it is, in
some cases, possible to deduce the tertiary structure of a protein by comparing its amino acid sequence
(or parts of it) to sequences for which the structure is solved. Such sequences can be extracted from
the collection of protein structures, stored in data sources such as the Protein Data Bank [4]. Then, the
protein of known structure can be used as a template to model the structure of the query protein. All
these steps involve different resources and different methods. In this paper, we illustrate the approach
presented for applications such as sequence or structure similarity search tools, or tools related to
small compounds (drugs).

3.1 Similarity search tools

A similarity search algorithm for amino acid sequences retrieves from a database the known
sequences that are most similar to a new sequence. These tools use pairwise alignment
algorithms, which compare sequences pair by pair. Local alignment algorithms identify
similarities between regions of the compared sequences, while global alignment algorithms
systematically compare the sequences over their whole length.
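The distinction between local and global alignment can be illustrated with a minimal dynamic-programming sketch. It is not the heuristic used by production search tools; the scoring values (match, mismatch, gap) and the function names are assumptions chosen for illustration only.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best score over all pairs of regions of a and b (scores never drop below 0)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best alignment covering both sequences end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Every residue must be aligned or paid for with a gap.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]
```

With these assumed scores, a short motif embedded in a longer sequence keeps its full local score, while the global score is reduced by the gaps needed to cover the flanking residues.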

BLAST [3] and Automat [5] are two different tools implementing local alignment algorithms.
These methods have to be provided with:

– an input amino acid sequence (nucleic acid is also possible of course, but we focus here on pro-
teins);
– a choice of database that will be searched;
– a set of arguments used to parameterize the algorithm, for instance the filtering of its results.

The returned result is a list of sequences ordered and filtered with respect to their similarity with the
input sequence.
If we consider such algorithms from a purely conceptual point of view and for proteins only, they
capture a similarity relationship between the amino acid sequences that define proteins. In Figure 1,
the physical representation of the tool (e.g., BLAST) shown at the bottom, is mapped to the conceptual
relationship isSimilarTo defined between two instances of the scientific object AA Sequence.
In this example, we define an amino acid sequence as one of the characteristic properties of a Pro-
tein (a Protein hasSequence Amino Acid Sequence). The BLAST tool takes an amino acid sequence
as an input, and returns a collection of amino acid sequences, ordered with respect to their similarity
to the input sequence.

The graph shown in Figure 1 illustrates how the physical tool may be represented as a conceptual
relationship between scientific objects in a semantic map. This mapping allows a semantic characteri-
zation of the tool. A scientist looking for a tool that from an input sequence retrieves similar sequences
will be offered all tools available mapped to the relationship in the semantic map, thus BLAST and
all its variants, and Automat.
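The lookup described above can be sketched as follows. This is a minimal illustration, not the system's actual data model: the class names, the registry, and in particular the mapping of PDB to hasSequence are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """Logical level: a named, directed link between two scientific objects."""
    name: str
    source: str
    target: str


@dataclass(frozen=True)
class Service:
    """Physical level: a tool or data source mapped to one relationship."""
    name: str
    implements: Relationship


IS_SIMILAR_TO = Relationship("isSimilarTo", "AA Sequence", "AA Sequence")
HAS_SEQUENCE = Relationship("hasSequence", "Protein", "AA Sequence")

# Hypothetical registry; the PDB mapping is assumed for illustration only.
REGISTRY = [
    Service("BLAST", IS_SIMILAR_TO),
    Service("Automat", IS_SIMILAR_TO),
    Service("PDB", HAS_SEQUENCE),
]


def services_for(registry, name, source, target):
    """Return the names of all services mapped to the requested relationship."""
    wanted = Relationship(name, source, target)
    return sorted(s.name for s in registry if s.implements == wanted)
```

Querying the registry for isSimilarTo between two AA Sequence objects returns both BLAST and Automat, which is the behaviour expected from the map: the scientist names the relationship, not the tool.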
Figure 2 represents the BLAST tool, and its physical level inputs and outputs, including the data
from the PDB resource.

Figure 1. Mapping of PDB and BLAST

Figure 2. BLAST-PDB mapping with input and output

This type of analysis can be easily transposed to 3D structure similarity search. Here, the concepts
are similar, but the input data (3D structure instead of sequence) and the methods (Yakusa [6]
or SASearch [11] instead of BLAST) change.

3.2 Small compounds tools

The 3D structure of proteins is a fundamental aspect of their functionality. For instance, the existence
of an activation site on a protein allows other molecules to bind, triggering a reaction from
the protein, and defining its biological function.
Small molecules can bind on this activation site, triggering or blocking these functions. The de-
termination of the molecules that could bind to a protein is a challenging problem, that can be tackled
in many ways. One such possible way is to screen banks of compounds for parts of protein structures
that are likely to be interaction sites (such regions are usually named “pockets”). To do so, one needs
the 3D structure of the protein, the 3D structure of the banks of compounds, a method to dock each
compound in the pocket and to score the interaction between the compound and the pocket. Finally,
chemical companies usually deliver catalogs of their compounds in the form of 1D or 2D represen-
tation. Hence, it is first necessary to generate 3D structures for the series of compounds. Ultimately,
one would also consider that each compound can adopt several conformations of low energy.

Figure 3. Mapping of small compounds related resources

A possible way of conceptualizing this field is shown in Figure 3. Here the concepts are:

– protein structure
– drug
– interaction pocket

The tools and databanks include drug databanks such as the NCI Open Database compounds [1],
1D to 3D conversion tools for drugs (e.g., OMEGA [2]), pocket identification from protein structure
(e.g., PASS [12]), and compound screening by docking into the pocket (e.g., DOCK [9]). The input
data consists of catalogs of small compounds and a protein structure. Search tools can operate on
databanks of small molecules to determine good binding candidates.
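The chaining of these resources can be sketched as a plan whose steps are checked against the data available at each point. The tool roles follow the titles of references [9] and [12] (DOCK for docking, PASS for pocket detection); the concept labels, the Step class and check_plan are hypothetical names introduced for illustration, and no tool is actually invoked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    needs: frozenset   # concepts required as input
    produces: str      # concept made available by the step


# Illustrative plan; the concept labels are assumptions, not ontology terms.
PLAN = [
    Step("OMEGA", frozenset({"Compound catalog (1D/2D)"}), "Compound 3D structures"),
    Step("PASS", frozenset({"Protein structure"}), "Interaction pocket"),
    Step("DOCK", frozenset({"Compound 3D structures", "Interaction pocket"}),
         "Scored candidates"),
]


def check_plan(initial_data, plan):
    """Return all concepts available after the plan, or raise if a step lacks an input."""
    available = set(initial_data)
    for step in plan:
        missing = step.needs - available
        if missing:
            raise ValueError(f"{step.tool} is missing inputs: {sorted(missing)}")
        available.add(step.produces)
    return available
```

Starting from a compound catalog and a protein structure, the plan is consistent and ends with scored candidates; removing the catalog makes the first step fail, which is the kind of error a semantic description lets the system report before any tool is run.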

4 Description of the system

Such a project represents a challenge, as it is based on a conceptual representation of the field itself.
To achieve a good description of the services it is fundamental that we describe the domain as
accurately as possible. Therefore, we plan to develop it using a highly collaborative system that lets
users of structural bioinformatics tools register the related services themselves. The ontology
we currently use resulted from an initial seminar, followed by multiple interviews, to gather and
structure domain-specific knowledge. However, it cannot yet be considered definitive, and users
are expected to edit it, enriching or correcting it according to the requirements of the
services they describe.
The database is divided into two parts: a set of resource descriptions, stored as XML files, and
an ontology, stored in OWL [15]. The description of a resource includes the semantic type of its
parameters, represented by links to the ontology.
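As an illustration of such a description, the sketch below parses a small XML document whose parameters point to ontology terms. The element and attribute names and the example.org namespace are assumptions; the actual schema used by the Semantic Map is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource description; element names and URIs are illustrative.
RESOURCE_XML = """
<resource name="BLAST">
  <input concept="http://example.org/sb-ontology#AASequence"/>
  <output concept="http://example.org/sb-ontology#AASequence"/>
  <relationship concept="http://example.org/sb-ontology#isSimilarTo"/>
</resource>
"""


def parse_resource(xml_text):
    """Extract the resource name and the ontology terms its parameters link to."""
    root = ET.fromstring(xml_text.strip())

    def terms(tag):
        # Keep only the fragment after '#', i.e. the local name of the ontology term.
        return [el.get("concept").split("#", 1)[1] for el in root.findall(tag)]

    return {
        "name": root.get("name"),
        "inputs": terms("input"),
        "outputs": terms("output"),
        "relationships": terms("relationship"),
    }
```

Keeping only URI references in the XML file leaves the definition of each concept in the OWL ontology, so that the two parts of the database can evolve separately.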
This two-level description can be visualized in a graphical interface. Using it, scientists
can both browse the resources and create query plans (or workflows) that utilize several different
pathways.

4.1 Services registration and Ontology Modifications

Both the registration of the services and their semantic description are accessible through the web,
so that an increasingly useful ontology of the services, as well as a map of the existing services,
can be built collaboratively. Using this interface, participating scientists can submit the
registration of services. They can also post modification requests for the ontology, so that it
reflects as accurately as possible the domain and the way they perceive it.
Pollution of the service registry, as well as potential consistency issues that may arise from
uncontrolled modifications of the ontology, are avoided by a moderation mechanism: any
modification of the ontology has to be validated by a committee, which ensures that the
consistency of the ontology and of its links with the declared services is maintained.
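Part of this validation could be assisted by an automatic consistency check. The sketch below is one possible form of such a check, under the assumption that each service records the set of concepts it links to; the class and method names are hypothetical, and the decision itself remains with the committee.

```python
class ModeratedOntology:
    """Toy model: concept removals are queued and applied only if no service uses them."""

    def __init__(self, concepts, service_links):
        self.concepts = set(concepts)
        # service name -> set of concepts referenced by its description
        self.service_links = {name: set(links) for name, links in service_links.items()}
        self.pending = []

    def request_removal(self, concept):
        """A user posts a modification request; nothing changes yet."""
        self.pending.append(concept)

    def validate_next(self):
        """Process the oldest request; reject it if a declared service still uses the concept."""
        concept = self.pending.pop(0)
        users = sorted(name for name, links in self.service_links.items()
                       if concept in links)
        if users:
            return False, users
        self.concepts.discard(concept)
        return True, []
```

A rejected request returns the list of services that would be left with a dangling link, which gives the moderators a concrete reason to report back to the requester.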

4.2 Navigation interface

The interface used to visualize the available services and their position in the ontology that
expresses their meaning is an improved version of the BioNavigation tool [14] (displayed in Figure 4).

Figure 4. BioNavigation Interface

The ontology is composed of scientific objects, such as proteins, nucleic acids, or citations. These
scientific objects are related by directed edges, mapped to actual services or resources. The graph is
divided into two parts: a subgraph of concepts (logical graph), and a subgraph of services (physical
graph). Scientists can explore this graph: they can navigate through its concepts and the
links between them, as well as the resources that implement them, and the mapping between the two
subgraphs. They can also view the properties (metadata) of the services (see Figure 5). These
properties can include, for instance, the location of a service or the authority that maintains it.
The user can query the graph using scientific concepts to retrieve possible paths between them.
Such queries are formulated as regular expressions that are given as an input to ESearch, an algo-
rithm that explores the paths between the different concepts of the graph and sorts them according to
different criteria related to the quality of the data or to the cost of the path.
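A minimal sketch of this kind of query on a toy graph is given below. It is not the ESearch algorithm, whose details are not described here: the edge list, the service name "PDB lookup", the cost values and the find_paths function are all hypothetical, and the regular expression is simply matched against the space-separated sequence of concepts visited.

```python
import re

# (source concept, target concept, service, cost); values are illustrative only.
EDGES = [
    ("Protein", "AASequence", "hasSequence", 1.0),
    ("AASequence", "AASequence", "BLAST", 2.0),
    ("AASequence", "Structure", "PDB lookup", 1.5),
    ("Protein", "Structure", "direct lookup", 5.0),
]


def find_paths(edges, pattern, start, max_edges=3):
    """Enumerate paths from start whose concept sequence matches pattern, cheapest first."""
    regex = re.compile(pattern)
    results = []

    def walk(node, concepts, services, cost):
        if services and regex.fullmatch(" ".join(concepts)):
            results.append((cost, services))
        if len(services) == max_edges:
            return  # bound the search so that cycles such as BLAST loops terminate
        for source, target, service, edge_cost in edges:
            if source == node:
                walk(target, concepts + [target], services + [service], cost + edge_cost)

    walk(start, [start], [], 0.0)
    return sorted(results)
```

For instance, the pattern "Protein( AASequence)+ Structure" asks for paths from a protein to a structure that go through at least one amino acid sequence; the direct edge is excluded, and the remaining paths are ranked by their accumulated cost.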
Figure 5. Properties of a physical node on the BioNavigation interface

5 The Structural Bioinformatics ontology - Graph representation issues

The upper part of the Semantic Map is an ontology schema that describes the field of structural
bioinformatics. This schema describes the different scientific objects and concepts that constitute this
field (e.g., protein or ligand), as well as their relationships (e.g., isPartOf, isA or translatesTo).
Figure 6 shows an overview of the part of the current ontology that describes the concepts related
to protein structures. This ontology was defined after a round of discussions with scientists
from different teams at RPBS, so as to provide a starting point. It is intended to evolve according to
the proposals of Semantic Map users. Even so, this graph gives an idea of the complexity of the ontology.
Several representation issues have to be addressed in order to provide a usable interface:

– The ontology includes many different types of relationships, that have different meanings: inher-
itance properties (i.e. isA), composition properties (i.e. isPartOf), etc.
– Similarly, this ontology is constituted by two kinds of concepts, some of them being biochemical
concepts (e.g., proteins or peptides), and some of them being structural bioinformatics concepts
(e.g., conformation or surface).
– Two concepts can be linked by multiple relationships, (e.g., a protein promotes or represses the
expression of a gene).
– A relationship can be recursive, i.e. its domain and range include the same concept (e.g., a simi-
larity relationship between two folds).
– The large size of the resulting graph, which can hinder its navigability.
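Two of the issues listed above, multiple relationships between the same pair of concepts and recursive relationships, can be detected mechanically once the ontology is stored as a list of triples. The sketch below assumes such a representation; the triples and function names are illustrative and are not taken from the actual ontology.

```python
from collections import defaultdict

# (subject concept, relationship, object concept); illustrative triples only.
TRIPLES = [
    ("Protein", "promotes", "Gene"),
    ("Protein", "represses", "Gene"),
    ("Fold", "isSimilarTo", "Fold"),
    ("Domain", "isPartOf", "Protein"),
]


def parallel_relationships(triples):
    """Group relationship names by concept pair and keep pairs linked more than once."""
    by_pair = defaultdict(list)
    for subject, relation, obj in triples:
        by_pair[(subject, obj)].append(relation)
    return {pair: names for pair, names in by_pair.items() if len(names) > 1}


def recursive_relationships(triples):
    """Return relationships whose domain and range are the same concept."""
    return [relation for subject, relation, obj in triples if subject == obj]
```

An interface could use these two lists to decide where edges need to be drawn as separate curves or as loops, rather than being merged into a single line.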

The overall size and complexity of such a graph call for a powerful and customizable
interface. Ontology authoring tools such as Protégé [13] offer great usability and extensibility
to design and represent ontologies. However, they cannot represent graphs such as the Semantic
Map, which displays both the concepts that represent the domain of knowledge and the various
resources that implement it. Furthermore, their complexity is ill-suited to our potential users.
It is fundamental that the graphic representation is simple enough to be meaningful to the user.
We plan to use the ZVTM API [18] to implement our improved interface, available as a Java Applet
on the Semantic Map website. This interface will provide users with functionalities to assist navi-
gation such as zooms and radar views and let them identify the different types of relationships and
concepts, as well as the resources that “manipulate” them, using distinct display formats and specify-
ing whether they are displayed or not in the graph. The layout of the represented graph is computed
by the Graphviz tool [10] on the server.
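The input handed to Graphviz could be produced along the following lines. This is a sketch under assumptions: the to_dot function, the choice of line styles, and the encoding of a service as a box node with a dotted edge to the concept it operates on are illustrative, not the project's actual export format. Only standard DOT attributes (label, style, shape) are used.

```python
def to_dot(concept_edges, service_links):
    """Build a DOT description of a small two-level graph.

    concept_edges: (source concept, relationship, target concept) triples.
    service_links: (service, concept) pairs for the physical level.
    """
    lines = ["digraph SemanticMap {"]
    for source, relation, target in concept_edges:
        # Distinct display formats per relationship type: inheritance is dashed.
        style = "dashed" if relation == "isA" else "solid"
        lines.append(f'  "{source}" -> "{target}" [label="{relation}", style={style}];')
    for service, concept in service_links:
        # Services are drawn as boxes and tied to a concept by a dotted edge.
        lines.append(f'  "{service}" [shape=box];')
        lines.append(f'  "{service}" -> "{concept}" [style=dotted];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be written to a file and laid out with the Graphviz dot program on the server, with the computed positions then passed to the client interface.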
Figure 6. Structural Bioinformatics Ontology - Structure Portion

6 Conclusion

In this paper, we presented a project focusing on the semantic integration of structural bioinformatics
related resources. This system organizes the registration and the description of structural bioinformat-
ics related resources in a database. Our database is specifically designed to facilitate their exploration,
based on a semantic level. This system is accessed using an interface that represents the two levels of
the network of resources, logical and physical, in a single graph, thus facilitating knowledge sharing
between scientists in this domain. It should be noted that, even though our approach is currently
used to describe and explore structural bioinformatics resources, it is not limited to this field. In the
future, we plan to integrate this platform with an execution engine such as the SemanticBio workflow
engine [16], to provide a system that lets users explore resources, and connect them to design and
execute digital scientific protocols.
The material related to this project is available at http://bioserv.rpbs.jussieu.fr/SemanticMap/

Acknowledgment

This research was partially supported by the National Science Foundation (grants IIS 0223042 and
IIS 0222847) and the Conservatoire National des Arts et Métiers. The authors wish to thank many
people for useful discussions about this project, in particular J. Chomilier, J. Pothier, B. Villoutreix,
J.-F. Zagury and the members of Equipe de Bioinformatique Génomique et Moléculaire (EBGM),
Paris, France.

References
[1] Downloadable structure files of NCI Open Database compounds.
http://cactus.nci.nih.gov/ncidb2/download.html.
[2] OpenEye Scientific Software website. http://www.eyesopen.com.
[3] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool.
Journal of Molecular Biology, 215(5):403–410, 1990.
[4] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E.
Bourne. The protein data bank. Nucleic Acids Research, 28:235–242, 2000.
[5] H. Cantalloube, C. Nahum, A. Achour, T. Lehner, Isabelle Callebaut, Arsène Burny, B. Bizzini, Jean Paul
Mornon, D. Zagury, and J. F. Zagury. Automat: a novel software system for the systematic search for
protein (or DNA) similarities with a notable application to autoimmune diseases and AIDS. Computer
Applications in the Biosciences, 10(2):153–161, 1994.
[6] M. Carpentier, S. Brouillet, and J. Pothier. Yakusa: a fast structural database scanning method. Proteins,
61:137–151, 2005.
[7] Alexander Garcia Castro, Samuel Thoraval, Leyla J Garcia, and Mark A Ragan. Workflows in bioin-
formatics: meta-analysis and prototype implementation of a workflow generator. BMC Bioinformatics,
6(87), April 2005.
[8] Su Yun Chung and John C. Wooley. Challenges Faced in the Integration of Biological Information,
volume 1, chapter 2, pages 11–34. Morgan Kaufmann Publishing, 2003.
[9] Todd J. A. Ewing, Shingo Makino, A. Geoffrey Skillman, and Irwin D. Kuntz. DOCK 4.0: Search
strategies for automated molecular docking of flexible molecule databases. Journal of Computer-Aided
Molecular Design, 15(5):411–428, 2001.
[10] Emden R. Gansner and Stephen C. North. An open graph visualization system and its applications to
software engineering. Software — Practice and Experience, 30(11):1203–1233, 2000.
[11] F. Guyon, A.-C. Camproux, J. Hochez, and P. Tuffery. Sa-search: a web tool for protein structure mining
based on a structural alphabet. Nucleic Acids Research, 32:W545–W548, 2004.
[12] G. Patrick Brady Jr. and Pieter F. W. Stouten. Fast prediction and visualization of protein binding pockets
with PASS. Journal of Computer-Aided Molecular Design, 14(4):383–401, 2000.
[13] Holger Knublauch. An AI tool for the real world - Knowledge modeling with Protégé, June 2003.
[14] Zoé Lacroix, Kaushal Parekh, Maria-Esther Vidal, Marelis Cardenas, and Natalia Marquez. BioNavi-
gation: Selecting Optimum Paths Through Biological Resources to Evaluate Ontological Navigational
Queries. In Bertram Ludäscher and Louiqa Raschid, editors, DILS, volume 3615 of Lecture Notes in
Computer Science, pages 275–283. Springer, 2005.
[15] Deborah L. McGuinness and Frank van Harmelen. OWL Web Ontology Language Overview. W3C
Recommendation, February 2004. http://www.w3.org/TR/owl-features/.
[16] Hervé Ménager and Zoé Lacroix. A workflow engine for the execution of scientific protocols. In ICDE
Workshops, 2006. Accepted for the IEEE Workshop on Workflow and Data Flow for Scientific Applica-
tions (SciFlow 2006).
[17] Thomas M. Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, R. Mark Greenwood, Tim
Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat, and Peter Li. Taverna: a tool for the composition
and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.
[18] Emmanuel Pietriga. A toolkit for addressing hci issues in visual language environments. In IEEE Sym-
posium on Visual Languages and Human-Centric Computing, pages 145–152, September 2005.
[19] MD Wilkinson and M. Links. BioMOBY: an open-source biological web services proposal. Briefings in
Bioinformatics, 3(4):331–341, December 2002.
