
Bioinformatics

The Morgan Kaufmann Series in Multimedia Information and Systems
Series Editor: Edward A. Fox, Virginia Polytechnic University

Bioinformatics: Managing Scientific Data
Zoé Lacroix and Terence Critchlow

How to Build a Digital Library
Ian H. Witten and David Bainbridge

Digital Watermarking
Ingemar J. Cox, Matthew L. Miller, and Jeffrey A. Bloom

Readings in Multimedia Computing and Networking
Edited by Kevin Jeffay and HongJiang Zhang

Introduction to Data Compression, Second Edition
Khalid Sayood

Multimedia Servers: Applications, Environments, and Design
Dinkar Sitaram and Asit Dan

Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition
Ian H. Witten, Alistair Moffat, and Timothy C. Bell

Digital Compression for Multimedia: Principles and Standards
Jerry D. Gibson, Toby Berger, Tom Lookabaugh, Dave Lindbergh, and Richard L. Baker

Practical Digital Libraries: Books, Bytes, and Bucks
Michael Lesk

Readings in Information Retrieval
Edited by Karen Sparck Jones and Peter Willett

Bioinformatics Managing Scientific Data

Edited by

Zoé Lacroix
Arizona State University
Tempe, Arizona

and

Terence Critchlow
Lawrence Livermore National Laboratory
Livermore, California

With 34 Contributing Authors

MORGAN KAUFMANN PUBLISHERS
AN IMPRINT OF ELSEVIER SCIENCE

SAN FRANCISCO  LONDON  BOSTON  SAN DIEGO  TOKYO  NEW YORK  SYDNEY

Acquisitions Editor: Rick Adams
Developmental Editor: Karyn Johnson
Publishing Services Manager: Simon Crump
Project Manager: Jodie Allen
Designer: Eric Decicco
Production Services: Graphic World Publishing Services
Composition: International Typesetting and Composition
Illustration: Graphic World Illustration Studio
Printer: The Maple-Vail Book Manufacturing Group
Cover Printer: Phoenix

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Morgan Kaufmann Publishers
An imprint of Elsevier Science
340 Pine Street, Sixth Floor
San Francisco, CA 94104-3205
www.mkp.com

© 2003 by Elsevier Science (USA)
All rights reserved
Printed in the United States of America

07 06 05 04 03     5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, or otherwise) without the prior written permission of the publisher.

Library of Congress Cataloging-in-Publication Data

Bioinformatics: managing scientific data / edited by Zoé Lacroix and Terence Critchlow.
p. cm. -- (Morgan Kaufmann series in multimedia information and systems)
Includes bibliographical references and index.
ISBN 1-55860-829-X (pbk. : alk. paper)
1. Bioinformatics. I. Lacroix, Zoé. II. Critchlow, Terence. III. Series.
QH324.2.B55 2003
570'.285--dc21
2003044603

Library of Congress Control Number: 2003044603
ISBN: 1-55860-829-X

This book is printed on acid-free paper.


Contents

Preface

1 Introduction
Zoé Lacroix and Terence Critchlow

1.1 Overview
1.2 Problem and Scope
1.3 Biological Data Integration
1.4 Developing a Biological Data Integration System
    1.4.1 Specifications
    1.4.2 Translating Specifications into a Technical Approach
    1.4.3 Development Process
    1.4.4 Evaluation of the System
References

2 Challenges Faced in the Integration of Biological Information
Su Yun Chung and John C. Wooley

2.1 The Life Science Discovery Process
2.2 An Information Integration Environment for Life Science Discovery
2.3 The Nature of Biological Data
    2.3.1 Diversity
    2.3.2 Variability
2.4 Data Sources in Life Science
    2.4.1 Biological Databases Are Autonomous
    2.4.2 Biological Databases Are Heterogeneous in Data Formats
    2.4.3 Biological Data Sources Are Dynamic
    2.4.4 Computational Analysis Tools Require Specific Input/Output Formats and Broad Domain Knowledge
2.5 Challenges in Information Integration
    2.5.1 Data Integration
    2.5.2 Meta-Data Specification
    2.5.3 Data Provenance and Data Accuracy
    2.5.4 Ontology
    2.5.5 Web Presentations
Conclusion
References

3 A Practitioner's Guide to Data Management and Data Integration in Bioinformatics
Barbara A. Eckman

3.1 Introduction
3.2 Data Management in Bioinformatics
    3.2.1 Data Management Basics
    3.2.2 Two Popular Data Management Strategies and Their Limitations
    3.2.3 Traditional Database Management
3.3 Dimensions Describing the Space of Integration Solutions
    3.3.1 A Motivating Use Case for Integration
    3.3.2 Browsing vs. Querying
    3.3.3 Syntactic vs. Semantic Integration
    3.3.4 Warehouse vs. Federation
    3.3.5 Declarative vs. Procedural Access
    3.3.6 Generic vs. Hard-Coded
    3.3.7 Relational vs. Non-Relational Data Model
3.4 Use Cases of Integration Solutions
    3.4.1 Browsing-Driven Solutions
    3.4.2 Data Warehousing Solutions
    3.4.3 Federated Database Systems Approach
    3.4.4 Semantic Data Integration
3.5 Strengths and Weaknesses of the Various Approaches to Integration
    3.5.1 Browsing and Querying: Strengths and Weaknesses
    3.5.2 Warehousing and Federation: Strengths and Weaknesses
    3.5.3 Procedural Code and Declarative Query Language: Strengths and Weaknesses
    3.5.4 Generic and Hard-Coded Approaches: Strengths and Weaknesses
    3.5.5 Relational and Non-Relational Data Models: Strengths and Weaknesses
    3.5.6 Conclusion: A Hybrid Approach to Integration Is Ideal
3.6 Tough Problems in Bioinformatics Integration
    3.6.1 Semantic Query Planning Over Web Data Sources
    3.6.2 Schema Management
3.7 Summary
Acknowledgments
References

4 Issues to Address While Designing a Biological Information System
Zoé Lacroix

4.1 Legacy
    4.1.1 Biological Data
    4.1.2 Biological Tools and Workflows
4.2 A Domain in Constant Evolution
    4.2.1 Traditional Database Management and Changes
    4.2.2 Data Fusion
    4.2.3 Fully Structured vs. Semi-Structured
    4.2.4 Scientific Object Identity
    4.2.5 Concepts and Ontologies
4.3 Biological Queries
    4.3.1 Searching and Mining
    4.3.2 Browsing
    4.3.3 Semantics of Queries
    4.3.4 Tool-Driven vs. Data-Driven Integration
4.4 Query Processing
    4.4.1 Biological Resources
    4.4.2 Query Planning
    4.4.3 Query Optimization
4.5 Visualization
    4.5.1 Multimedia Data
    4.5.2 Browsing Scientific Objects
4.6 Conclusion
Acknowledgments
References

5 SRS: An Integration Platform for Databanks and Analysis Tools in Bioinformatics
Thure Etzold, Howard Harris, and Simon Beaulah

5.1 Integrating Flat File Databanks
    5.1.1 The SRS Token Server
    5.1.2 Subentry Libraries
5.2 Integration of XML Databases
    5.2.1 What Makes XML Unique?
    5.2.2 How Are XML Databanks Integrated into SRS?
    5.2.3 Overview of XML Support Features
    5.2.4 How Does SRS Meet the Challenges of XML?
5.3 Integrating Relational Databases
    5.3.1 Whole Schema Integration
    5.3.2 Capturing the Relational Schema
    5.3.3 Selecting a Hub Table
    5.3.4 Generation of SQL
    5.3.5 Restricting Access to Parts of the Schema
    5.3.6 Query Performance to Relational Databases
    5.3.7 Viewing Entries from a Relational Databank
    5.3.8 Summary
5.4 The SRS Query Language
    5.4.1 SRS Fields
5.5 Linking Databanks
    5.5.1 Constructing Links
    5.5.2 The Link Operators
5.6 The Object Loader
    5.6.1 Creating Complex and Nested Objects
    5.6.2 Support for Loading from XML Databanks
    5.6.3 Using Links to Create Composite Structures
    5.6.4 Exporting Objects to XML
5.7 Scientific Analysis Tools
    5.7.1 Processing of Input and Output
    5.7.2 Batch Queues
5.8 Interfaces to SRS
    5.8.1 The Web Interface
    5.8.2 SRS Objects
    5.8.3 SOAP and Web Services
5.9 Automated Server Maintenance with SRS Prisma
5.10 Conclusion
References

6 The Kleisli Query System as a Backbone for Bioinformatics Data Integration and Analysis
Jing Chen, Su Yun Chung, and Limsoon Wong

6.1 Motivating Example
6.2 Approach
6.3 Data Model and Representation
6.4 Query Capability
6.5 Warehousing Capability
6.6 Data Sources
6.7 Optimizations
    6.7.1 Monadic Optimizations
    6.7.2 Context-Sensitive Optimizations
    6.7.3 Relational Optimizations
6.8 User Interfaces
    6.8.1 Programming Language Interface
    6.8.2 Graphical Interface
6.9 Other Data Integration Technologies
    6.9.1 SRS
    6.9.2 DiscoveryLink
    6.9.3 Object-Protocol Model (OPM)
6.10 Conclusions
References

7 Complex Query Formulation Over Diverse Information Sources in TAMBIS
Robert Stevens, Carole Goble, Norman W. Paton, Sean Bechhofer, Gary Ng, Patricia Baker, and Andy Brass

7.1 The Ontology
7.2 The User Interface
    7.2.1 Exploring the Ontology
    7.2.2 Constructing Queries
    7.2.3 The Role of Reasoning in Query Formulation
7.3 The Query Processor
    7.3.1 The Sources and Services Model
    7.3.2 The Query Planner
    7.3.3 The Wrappers
7.4 Related Work
    7.4.1 Information Integration in Bioinformatics
    7.4.2 Knowledge Based Information Integration
    7.4.3 Biological Ontologies
7.5 Current and Future Developments in TAMBIS
    7.5.1 Summary
Acknowledgments
References

8 The Information Integration System K2
Val Tannen, Susan B. Davidson, and Scott Harker

8.1 Approach
8.2 Data Model and Languages
8.3 An Example
8.4 Internal Language
8.5 Data Sources
8.6 Query Optimization
8.7 User Interfaces
8.8 Scalability
8.9 Impact
8.10 Summary
Acknowledgments
References

9 P/FDM Mediator for a Bioinformatics Database Federation
Graham J. L. Kemp and Peter M. D. Gray

9.1 Approach
    9.1.1 Alternative Architectures for Integrating Databases
    9.1.2 The Functional Data Model
    9.1.3 Schemas in the Federation
    9.1.4 Mediator Architecture
    9.1.5 Example
    9.1.6 Query Capabilities
    9.1.7 Data Sources
9.2 Analysis
    9.2.1 Optimization
    9.2.2 User Interfaces
    9.2.3 Scalability
9.3 Conclusions
Acknowledgment
References

10 Integration Challenges in Gene Expression Data Management
Victor M. Markowitz, John Campbell, I-Min A. Chen, Anthony Kosky, Krishna Palaniappan, and Thodoros Topaloglou

10.1 Gene Expression Data Management: Background
    10.1.1 Gene Expression Data Spaces
    10.1.2 Standards: Benefits and Limitations
10.2 The GeneExpress System
    10.2.1 GeneExpress System Components
    10.2.2 GeneExpress Deployment and Update Issues
10.3 Managing Gene Expression Data: Integration Challenges
    10.3.1 Gene Expression Data: Array Versions
    10.3.2 Gene Expression Data: Algorithms and Normalization
    10.3.3 Gene Expression Data: Variability
    10.3.4 Sample Data
    10.3.5 Gene Annotations
10.4 Integrating Third-Party Gene Expression Data in GeneExpress
    10.4.1 Data Exchange Formats
    10.4.2 Structural Data Transformation Issues
    10.4.3 Semantic Data Mapping Issues
    10.4.4 Data Loading Issues
    10.4.5 Update Issues
10.5 Summary
Acknowledgments
Trademarks
References

11 DiscoveryLink
Laura M. Haas, Barbara A. Eckman, Prasad Kodali, Eileen T. Lin, Julia E. Rice, and Peter M. Schwarz

11.1 Approach
    11.1.1 Architecture
    11.1.2 Registration
11.2 Query Processing Overview
    11.2.1 Query Optimization
    11.2.2 An Example
    11.2.3 Determining Costs
11.3 Ease of Use, Scalability, and Performance
11.4 Conclusions
References

12 A Model-Based Mediator System for Scientific Data Management
Bertram Ludäscher, Amarnath Gupta, and Maryann E. Martone

12.1 Background
12.2 Scientific Data Integration Across Multiple Worlds: Examples and Challenges from the Neurosciences
    12.2.1 From Terminology and Static Knowledge to Process Context
12.3 Model-Based Mediation
    12.3.1 Model-Based Mediation: The Protagonists
    12.3.2 Conceptual Models and Registration of Sources at the Mediator
    12.3.3 Interplay Between Mediator and Sources
12.4 Knowledge Representation for Model-Based Mediation
    12.4.1 Domain Maps
    12.4.2 Process Maps
12.5 Model-Based Mediator System and Tools
    12.5.1 The KIND Mediator Prototype
    12.5.2 The Cell-Centered Database and SMART Atlas: Retrieval and Navigation Through Multi-Scale Data
12.6 Related Work and Conclusion
    12.6.1 Related Work
    12.6.2 Summary: Model-Based Mediation and Reason-Able Meta-Data
Acknowledgments
References

13 Compared Evaluation of Scientific Data Management Systems
Zoé Lacroix and Terence Critchlow

13.1 Performance Model
    13.1.1 Evaluation Matrix
    13.1.2 Cost Model
    13.1.3 Benchmarks
    13.1.4 User Survey
13.2 Evaluation Criteria
    13.2.1 The Implementation Perspective
    13.2.2 The User Perspective
13.3 Tradeoffs
    13.3.1 Materialized vs. Non-Materialized
    13.3.2 Data Distribution and Heterogeneity
    13.3.3 Semi-Structured Data vs. Fully Structured Data
    13.3.4 Text Retrieval
    13.3.5 Integrating Applications
13.4 Summary
References

Concluding Remarks
    Summary
    Looking Toward the Future

Appendix: Biological Resources

Glossary

System Information
    SRS
    Kleisli
    TAMBIS
    K2
    P/FDM Mediator
    GeneExpress
    DiscoveryLink
    KIND

Index


Contributors

Patricia Baker
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Simon Beaulah
LION Bioscience Ltd.
Cambridge, United Kingdom

Sean Bechhofer
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Andy Brass
Department of Computer Science
University of Manchester
Manchester, United Kingdom

John Campbell
Gene Logic Inc.
Data Management Systems
Berkeley, California

I-Min A. Chen
Gene Logic Inc.
Data Management Systems
Berkeley, California

Jing Chen
geneticXchange Inc.
Menlo Park, California

Su Yun Chung
The Center for Research on Biological Structure and Function
University of California, San Diego
La Jolla, California

Terence Critchlow
Lawrence Livermore National Laboratory
Livermore, California

Susan B. Davidson
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, Pennsylvania

Barbara A. Eckman
IBM Life Sciences
West Chester, Pennsylvania

Thure Etzold
LION Bioscience Ltd.
Cambridge, United Kingdom

Carole Goble
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Peter M. D. Gray
Department of Computing Science
University of Aberdeen
King's College
Aberdeen, Scotland, United Kingdom

Amarnath Gupta
San Diego Supercomputer Center
University of California, San Diego
San Diego, California

Laura M. Haas
IBM Silicon Valley Lab
San Jose, California

Howard Harris
LION Bioscience Ltd.
Cambridge, United Kingdom

Scott Harker
GlaxoSmithKline
King of Prussia, Pennsylvania

Graham J. L. Kemp
Department of Computing Science
Chalmers University of Technology
Göteborg, Sweden

Prasad Kodali
IBM Life Sciences
Somers, New York

Anthony Kosky
Gene Logic Inc.
Data Management Systems
Berkeley, California

Zoé Lacroix
Arizona State University
Tempe, Arizona

Eileen T. Lin
IBM Silicon Valley Lab
San Jose, California

Bertram Ludäscher
San Diego Supercomputer Center
University of California, San Diego
San Diego, California

Maryann E. Martone
Department of Neurosciences
University of California, San Diego
San Diego, California

Victor M. Markowitz
Gene Logic Inc.
Data Management Systems
Berkeley, California

Gary Ng
Network Inference Ltd.
London, United Kingdom

Krishna Palaniappan
Gene Logic Inc.
Data Management Systems
Berkeley, California

Norman W. Paton
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Julia E. Rice
IBM Almaden Research Center
San Jose, California

Peter M. Schwarz
IBM Almaden Research Center
San Jose, California

Robert Stevens
Department of Computer Science
University of Manchester
Manchester, United Kingdom

Val Tannen
Department of Computer and Information Science
University of Pennsylvania
Philadelphia, Pennsylvania

Thodoros Topaloglou
Gene Logic Inc.
Data Management Systems
Berkeley, California

Limsoon Wong
Institute for Infocomm Research
Singapore

John C. Wooley
Center for Research on Biological Structure and Function
University of California, San Diego
La Jolla, California


About the Authors

Dr. Zoé Lacroix is currently a Research Assistant Professor at Arizona State University. She received a PhD in Computer Science in 1996 from the University of Paris XI (France). Her research interests cover various aspects of data management, and she has published more than 20 journal articles, conference papers, and book chapters. She has also served on numerous conference program committees, organized several panels and workshops, and was an active member of the XML Query Language and XML Forms working groups at the World Wide Web Consortium (W3C). Dr. Lacroix has been involved in bioinformatics for more than 7 years. She has interacted with the Center of Bioinformatics at the University of Pennsylvania and worked for two biotechnology companies, Gene Logic Inc. and SurroMed Inc. Her contributions in bioinformatics include publications, invited talks (Symposium on Bioinformatics organized at the National University of Singapore), and data integration middleware, such as the Object-Web Wrapper, which is currently used at GlaxoSmithKline.

Dr. Terence Critchlow is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory (LLNL) and leads the DataFoundry project. His involvement in bioinformatics began more than 7 years ago as part of a collaboration between the University of Utah Computer Science department and the Utah Human Genome Center. Since completing his dissertation and joining LLNL in 1997, he has been an active member of the research community, publishing in both computer science and informatics forums, giving invited talks, participating in program committees, and organizing the XML Enabled Searches in Bioinformatics workshop.

Preface Preface

Pu rpose a nd G oals Purpose and Goals


Bioinformatics Bioinformatics can can refer refer to to almost almost any any collaborative collaborative effort effort between between biologists biologists or or geneticists geneticists and and computer computer scientists scientists and and thus thus covers covers a a wide wide variety variety of of traditional traditional computer computer science science domains, domains, including including data data modeling, modeling, data data retrieval, retrieval, data data mining, mining, data data integration, integration, data data managing, managing, data data warehousing, warehousing, data data cleaning, cleaning, ontologies, ontologies, sim simulation, parallel computing, ulation, parallel computing, agent-based agent-based technology, technology, grid grid computing, computing, and and visual visualization. ization. However, However, applying applying each each of of these these domains domains to to biomolecular biomolecular and and biomedical biomedical applications applications raises raises specific specific and and unexpectedly unexpectedly challenging challenging research research issues. issues. In In this this book, book, we we focus focus on on data data management management and and in in particular particular data data integration, integration, as applies to genomics and as it it applies to genomics and microbiology. microbiology. This This is is an an important important topic topic because because data data are multiple sources, obtaining are spread spread across across multiple sources, preventing preventing scientists scientists from from efficiently efficiently obtaining the the information information required required to to perform perform their their research research (on (on average, average, a a pharmaceutical pharmaceutical company company uses uses 40 40 data data sources). sources). 
In this environment, answering a single question may require accessing several data sources and calling on sophisticated analysis tools (e.g., sequence alignment, clustering, and modeling tools). While data integration is a dynamic research area in the database community, the specific needs of biologists have led to the development of numerous middleware systems that provide seamless data access in a results-driven environment (eight middleware systems are described in detail in this book). The objective of the book is to provide life scientists and computer scientists with a complete view on biological data management by: (1) identifying specific issues in biological data management, (2) presenting existing solutions from both academia and industry, and (3) providing a framework in which to compare these systems.

Book Audience


This book is intended to be useful to a wide audience. Students, teachers, bioinformaticians, researchers, practitioners, and scientists from both academia and industry may all benefit from its material. It contains a comprehensive description of issues for biological data management and an overview of existing systems, making it appropriate for introductory and instructional purposes. Developers not yet familiar with bioinformatics will appreciate descriptions of the numerous challenges that need to be addressed and the various approaches that have been developed to solve them. Bioinformaticians may find the description of existing systems and the list of challenges that remain to be addressed useful. Decision makers will benefit from the evaluation framework, which will aid in their selection of the integration system that best fits the needs of their research laboratory or company. Finally, life scientists, the ultimate users of these systems, may be interested in understanding how they are designed and evaluated.

Topics and Organization


The book is organized as follows: Four introductory chapters are followed by eight chapters presenting systems, an evaluation chapter, a summary, a glossary, and an appendix. The introduction further refines the focus of this book and provides a working definition of bioinformatics. It also presents the steps that lead to the development of an information system, from its design to its deployment. Chapter 2 introduces the challenges faced by the integration of biological information. Chapter 3 refines these challenges into use cases and provides life scientists a translation of their needs into technical issues. Chapter 4 illustrates why traditional approaches often fail to meet life scientists' needs. The following eight chapters each present an approach that was designed and developed to provide life scientists integrated access to data from a variety of distributed, heterogeneous data sources. The presented approaches provide a comprehensive overview of current technology.
Each of these chapters is written by the main inventors of the presented system, specifies its requirements, and provides a description of both the chosen approach and its implementation. Because of the self-contained nature of these chapters, they may be read in any order. Chapter 13 provides users and developers with a methodology to evaluate presented systems. Such a methodology may be used to select the system most appropriate for an organization, to compare systems, or to evaluate a system developed in-house. The summary reiterates the state of the art, existing solutions, and new challenges that need to be addressed. The appendix contains a list of useful biological resources (databases, organizations, and applications) organized in three tables. The acronyms commonly used to refer to them and used in the chapters of this book are spelled out, and current URLs are provided so that readers can access complete information.

Each of the chapters uses various technical terms. Because these terms involve expertise in life science and computer science, a glossary providing the spelling of acronyms or short definitions is provided at the end of the book.

Acknowledgments


Such a book requires hard work from a large number of individuals and organizations, and although we are not able to explicitly acknowledge everyone involved, we would like to thank as many as possible for their contributions. We are obviously indebted to those individuals who contributed chapters, as this book would not have been as informative without them. Most of these contributions came in the form of detailed system descriptions. Whereas there are many bioinformatics data integration systems currently available, we selected several of the larger, better-known systems to include in this book. We are fortunate that key individuals working on these projects were willing and able to devote their time and energy to provide detailed descriptions of their systems. The fact that these contributors include the key architects of the systems makes them much more insightful than would otherwise be possible. We are also fortunate that Su Yun Chung, John Wooley, and Barbara Eckman were able to contribute their insights on a life scientist perspective of bioinformatics.
Beyond this obvious group, others contributed, directly and indirectly, to the final version of this book. We would like to thank our reviewers for their extremely helpful suggestions and our publishers for their support and tireless work bringing everything together. The manuscript reviewers included: Johann-Christoph Freytag, Humboldt-Universität zu Berlin; Mark Graves, Berlex; Michael Hucka, California Institute of Technology; Sean Mooney, Stanford University; and Shalom (Dick) Tsur, Ph.D., The Real-Time Enterprise Group. We would also like to thank Tom Slezak and Krishna Rajan for contributions that were not able to be included in the final version of this book.
Finally, Terence Critchlow would like to thank Carol Woodward for ongoing moral support, and Pete Eltgroth for providing the resources he used to perform this work. He would also like to extend his appreciation to Lawrence Livermore National Laboratory for their support of his effort and to acknowledge that this work was partially performed under the auspices of the U.S. DOE by LLNL under contract No. W-7405-ENG-48.

CHAPTER 1

Introduction
Zoé Lacroix and Terence Critchlow

1.1 OVERVIEW


Bioinformatics and the management of scientific data are critical to support life science discovery. As computational models of proteins, cells, and organisms become increasingly realistic, much biology research will migrate from the wet-lab to the computer. Successfully accomplishing the transition to biology in silico, however, requires access to a huge amount of information from across the research community. Much of this information is currently available from publicly accessible data sources, and more is being added daily. Unfortunately, scientists are not currently able to easily identify and exploit this information because of the variety of semantics, interfaces, and data formats used by the underlying data sources. Providing biologists, geneticists, and medical researchers with integrated access to all of the information they need in a consistent format requires overcoming a large number of technical, social, and political challenges.
As a first step in helping to understand these issues, the book provides an overview of the state of the art of data integration and interoperability in genomics. This is accomplished through a detailed presentation of systems currently in use and under development as part of bioinformatics efforts at several organizations from both industry and academia. While each system is presented as a stand-alone chapter, the same questions are answered in each description. By highlighting a variety of systems, we hope not only to expose the different alternatives that are actively being explored, but more importantly, to give insight into the strengths and weaknesses of each approach. Given that an ideal bioinformatics environment remains an unattainable dream, compromises need to be made in the development of any real-world system.
Understanding the tradeoffs inherent in different approaches, and combining that knowledge with specific organizational needs, is the best way to determine which alternative is most appropriate for a given situation.
Because we hope this book will be useful to both computer scientists and life scientists with varying degrees of familiarity with bioinformatics, three introductory chapters put the discussion in context and establish a shared vocabulary. The challenges faced by this developing technology for the integration of biological information are presented in Chapter 2. The complexity of use cases and the variety of techniques needed to support these needs are exposed in Chapter 3. This chapter also discusses the translation from specification to design, including the most common issues raised when performing this transformation in the life sciences domain. The difficulty of face-to-face communication between demanding users and developers is evoked in Chapter 4, in which examples are used to highlight the difficulty involved in directly transferring existing data management approaches to bioinformatics systems. These chapters describe the nuances that differentiate real-world bioinformatics from technology transferred from other domains. Whereas these nuances may be skeptically viewed as simple justifications for working on solved problems, they are important because bioinformatics occurs in the real world, complete with its ugly realities, not in an abstract environment where convenient assumptions can be used to simplify problems.
These introductory chapters are followed by the heart of this book, the descriptions of eight distinct bioinformatics systems. These systems are the results of collaborative efforts between the database community and the genomics community to develop technology to support scientists in the process of scientific discovery. Systems such as Kleisli (Chapter 6) were developed in the early stages of bioinformatics and matured through meetings on the Interconnection of Molecular Biology Databases (the first of the series was organized at Stanford University in the San Francisco Bay Area, August 9-12, 1994). Others, such as DiscoveryLink (Chapter 11), are recent efforts to adapt sophisticated data management technology to specific challenges facing bioinformatics. Each chapter has been written by the primary contributor(s) to the system being described. This perspective provides precious insight into the specific problem being addressed by the system, why the particular architecture was chosen, its strengths, and any weaknesses it may have. To provide an overall summary of these approaches, advantages and disadvantages of each are summarized and contrasted in Chapter 13.

1.2 PROBLEM AND SCOPE


In the last decade, biologists have experienced a fundamental revolution from traditional research and development (R&D) consisting in discovering and understanding genes, metabolic pathways, and cellular mechanisms to large-scale, computer-based R&D that simulates the disease, the physiology, the molecular mechanisms, and the pharmacology [1]. This represents a shift away from life science's empirical roots, in which it was an iterative and intuitive process. Today it is systematic and predictive with genomics, informatics, automation, and miniaturization all playing a role [2]. This fusion of biology and information science is expected to continue and expand for the foreseeable future. The first consequence of this revolution is the explosion of available data that biomolecular researchers have to harness and exploit. For example, an average pharmaceutical company currently uses information from at least 40 databases [1], each containing large amounts of data (e.g., as of June 2002, GenBank [3, 4] provides access to 20,649,000,000 bases in 17,471,000 sequences) that can be analyzed using a variety of complex tools such as FASTA [5], BLAST [6], and LASSAP [7].
Over the past several years, bioinformatics has become both an all-encompassing term for everything relating to computer science and biology, and a very trendy one.¹ There are a variety of reasons for this including: (1) as computational biology evolves and expands, the need for solutions to the data integration problems it faces increases; (2) the media are beginning to understand the implications of the genomics revolution that has been going on for the last 15 or more years; (3) the recent headlines and debates surrounding the cloning of animals and humans; and (4) to appear cutting edge, many companies have relabeled the work that they are doing as bioinformatics, and similarly many people have become bioinformaticians instead of geneticists, biologists, or computer scientists. As these events have occurred, the generally accepted meaning of the word bioinformatics has grown from its original definition of managing genomics data to include topics as diverse as patient record keeping, molecular simulations of protein sequences, cell and organism level simulations, experimental data analysis, and analysis of journal articles. A recent definition from the National Institutes of Health (NIH) phrases it this way:
Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. [8]

This definition could be rephrased as: Bioinformatics is the design and development of computer-based technology that supports life science. Using this definition, bioinformatics tools and systems perform a diverse range of functions including: data collection, data mining, data analysis, data management, data integration, simulation, statistics, and visualization. Computer-aided technology directly supporting medical applications is excluded from this definition and is referred to as medical informatics. This book is not an attempt at authoritatively describing the gamut of information contained in this field. Instead, it focuses on the area of genomics data integration, access, and interoperability as these areas form the cornerstone of the field. However, most of the presented approaches are generic integration systems that can be used in many similar scientific contexts.
This emphasis is in line with the original focus of bioinformatics, which was on the creation and maintenance of data repositories (flat files or databases) to store biological information, such as nucleotide and amino acid sequences. The development of these repositories mostly involved schema design issues (data organization) and the development of interfaces whereby scientists could access, submit, and revise data. Little or no effort was devoted to traditional data management issues such as storage, indexing, query languages, optimization, or maintenance. The number of publicly available scientific data repositories has grown at an exponential rate, to the point where, in 2000, there were thousands of public biomolecular data sources. In 2003, Baxevanis listed 372 key databases in molecular biology alone [9]. Because these sources were developed independently, the data they contain are represented in a wide variety of formats, are annotated using a variety of methods, and may or may not be supported by a database management system.

1. The sentence claims that computer science is relating to biology. Whenever one refers to this "relationship," one uses the term bioinformatics.

1.3 BIOLOGICAL DATA INTEGRATION


Data integration issues have stymied computer scientists and geneticists alike for the last 20 years, and yet successfully overcoming them is critical to the success of genomics research as it transitions from a wet-lab activity to an electronic-based activity in which data are used to drive the increasingly complicated research performed on computers. This research is motivated by scientists striving to understand not only the data they have generated but, more importantly, the information implicit in these data, such as relationships between individual components. Only through this understanding will scientists be able to successfully model and simulate entire genomes, cells, and ultimately entire organisms. Whereas the need for a solution is obvious, the underlying data integration issues are not as clear. Chapter 4 goes into detail about the specific computer science problems and how they are subtly different from those encountered in other areas of computer science.
Many of of the the problems problems facing facing genomics genomics data data integration integration are are related related to to data data semantics-the semantics~the meaning meaning of of the the data data represented represented in in a a data data source-and source~and the the differences differences between between the the semantics semantics within within a a set set of of sources. sources. These These differences differences can can require require addressing addressing issues issues surrounding surrounding concept concept identifica identification, data transformation, and concept overloading. Concept identification tion, data transformation, and concept overloading. Concept identification and resolution has has two two components: components: identifying identifying when when data data contained contained in in different different data data sources sources refer refer to to the the same same object object and and reconciling reconciling conflicting conflicting information information found found in in


these sources. Addressing these issues should begin by identifying which abstract concepts are represented in each data source. Once shared concepts have been identified, conflicting information can be easily located. As a simple example, two sources may have different values for an attribute that is supposed to be the same. One of the wrinkles that genomics adds to the reconciliation process is that there may not be a "right" answer. Consider that a sequence representing the same gene should be identical in two different data sources. However, there may be legitimate differences between two sources, and these differences need to be preserved in the integrated view. This makes a seemingly simple query, "return the sequence associated with this gene," more complex than it first appears. In the case where the differences are the result of alternative data formats, data transformations may be applied to map the data to a consistent format.
Whereas mapping may be simple from a technical perspective, determining what the mapping is and when to apply it relies on the detailed representation of the concepts and appropriate domain knowledge. For example, the translation of a protein sequence from a single-character representation to a three-character representation defines a corresponding mapping between the two representations. Not all transformations are easy to perform, and some may not be invertible. Furthermore, because of concept overloading, it is often difficult to determine whether or not two abstract concepts really have the same meaning, and to figure out what to do if they do not. For example, although two data sources may both represent genes as DNA sequences, one may include sequences that are postulated to be genes, whereas the other may only include sequences that are known to code for proteins. Whether or not this distinction is important depends on the specific application and the semantics that the unified view is supporting.
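The single-character to three-character translation mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from any system in this book: it covers only the 20 standard amino acids, and the function names are hypothetical. It also shows how a small normalization step (folding input to uppercase) makes a transformation non-invertible whenever a source uses letter case to carry extra information, which is assumed here purely for illustration.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# The reverse mapping exists because the forward mapping is one-to-one.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(sequence: str) -> str:
    """Translate a one-letter protein sequence into three-letter codes."""
    codes = []
    # Folding to uppercase discards any information carried by letter case.
    for residue in sequence.upper():
        if residue not in ONE_TO_THREE:
            raise ValueError(f"unknown one-letter residue code: {residue!r}")
        codes.append(ONE_TO_THREE[residue])
    return "".join(codes)


def to_one_letter(sequence: str) -> str:
    """Translate concatenated three-letter codes back to one-letter codes."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    residues = []
    for start in range(0, len(sequence), 3):
        code = sequence[start:start + 3].capitalize()
        if code not in THREE_TO_ONE:
            raise ValueError(f"unknown three-letter residue code: {code!r}")
        residues.append(THREE_TO_ONE[code])
    return "".join(residues)


if __name__ == "__main__":
    print(to_three_letter("MKV"))        # MetLysVal
    print(to_one_letter("MetLysVal"))    # MKV
```

The round trip recovers "MKV" exactly, but an input of "mkv" would also come back as "MKV": the table lookup is invertible, while the case folding applied before it is not.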
The number of subtly distinct concepts used in genomics and the use of the same name to refer to multiple variants make overcoming these conflicts difficult. Unfortunately, the semantics of biological data are usually hard to define precisely because they are not explicitly stated but are implicitly included in the database design. The reason is simple: At a given time, within a single research community, common definitions of various terms are often well understood and have precise meaning. As a result, the semantics of a data source are usually understood by those within that community without needing to be explicitly defined. However, genomics (much less all of biology or life science) is not a single, consistent scientific domain; it is composed of dozens of smaller, focused research communities. This would not be a significant issue if researchers only accessed data from within a single domain, but that is not usually the case.
Typically, researchers require integrated access to data from multiple domains, which requires resolving terms that have slightly different meanings across the communities. This is further complicated by the observations that the specific community whose terminology


is being used by the data source is usually not explicitly identified and that the terminology evolves over time. For many of the larger community data sources, the domain is obvious: the Protein Data Bank (PDB) handles protein structure information, and the Swiss-Prot protein sequence database provides protein sequence information and useful annotations. However, the terminology used may not be current and can reflect a combination of definitions from multiple domains. The terminology used in smaller data sources, such as the Drosophila database, is typically selected based on a specific usage model. Because this model can involve using concepts from several different domains, the data source will use whatever definitions are most intuitive, mixing the domains as needed.

Biology also demonstrates three challenges for data integration that are common in evolving scientific domains but not typically found elsewhere.
The first is the sheer number of available data sources and the inherent heterogeneity of their contents. The World Wide Web has become the preferred approach for disseminating scientific data among researchers, and as a result, literally hundreds of small data sources have appeared over the past 10 years. These sources are typically a "labor of love" for a small number of people. As a result, they often lack the support and resources to provide detailed documentation and to respond to community requests in a timely manner. Furthermore, if the principal supporter leaves, the site usually becomes completely unsupported. Some of these sources contain data from a single lab or project, whereas others are the definitive repositories for very specific types of information (e.g., for a specific genetic mutation).
Not only do these sources complicate the concept identification issue previously mentioned (because they use highly specialized data semantics), but their number makes it infeasible to incorporate all of them into a consistent repository. Second, the data formats and data access methods (associated interfaces) change regularly. Many data providers extend or update their data formats approximately every 6 months, and they modify their interfaces with the same frequency. These changes are an attempt to keep up with the scientific evolution occurring in the community at large. However, a change in a data source representation can have a dramatic impact on systems that integrate that source, causing the integration to fail on the new format or, worse, introducing subtle errors into the systems. As a result of this problem, bioinformatics infrastructures need to be more flexible than systems developed for more static domains. Third, the data and related analysis are becoming increasingly complex.
As the nature of genomics research evolves from a predominantly wet-lab activity into knowledge-based analysis, the scientists' need for access to the wide variety of available information increases dramatically. To address this need, information needs to be brought together from various heterogeneous data sources and presented to researchers in ways that allow them to answer their questions. This means


providing access not only to the sequence data that are commonly stored in data sources today, but also to multimedia information such as expression data, expression pathway data, and simulation results. Furthermore, this information needs to be available for a large number of organisms under a variety of conditions.
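The second challenge described above, source formats that change regularly and can introduce subtle errors into an integration system, is often mitigated by checking the expected layout explicitly before loading any data. The following Python sketch illustrates the idea under stated assumptions: the tab-delimited layout, the column names, and the FormatChangedError class are all hypothetical and do not correspond to any real data source.

```python
import csv
import io

# Hypothetical layout the integration code was written against.
EXPECTED_COLUMNS = ("gene_id", "organism", "sequence")


class FormatChangedError(ValueError):
    """Raised when a source no longer matches the expected layout."""


def read_records(text: str) -> list:
    """Parse tab-delimited records, failing loudly if the format changed."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    found = tuple(reader.fieldnames or ())
    if found != EXPECTED_COLUMNS:
        raise FormatChangedError(
            f"expected columns {EXPECTED_COLUMNS}, found {found}"
        )
    records = []
    for line_no, row in enumerate(reader, start=2):
        # DictReader stores surplus fields under the key None and fills
        # missing fields with the value None; both signal a layout change.
        if None in row or any(value is None for value in row.values()):
            raise FormatChangedError(f"line {line_no}: wrong number of fields")
        records.append(row)
    return records
```

Rejecting an unexpected header outright turns a silent, subtle error into an immediate and visible failure, which is usually the easier of the two to diagnose.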

1.4 DEVELOPING A BIOLOGICAL DATA INTEGRATION SYSTEM

The development of a biological data integration and management system has to overcome the difficulties outlined in Section 1.3. However, there is no obvious best approach to doing this, and thus each of the systems presented in this book addresses these issues differently. Furthermore, comparing and contrasting these systems is extremely difficult, particularly without a good understanding of how they were developed. This is because the goals of each system are subtly different, as reflected by the system requirements defined at the outset of the design process. Understanding the development environment and motivation behind the initial system constraints is critical to understanding the tradeoffs that were made later in the design process and the reasons why.

1.4.1 Specifications

The design of a system starts with collecting requirements that express, among other things:
• Who the users of the system will be
• What functionality the system is expected to have
• How this functionality is to be viewed by the users
• The performance goals for the system
System requirements (or specifications) describe the desired system and can be seen as a contract agreed upon by the target users (or their surrogates) and the developers. Furthermore, these requirements can be used to determine if a delivered system performs properly.
The user profile is a concise description of who the target users for a system are and what knowledge and experience they can be assumed to have. Specifying the user profile involves agreeing on the level of computer literacy expected of users (e.g., Are there programmers helping the scientists access the data? Are the users expected to know any programming language?), the type of interface the users will


have (e.g., Will there be a visual interface? A user-customizable interface?), the security issues that need to be addressed, and a multitude of other concerns. Once the user profile is defined, the tasks the system is supposed to perform must be analyzed. This analysis consists of listing all the tasks the system is expected to perform, typically through use cases, and involves answering questions such as: What are the sources the system is expected to integrate? Will the system allow users to express queries? If so, in what form and how complex will they be? Will the system incorporate scientific applications? Will it allow users to navigate scientific objects? Finally, technical issues must be agreed upon.
These issues include the platforms the system is expected to work on (e.g., UNIX, Microsoft, Macintosh), its scalability (i.e., the amount of data it can handle, the number of queries it can simultaneously support, and the number of data sources that can be integrated), and its expected efficiency with respect to data storage size, communication overhead, and data integration overhead. The collection of these requirements is traditional in every engineering task. However, in established engineering areas there are often intermediaries that initially evaluate the needs for new technology and significantly facilitate the definition of system specifications. Unfortunately, this is not the case in life sciences. Although technology is required to address complex user needs, the scientists generally communicate their needs directly to the system designers.
While communication between specialists in different domains is inherently difficult, bioinformatics faces an additional challenge: the speed at which the underlying science is evolving. A common result of this is that both scientists and developers become frustrated. Scientists are frustrated because systems are not able to keep up with their ever-changing requirements, and developers are frustrated because the requirements keep changing on them. The only way to overcome this problem is to have an intermediary between the specialists. A common goal can be formulated and achieved by forging a bridge between the communities and accurately representing the requirements and constraints of both sides.

1.4.2 Translating Specifications into a Technical Approach

Once the specifications have been agreed upon, they can be translated into a set of approaches. This can be thought of as an optimization problem in which the hard constraints define a feasibility region, and the goal is to minimize the cost of the system while maximizing its usefulness and staying within that region. Each attribute in the system description can be mapped to a dimension. Existing data management approaches can then be mapped to overlapping regions in this space.


Once the optimal location has been identified, these approaches can be used as a starting point for the implementation. Obviously, this problem is not always formally specified, but considering it in this way provides insight into the appropriate choices. For example, in the dimension of storage costs, two alternatives can be considered: materializing the data and not materializing it. The materialized approach collects data from various sources and loads them into a single system. This approach is often closely related to a data warehousing approach and is favored when the specifications include characteristics such as data curation, infrequent data updates, high reliability, and high levels of security. The non-materialized approach integrates all the resources by collecting the requested data from the distributed data sources at query execution time.
Thus, if the specifications require up-to-date data or the ability to easily include new resources in the integration, a non-materialized approach would be more appropriate.
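The contrast between the two alternatives can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions, not the design of any system in this book: each data source is modeled as a hypothetical callable that returns all of its records, records are plain dictionaries with a gene_id field, and the class names are invented for this example. Each returned record is tagged with the name of its source, so legitimately different values from different sources are kept side by side rather than merged.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical source interface: calling a source returns all its records.
Source = Callable[[], List[Record]]


class MaterializedIntegrator:
    """Warehouse style: copy all source data locally, then query the copy."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self._sources = sources
        self._warehouse: List[Record] = []

    def refresh(self) -> None:
        # Loading happens here, not at query time, so queries may be stale.
        self._warehouse = [
            {**record, "source": name}
            for name, fetch in self._sources.items()
            for record in fetch()
        ]

    def query(self, gene_id: str) -> List[Record]:
        return [r for r in self._warehouse if r["gene_id"] == gene_id]


class NonMaterializedIntegrator:
    """Collect the requested data from every source at query execution time."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self._sources = sources

    def query(self, gene_id: str) -> List[Record]:
        return [
            {**record, "source": name}
            for name, fetch in self._sources.items()
            for record in fetch()
            if record["gene_id"] == gene_id
        ]
```

In this sketch the materialized integrator keeps answering from its local copy until refresh() is called again, which suits curated, infrequently updated data, whereas the non-materialized integrator always reflects the current state of the sources at the cost of contacting every source on every query.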

1.4.3 Development Process

System development implements the approaches identified in Section 1.4.2, possibly extending them to meet specific constraints. System development is often an iterative process in which the following steps are repeatedly performed as capabilities are added to the system:
• Code design: describing the various software components/objects and their respective capabilities
• Implementation: actually writing the code and getting it to execute properly
• Testing: evaluating the implementation, identifying and correcting bugs
• Deployment: transferring the code to a set of users
The formal deployment of a system often includes an analysis of the tests and training the users. The final phases are the system migration and the operational process. More information on managing a programming project can be found in Managing a Programming Project: Processes and People [10].

1.4.4 Evaluation of the System

Two systems may have the same specifications and follow the same approach yet end up with radically different implementations. The eight systems presented in the book (Chapters 5 through 12) follow various approaches. Their design and implementation choices lead to vastly different systems. These chapters provide few


details on the numerous design and implementation decisions and instead focus on the main characteristics of their systems. This will provide some insight into the vast array of tradeoffs that are possible while still developing feasible systems. There are several metrics by which a system can be evaluated. One of the most obvious is whether or not it meets its requirements. However, once the specifications are satisfied, there are many characteristics that reflect a system's performance. Although similar criteria may be used to compare two systems that have the same specifications, these same criteria may be misleading when the specifications differ. As a result, evaluating systems typically requires insight into the system design and implementation and information on users' satisfaction. Although such a difficult task is beyond the scope of this book, in Chapter 13 we outline a set of criteria that can be considered a starting point for such an evaluation.


REFERENCES

[1] M. Peitsch. "From Genome to Protein Space." Presentation at the Fifth Annual Symposium in Bioinformatics, Singapore, October 2000.

[2] D. Valenta. "Trends in Bioinformatics: An Update." Presentation at the Fifth Annual Symposium in Bioinformatics, Singapore, October 2000.

[3] D. Benson, I. Karsch-Mizrachi, D. Lipman, et al. "GenBank." Nucleic Acids Research 31, no. 1 (2003): 23-27, http://www.ncbi.nlm.nih.gov/Genbank.

[4] "Growth of GenBank." (2003): http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.

[5] W. Pearson and D. Lipman. "Improved Tools for Biological Sequence Comparison." Proceedings of the National Academy of Sciences of the United States of America 85, no. 8 (April 1988): 2444-2448.

[6] S. Altschul, W. Gish, W. Miller, et al. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215, no. 3 (October 1990): 403-410, http://www.ncbi.nlm.nih.gov/BLAST.

[7] E. Glémet and J-J. Codani. "LASSAP: A Large Scale Sequence Comparison Package." Bioinformatics 13, no. 2 (1997): 137-143.

[8] NCBI. "Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources." A Science Primer (November 2002): http://www4.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

[9] A. Baxevanis. "The Molecular Biology Database Collection: 2003 Update." Nucleic Acids Research 31, no. 1 (2003): 1-12, http://nar.oupjournals.org/cgi/content/full/31/1/1.

[10] P. Metzger and J. Boddie. Managing a Programming Project: Processes and People. Upper Saddle River, NJ: Prentice Hall, 1996.

CHAPTER 2

Challenges Faced in the Integration of Biological Information

Su Yun Chung and John C. Wooley

Biologists, in attempting to answer a specific biological question, now frequently choose their direction and select their experimental strategies by way of an initial computational analysis. Computers and computer tools are naturally used to collect and analyze the results from the largely automated instruments used in the biological sciences. However, far more pervasive than this type of requirement, the very nature of the intellectual discovery process requires access to the latest version of the worldwide collection of data, and the fundamental tools of bioinformatics now are increasingly part of the experimental methods themselves. A driving force for life science discovery is turning complex, heterogeneous data into useful, organized information and ultimately into systematized knowledge. This endeavor is simply the classic pathway for all science, Data ⇒ Information ⇒ Knowledge ⇒ Discovery, which earlier in the history of biology required only brainpower and pencil and paper but now requires sophisticated computational technology.

In this chapter, we consider the challenges of information integration in biology from the perspective of researchers using information technology as an integral part of their discovery processes. We also discuss why information integration is so important for the future of biology and why and how the obstacles in biology differ substantially from those in the commercial sector, that is, from the expectations of traditional business integration. In this context, we address features specific to the biological systems and their research approaches. We then discuss the burning issues and unmet needs facing information integration in the life sciences. Specifically, data integration, meta-data specification, data provenance and data quality, ontology, and Web presentations are discussed in subsequent sections. These are the fundamental problems that need to be solved by the bioinformatics community so that modern information technology can have a deeper impact on the progress of biological discovery. This chapter raises the challenges rather than trying to establish specific, ideal solutions for the issues involved.


2.1 THE LIFE SCIENCE DISCOVERY PROCESS


In the last half of the 20th century, a highly focused, hypothesis-driven approach known as reductionist molecular biology gave scientists the tools to identify and characterize molecules and cells, the fundamental building blocks of living systems. To understand how molecules, and ultimately cells, function in tissues, organs, organisms, and populations, biologists now generally recognize that as a community they not only have to continue reductionist strategies for the further elucidation of the structure and function of individual components, but they also have to adopt a systems-level approach in biology. Systems analysis demands not just knowledge of the parts (genes, proteins, and other macromolecular entities) but also knowledge of the connection of these molecular parts and how they work together. In other words, the pendulum of bioscience is now swinging away from reductionist approaches and toward synthetic approaches characteristic of systems biology and of an integrated biology capable of quantitative and/or detailed qualitative predictions. A synthetic or integrated view of biology obviously will depend critically on information integration from a variety of data sources. For example, neuroinformatics includes the anatomical and physiological features of the nervous system, and it must interact with the molecular biological databases to facilitate connections between the nervous system and molecular details at the level of genes and proteins.¹ In phylogeny and evolution biology, comparative genomics is making new impacts on evolutionary studies. Over the past two decades, research in evolutionary biology has come to depend on sequence comparisons at the gene and protein level, and in the future, it will depend more and more on tracking not just DNA sequences but how entire genomes evolve over time [1]. In ecology there is an opportunity ultimately to study the sequences of all genomes involved in an entire ecological community. We believe integration bioinformatics will be the backbone of 21st-century life sciences research.

1. For information about neuroinformatics, refer to the Human Brain Project at the National Institute of Mental Health (http://www.nimh.nih.gov/neuroinformatics/abs.cfm).

Research discovery and synthesis will be driven by the complex information arising intrinsically from biology itself and from the diversity and heterogeneity of experimental observations. The database and computing activities will need to be integrated to yield a cohesive information infrastructure underlying all of biology. A conceptual example of how biological research has increasingly come to depend on the integration of experimental procedures and computation activities is illustrated in Figure 2.1. A typical research project may start with a collection of known or unknown genomic sequences (see Genomics in Figure 2.1). For unknown sequences, one may conduct a database search for similar sequences or use various gene-finding computer algorithms or genome comparisons to predict the putative genes. To probe expression profiles of these genes/sequences, high-density microarray gene expression experiments may be carried out. The analysis of expression profiles of up to 100,000 genes can be conducted experimentally, but this requires powerful computational correlation tools. Typically, the first level of experimental data stream output for a microarray experiment (laboratory information management system [LIMS] output) is a list of genes/sequences/identification numbers and their expression profile. Patterns or correlations within the massive data points are not obvious by manual inspection. Different computational clustering algorithms are used simultaneously to reduce the data complexity and to sort out relationships among genes/sequences according to their expression levels or changes in expression levels. These clustering techniques, however, have to deal with a high-dimensional data element space; the possibility for correlation by chance is high because a set of genes clustered together does not necessarily imply participation in a common biological process. To back up the clustering results, one may proceed to proteomics (see Figure 2.1) to connect the gene expression results with available protein expression patterns, known protein structures and functions, and protein-protein interaction data. Ultimately, the entire collection of interrelated macromolecular information may be considered in the context of systems biology (see Figure 2.1), which includes analyses of protein or metabolic pathways, regulatory networks, and other, more complex cellular processes. The connections and interactions among areas of genomics, gene expression profiles, proteomics, and systems biology depend on the integration of experimental procedures with database searches and the applications of computational algorithms and analysis tools.

FIGURE 2.1 Information-driven discovery. [Diagram: four linked areas, each connected to Databases and to Computational Analysis Tools. Genomics: sequence, gene finding, genome comparisons. Gene Expression Profiles: microarray experiments (LIMS outputs). Proteomics: protein expression, structures, functions, interactions. Systems Biology: regulatory network, metabolic pathway, protein pathway, cellular process.]

As one moves up in the degree of complexity of the biological processes under study, our understanding at each level depends in a significant way on the levels beneath it. In every step, database searches and computational analysis of the data are an integral part of the discovery process. As we choose complex systems for study, experimentally generated data must be combined with data derived from databases and computationally derived models or simulations for best interpretation. On the other hand, modeling and simulation of protein-protein interactions, protein pathways, genetic regulatory networks, biochemical and cellular processes, and normal and disease physiological states are in their infancy and need more experimental observations to fill in missing quantitative details for mature efforts. In this close interaction, the boundaries between experimentally generated data and computationally generated data are blurring. Thus, accelerating progress now requires multidisciplinary teams to conduct integrated approaches. Thus, in silico discovery, that is, experiments carried out with a computer, is fully complementary to traditional wet-laboratory experiments. One could say that an information infrastructure, coupled with continued advances in experimental methods, will facilitate computing an understanding of biology.
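The caution about chance correlation in high-dimensional expression data can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the chapter: the function names, the choice of 200 genes and 6 conditions, and the 0.9 threshold are all assumptions. It generates purely random expression profiles, so any strong correlation it finds between two "genes" has no biological meaning at all.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def count_chance_pairs(n_genes, n_conditions, threshold, seed=0):
    """Count gene pairs whose random profiles correlate at or above threshold.

    Every profile is independent Gaussian noise (an assumption made for
    illustration), so each reported pair is a false association.
    Returns (number of such pairs, total number of pairs compared).
    """
    rng = random.Random(seed)
    profiles = [
        [rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
        for _ in range(n_genes)
    ]
    hits = 0
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(pearson(profiles[i], profiles[j])) >= threshold:
                hits += 1
    return hits, n_genes * (n_genes - 1) // 2


if __name__ == "__main__":
    hits, pairs = count_chance_pairs(n_genes=200, n_conditions=6, threshold=0.9)
    print(f"{hits} of {pairs} random gene pairs have |r| >= 0.9")
```

With many genes and only a handful of measured conditions, a noticeable number of pairs clear the threshold by chance alone, which is one reason clustering results are usually cross-checked against independent evidence such as the proteomics data mentioned above.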
On On the the other other hand, modeling and and simulation of protein-protein protein-protein interactions, interactions, protein protein pathways, pathways, genetic genetic regulatory regulatory networks, networks, biochemical biochemical and and cellular cellular processes, processes, and and normal normal and and disease disease physiological physiological states states are are in in their their infancy infancy and and need need more more experimental experimental observations observations to to fill fill in in missing missing quantitative quantitative details details for for mature mature efforts. efforts. In In this this close close interaction, interaction, the the boundaries boundaries between between experimentally experimentally generated generated data data and computation ally generated data are blurring. Thus, accelerating progress now and computationally generated data are blurring. Thus, accelerating progress now requires multi disciplinary teams Thus, in requires multidisciplinary teams to to conduct conduct integrated integrated approaches. approaches. Thus, in silica silico discovery, is, experiments experiments carried discovery, that that is, carried out out with with a a computer, computer, is is fully fully complemen complementary tary to to traditional traditional wet-laboratory wet-laboratory experiments. experiments. One One could could say say that that an an information information infrastructure, will infrastructure, coupled coupled with with continued continued advances advances in in experimental experimental methods, methods, will facilitate facilitate computing computing an an understanding understanding of of biology. biology.

2.2 AN INFORMATION INTEGRATION ENVIRONMENT FOR LIFE SCIENCE DISCOVERY

Biological data sources represent the collective research efforts and products of the life science communities throughout the world. The growth of the Internet and the availability of biological data sources on the Web have opened up a tremendous opportunity for biologists to ask questions and solve problems in unprecedented ways. To harness these community resources and assemble all available information to investigate specific biological problems, biologists must be able to find, extract, merge, and synthesize information from multiple, disparate sources. Convergence of biology, computer science, and information technology (IT) will accelerate this multidisciplinary endeavor. The basic needs are:

1. On-demand access and retrieval of the most up-to-date biological data and the ability to perform complex queries across multiple heterogeneous databases to find the most relevant information
2. Access to the best-of-breed analytical tools and algorithms for extraction of useful information from the massive volume and diversity of biological data
3. A robust information integration infrastructure that connects various computational steps involving database queries, computational algorithms, and application software

This multidisciplinary approach demands close collaboration and clear understanding between people with extremely different domain knowledge and skill sets. The IT professionals provide the knowledge of syntactic aspects of data, databases, and algorithms, such as how to search, access, and retrieve relevant information, manage and maintain robust databases, develop information integration systems, model biological objects, and support a user-friendly graphical interface that allows the end user to view and analyze the data. The biologists provide knowledge of biological data, semantic aspects of databases, and scientific algorithms. Interpreting biological relationships requires an understanding of the biological meaning of the data beyond the physical file or table layout. Particularly, the effective usage of scientific algorithms or analytical tools (e.g., sequence alignment, protein structure prediction, and other analysis software) depends on having a working knowledge of the computer programs and of biochemistry, molecular biology, and other scientific disciplines. Before we can discuss biological information integration, we need first to consider the specific nature of biological data and data sources.
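As a rough illustration of the first and third needs listed above, the following sketch merges records from two sources that describe the same genes with different field names and identifier conventions, then runs one query over the merged view. Everything in it is hypothetical: the two in-memory sources, their field names, the shared schema, and the helper functions are assumptions made for illustration and do not correspond to any real database or to a system described in this book.

```python
# Two hypothetical sources with different schemas and identifier casing.
SEQUENCE_SOURCE = [
    {"gene_id": "G1", "organism": "E. coli", "seq_length": 1200},
    {"gene_id": "G2", "organism": "E. coli", "seq_length": 850},
]
EXPRESSION_SOURCE = [
    {"GeneID": "g1", "fold_change": 3.2},
    {"GeneID": "g3", "fold_change": 0.4},
]


def to_common(record, id_field, field_map):
    """Translate one source record into the shared (assumed) schema."""
    common = {"gene": record[id_field].upper()}  # normalize identifier casing
    for source_name, common_name in field_map.items():
        common[common_name] = record.get(source_name)
    return common


def integrate(sources):
    """Merge records from all sources, keyed by normalized gene identifier."""
    merged = {}
    for records, id_field, field_map in sources:
        for record in records:
            common = to_common(record, id_field, field_map)
            merged.setdefault(common["gene"], {}).update(common)
    return merged


def query(merged, predicate):
    """Return the sorted gene identifiers whose merged record satisfies predicate."""
    return sorted(gene for gene, rec in merged.items() if predicate(rec))


MERGED = integrate([
    (SEQUENCE_SOURCE, "gene_id",
     {"organism": "organism", "seq_length": "length"}),
    (EXPRESSION_SOURCE, "GeneID", {"fold_change": "fold_change"}),
])

# Genes longer than 1000 bases that are also at least two-fold up-regulated.
# Missing values are treated as 0 so incomplete records simply fail the test.
LONG_AND_INDUCED = query(
    MERGED,
    lambda rec: (rec.get("length") or 0) > 1000
    and (rec.get("fold_change") or 0) >= 2.0,
)

if __name__ == "__main__":
    print(LONG_AND_INDUCED)
```

The field mapping handles only the syntactic side of the problem. Deciding that `gene_id` and `GeneID` refer to the same biological entity, or that two length fields are measured in the same units, is the kind of semantic knowledge the text attributes to the biologists.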

2.3 THE NATURE OF BIOLOGICAL DATA


The advent of automated and high-throughput technologies in biological research and the progress in the genome projects have led to an ever-increasing rate of data acquisition and exponential growth of data volume. However, the most striking feature of data in life science is not its volume but its diversity and variability.

2.3.1 Diversity


The biological data sets are intrinsically complex and are organized in loose hierarchies that reflect our understanding of the complex living systems, ranging from genes and proteins, to protein-protein interactions, biochemical pathways and regulatory networks, to cells and tissues, organisms and populations, and finally the ecosystems on earth. This system spans many orders of magnitude in time and space and poses challenges in informatics, modeling, and simulation equivalent to or beyond any other scientific endeavor. A notional description of the vast scale of complexity, population, time, and space in the biological systems is given in Figure 2.2 [2].

FIGURE 2.2 Notional representation of the vast and complex biological world. [Plot: horizontal axis, time scale in seconds, extending to geologic and evolutionary timescales; vertical axis, numerical scale.]

Reflecting the complexity of biological systems, the types of biological data are highly diverse. They range from the plain text of laboratory records and literature publications, nucleic acid and protein sequences, three-dimensional atomic structures of molecules, and biomedical images with different levels of resolution, to various experimental outputs from technology as diverse as microarray chips, gels, light and electron microscopy, Nuclear Magnetic Resonance (NMR), and mass spectrometry. The horizontal abscissa in Figure 2.2 shows time scales ranging from femtoseconds to eons that represent the processes in living systems from chemical and biochemical reactions, to cellular events, to evolution. The vertical ordinate shows the numerical scale: the range of the number of atoms involved in molecular biology, the number of macromolecules in cellular biology, the number of cells in physiological biology, and the number of organisms in population biology. The third dimension, indicated by rectangles, illustrates the hierarchical nature of biology from subcellular structures to ecosystems. The fourth dimension, indicated by ovals, represents the current state of computational biology in modeling and simulation of biological systems.

2.3.2 Variability


Different individuals and species vary tremendously, so naturally biological data does also. For example, structure and function of organs vary across age and gender, in normal and different disease states, and across species. Essentially, all features of biology exhibit some degree of variability. Biological research is in an expanding phase, and many fields of biology are still in the developing stages. Data for these systems are incomplete and very often inconsistent. This presents a great challenge in modeling biological objects.

2.4 DATA SOURCES IN LIFE SCIENCE

In response to current advances in technology and research scope, massive amounts of data are routinely deposited in public and private databases. In parallel, there is a proliferation of computational algorithms and analysis tools for data analysis and visualization. Because most databases are accompanied by specific computational algorithms or tools for analysis and presentation and vice versa, we use the term data source to refer to a database or computational analysis tool or both. There are more than 1000 life science data sources scattered over the Internet (see the Biocatalog and the Public Catalog of Databases), and these data sources vary widely in scope and content. Finding the right data sources alone can be a challenge. Searching for relevant information largely relies on a Web information retrieval system or on published catalog services. Each January, the journal Nucleic Acids Research provides a yearly update of molecular biology database collections. The current issue lists 335 entries in molecular biology databases alone [3]. Various Web sites provide a catalog and links to biological data sources (see "biocat" and "dbcat" cited previously). In addition to the public sources, there are numerous private, proprietary data sources created by biotechnology or pharmaceutical companies.

The scope of the public data sources ranges from the comprehensive, multidisciplinary, community informatics center, supported by government public funds and sustained by teams of specialists, to small boutique data sources by individual investigators. The content of databases varies greatly, reflecting the broad disciplines and sub-disciplines across life sciences from molecular biology and cell biology, to medicine and clinical trials, to ecology and biodiversity. A sampling of various public biological databases is given in the Appendix.

2.4.1

Biological Databases Are Autonomous


Biological data sources represent a loose collection of autonomous Web sites, each with its own governing body and infrastructure. These sites vary in almost every respect, including computer platform, access method, and data management system. Much of the available biological data exist in legacy systems in which there are no structured information management systems. These data sources are inconsistent at the semantic level, and more often than not, there is no adequate attendant meta-data specification. Until recently, biological databases were not designed for interoperability [4].

2.4.2

Biological Databases Are Heterogeneous in Data Formats
Data elements in public or proprietary databases are stored in heterogeneous data formats ranging from simple files to fully structured database systems that are often ad hoc, application-specific, or vendor-specific. For example, scientific literature, images, and other free-text documents are commonly stored in unstructured or semi-structured formats (plain text files, HTML or XML files, binary files). Genomic, microarray gene expression, and proteomic data are routinely stored in conventional spreadsheet programs or in structured relational databases (Oracle, Sybase, DB2, Informix). Major data depository centers have implemented various data formats for operations; the National Center for Biotechnology Information (NCBI) has adopted the highly nested data system ASN.1 (Abstract Syntax Notation One) for the general storage of gene, protein, and genomic information [5]; the United States Department of Agriculture (USDA) Plant Genome Data and Information Center has adopted the object-oriented A C. elegans Data Base (AceDB) data management system and interface [6].

2.4.3

Biological Data Sources Are Dynamic
In response to the advance of biological research and technology, the overall features of biological data sources are subjected to continuous changes, including data content and data schema. New databases spring up at a rapid rate and older databases disappear.


2.4.4

Computational Analysis Tools Require Specific Input/Output Formats and Broad Domain Knowledge
Computational software packages often require specific input and output data formats and graphic display of results, which pose serious compatibility and interoperability issues. The output of one program is not readily suitable as direct input for the next program or for a subsequent database search. Development of a standard data exchange format such as XML will alleviate some of the interoperability issues.

Understanding application semantics and the proper usages of computer software is a major challenge. Currently, there are more than 500 software packages or analysis tools for molecular biology alone (reviewed in the Biocatalog at the European Bioinformatics Institute [EBI] Web site given previously). These programs are extremely diverse, ranging from nucleic and protein sequence analysis, genome comparison, protein structure prediction, biochemical pathway and genetic network analysis, and construction of phylogenetic trees, to modeling and simulation of biological systems and processes. These programs, developed to solve specific biological problems, rely on input from other domain knowledge such as computer science, applied mathematics, statistics, chemistry, and physics. For example, protein folding can be approached using ab initio prediction based on first principles (physics) or on knowledge-based (computer science) threading methods [7].

Many of these software packages, particularly those available through academic institutions, lack adequate documentation describing the algorithm, functionality, and constraints of the program. Given the multidisciplinary nature and the scope of domain knowledge, proper usage of a scientific analysis program requires significant (human) expertise. It is a daunting task for the end users to choose and evaluate the proper software programs for analyses, so they will be able to understand and interpret the results.
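The input/output mismatch described in this section can be illustrated with a small conversion step. The sketch below is hypothetical: it assumes one tool emits tab-delimited rows of the form id, score, sequence, and that the next tool expects FASTA-style records. Neither the row layout nor the function name `hits_to_fasta` comes from the chapter; they only show the kind of glue code that format-specific tools tend to require.

```python
def hits_to_fasta(hit_table: str, min_score: float = 0.0) -> str:
    """Convert tab-delimited 'id<TAB>score<TAB>sequence' rows to FASTA-style text.

    The input layout is an assumed example, not the format of any real tool.
    """
    records = []
    for line in hit_table.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        seq_id, score, sequence = line.split("\t")
        if float(score) >= min_score:
            # FASTA-style record: a '>' header line followed by the sequence
            records.append(f">{seq_id} score={score}\n{sequence}")
    return "\n".join(records)


if __name__ == "__main__":
    table = "# id\tscore\tseq\nP1\t0.9\tMKT\nP2\t0.2\tGAV\n"
    print(hits_to_fasta(table, min_score=0.5))
```

Every pair of tools with incompatible formats needs a converter of this kind, which is why a shared exchange format reduces the amount of glue code.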

2.5

CHALLENGES IN INFORMATION INTEGRATION

With the expansion of the biological data sources available across the World Wide Web, integration is a new, major challenge facing researchers and institutions that wish to explore these rich deposits of information. Data integration is an ongoing active area in the commercial world. However, information integration in biology must consider the characteristics of the biological data and data sources as discussed in the previous two sections (2.3 and 2.4): (1) diverse data are stored in autonomous data sources that are heterogeneous in data formats, data management systems, data schema, and semantics; (2) analysis of biological data requires both database query activities and proper usage of computational analysis tools; (3) a broad spectrum of knowledge domains divide traditional biological disciplines.

For a typical research project, a user must be able to merge data derived from multiple, diverse, heterogeneous sources freely and readily. As illustrated in Figure 2.3, the LIMS output from microarray gene expression experiments must be interpreted and analyzed in the context of the information and tools available across the Internet, including genomic data, literature, clinical data, analysis algorithms, etc. In many cases, data retrieved from several databases may be selected, filtered, and transformed to prepare input data sets for particular analytic algorithms or applications. The output of one program may be submitted as input to another program and/or to another database search. The integration process involves an intricate network of multiple computational steps and data flow. Information integration in biology faces challenges at the technology level for data integration architectures and at the semantic level for meta-data specification, maintenance of data provenance and accuracy, ontology development for knowledge sharing and reuse, and Web presentations for communication and collaboration.

FIGURE 2.3 Integration of experimental data, data derived from multiple database queries, and applications of scientific algorithms and computational analysis tools (refer to the Appendix for the definitions of acronyms). [Figure labels: "Challenge: Integrate Databases and Scientific Algorithms"; Genomics (GenBank, GDB); Literatures (MEDLINE, USPTO); Pharmacogenomics (dbSNP, LocusLink); Proteomics (PDB, KEGG); Information Integration.]


2.5.1

Data Integration


First-generation bioinformatics solutions for data integration employ a series of non-interoperable and non-scalable quick fixes to translate data from one format into another. This means writing programs, usually in a programming language such as Perl, to access, parse, extract, and transform necessary data for particular applications. Writing a translation program requires intensive coding efforts and knowledge of the data and structures of the source databases. These ad hoc point-to-point solutions are very inefficient and are not scalable to the large number of data sources to be integrated. This is dubbed the N² factor because it would require N(N-1)/2 programs to connect N data sources. If one particular data source changes formats, all of the programs involved with this data source must be upgraded. Upgrades are inevitable because changes in Web page services and schema are very common for biological data sources.

The second generation of data integration solutions provides a more structured environment for code re-use and flexible, scalable, robust integration. Over the past decade, enormous efforts and progress have been made in many data integration systems. They can be roughly divided into three major categories according to access and architectures: the data warehousing approach, the distributed or federated approach, and the mediator approach. However, the following fundamental functions or features are desirable for a robust data integration system:

1. Accessing and retrieving relevant data from a broad range of disparate data sources
2. Transforming the retrieved data into a designated data model for integration
3. Providing a rich common data model for abstracting retrieved data and presenting integrated data objects to the end user applications
4. Providing a high-level expressive language to compose complex queries across multiple data sources and to facilitate data manipulation, transformation, and integration tasks
5. Managing query optimization and other complex issues
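The N² factor discussed earlier in this section can be made concrete with a few lines of arithmetic. The following sketch counts point-to-point translators, assuming (as in the text) one program per unordered pair of sources, and compares that with a design in which each source has a single wrapper into a common data model. The function names and the example source list are illustrative assumptions.

```python
from itertools import combinations


def pairwise_translators(sources):
    """One translator per unordered pair of sources: N(N-1)/2 programs."""
    return list(combinations(sources, 2))


def affected_by_change(sources, changed):
    """Translators that must be upgraded when one source changes its format."""
    return [pair for pair in pairwise_translators(sources) if changed in pair]


def common_model_wrappers(sources):
    """Alternative design: one wrapper per source into a shared data model."""
    return [(source, "common_model") for source in sources]


if __name__ == "__main__":
    sources = ["GenBank", "PDB", "KEGG", "MEDLINE", "dbSNP"]
    print(len(pairwise_translators(sources)))           # N(N-1)/2 programs
    print(len(affected_by_change(sources, "GenBank")))  # N-1 programs to upgrade
    print(len(common_model_wrappers(sources)))          # N wrappers
```

With the pairwise design a single format change touches N-1 programs, whereas with one wrapper per source only that source's wrapper needs attention.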

The Data Warehouse Approach


The data warehouse approach assembles data sources into a centralized system with a global data schema and an indexing system for integration and navigation. The data warehouse world is dominated by relational database management systems (RDBMS), which offer the advantage of a mature and widely accepted database technology and a high-level standard query language (SQL) [8]. These systems have proven very successful in commercial enterprises, health care, and government sectors for resource management such as payroll, inventory, and records. They require reliable operation and maintenance, and the underlying databases are under a controlled environment, are fairly stable, and are structured.

The biological data sources are very different from those contained in the commercial databases. The biological data sources are much more dynamic and unpredictable, and few of the public biological data sources use structured database management systems. Given the sheer volume of data and the broad range of biological databases, it would require substantial effort to develop any monolithic data warehouses encompassing diverse biological information such as sequence and structure and the various functions of biochemical pathways and genetic polymorphisms. As the number of databases in a data warehouse grows, the cost of storage, maintenance, and updating data will be prohibitive.

A data warehouse has an advantage in that the data are readily accessed without Internet delay or bandwidth limitation in network connections. Vigorous data cleansing to remove potential errors, duplications, and semantic inconsistency can be performed before entering data in the warehouse. Thus, limited data warehouses are popular solutions in the life sciences for data mining of large databases, in which carefully prepared data sets are critical for success [9].
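As a minimal sketch of the warehouse idea, the following example cleanses records before loading them into a single global schema held in an in-memory SQLite database. The `gene` table, its columns, and the normalization rule (trimming whitespace and upper-casing symbols) are assumptions chosen for illustration; a real warehouse would need domain-specific cleansing rules.

```python
import sqlite3


def cleanse(records):
    """Drop incomplete records and duplicates; normalize symbols (toy rule)."""
    seen, clean = set(), []
    for rec in records:
        symbol = (rec.get("symbol") or "").strip().upper()
        organism = (rec.get("organism") or "").strip()
        if not symbol or not organism:
            continue  # potential error: required field missing
        key = (symbol, organism)
        if key in seen:
            continue  # duplication across or within sources
        seen.add(key)
        clean.append({"symbol": symbol, "organism": organism,
                      "source": rec.get("source", "unknown")})
    return clean


def load_warehouse(conn, records):
    """Load cleansed records into one global schema (hypothetical 'gene' table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene ("
        "symbol TEXT, organism TEXT, source TEXT, "
        "PRIMARY KEY (symbol, organism))"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (:symbol, :organism, :source)",
        cleanse(records),
    )
    conn.commit()


if __name__ == "__main__":
    raw = [
        {"symbol": "tp53 ", "organism": "Homo sapiens", "source": "A"},
        {"symbol": "TP53", "organism": "Homo sapiens", "source": "B"},
        {"symbol": "", "organism": "Homo sapiens", "source": "B"},
        {"symbol": "BRCA1", "organism": "Homo sapiens", "source": "B"},
    ]
    conn = sqlite3.connect(":memory:")
    load_warehouse(conn, raw)
    print(conn.execute("SELECT symbol, source FROM gene ORDER BY symbol").fetchall())
```

Because cleansing happens once at load time, later queries run against local, consistent data; the price is that the warehouse must be reloaded whenever a source changes.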
The Federation Approach

The distributed or federated integration approaches do not require a centralized persistent database, and thus the underlying data sources remain autonomous. The federated systems maintain a common data model and rely on schema mapping to translate heterogeneous source database schema into the target schema for integration. A data dictionary is used to manage various schema components. In the life science arena, in which schema changes in data sources are frequent, the maintenance of a common schema for integration could be costly in large federated systems. As the database technology progresses from relational toward object-oriented technology [10], many distributed integration solutions employ object-oriented paradigms to encapsulate the heterogeneity of underlying data sources in life science. These systems typically rely on client-server architectures and software platforms or interfaces such as Common Object Request Broker Architecture (CORBA), an open standard by the Object Management Group (OMG), to facilitate interoperation of disparate components [11, 12].
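Schema mapping through a data dictionary can be sketched as a lookup from each source's field names to the fields of the target schema. The source names, field names, and the `to_target_schema` helper below are hypothetical. The example also shows why frequent schema changes are costly: a field that is missing from the dictionary stops the translation until the mapping is updated.

```python
# Hypothetical data dictionary: source field name -> target schema field name.
DATA_DICTIONARY = {
    "source_a": {"acc": "accession", "desc": "description", "org": "organism"},
    "source_b": {"ID": "accession", "Definition": "description",
                 "Species": "organism"},
}


def to_target_schema(source_name, record):
    """Translate one source record into the common target schema."""
    mapping = DATA_DICTIONARY[source_name]
    unmapped = set(record) - set(mapping)
    if unmapped:
        # A schema change at the source surfaces here as a maintenance task.
        raise KeyError(f"{source_name}: no mapping for fields {sorted(unmapped)}")
    return {mapping[field]: value for field, value in record.items()}


if __name__ == "__main__":
    print(to_target_schema("source_a", {"acc": "X1", "org": "C. elegans"}))
    print(to_target_schema("source_b", {"ID": "X1", "Species": "C. elegans"}))
```

Both calls yield the same target record, so a query written against the target schema does not need to know which source supplied the data.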

The Mediator Approach


The most flexible data integration designs adopt a mediator approach that introduces an intermediate processing layer to decouple the underlying heterogeneous distributed data sources and the client layer of end users and applications.


The mediator layer is a collection of software components performing the task of data integration. The concept was first introduced by Wiederhold to provide flexible modular solutions for integration of large information systems with multiple knowledge domains [13, 14].

Most database mediator systems use a wrapper layer to handle the tasks of data access, data retrieval, and data translation. The wrappers access specified data sources, extract selected data, and translate source data formats into a common data model designated for the integration system. The mediator layer performs the core function of data transformation and integration and communicates with the wrappers and the user application layer.

The integration system provides an internal common data model for abstraction of incoming data derived from heterogeneous data sources. Thus, the internal data model must be sufficiently rich to accommodate various data formats of existing biological data sources, which may include unstructured text files, semi-structured XML and HTML files, and structured relational, object-oriented, and nested complex data models. In addition, the internal data model facilitates structuring integrated biological objects to present to the user application layer. The flat, tabular forms of the relational model encounter severe difficulty in modeling complex and hierarchical biological systems and concepts. XML and object-oriented models are more natural for modeling biological systems and are gaining popularity in the community.

In addition to the core integration function, the mediator layer also provides services such as filtering, managing meta-data, and resolving semantic inconsistency in source databases. Ideally, instead of relying on low-level programming efforts, a full integration system supports a high-level query language for data transformation and manipulation. This would greatly facilitate the composition of complex queries across multiple data sources and the management of architecture layers and software components.

The advantage of the mediator approach is its flexibility, scalability, and modularity. The heterogeneity and dynamic nature of the data sources is isolated from the end user applications. Wrappers can readily handle data source schema changes. New data sources can be added to the system by simply adding new wrappers. Scientific analytical tools are simply treated as data sources via wrappers and can be seamlessly integrated with database queries. This approach is most suitable for scientific investigations that need to access the most up-to-date data and issue queries against multiple heterogeneous data sources on demand.

There are many flavors of mediator approaches in life science domains, which differ in database technologies, implementations, internal data models, and query languages. The Kleisli system provides an internal, nested, complex data model and a high-power query and transformation language for data integration [15-17]. The K2 system shares many design principles with Kleisli in supporting a complex data model, but it adopts more object-oriented features [18, 19] (see Chapter 8). The Object-Protocol Model (OPM) supports a rich object model and a global schema for data integration [20, 21]. The IBM DiscoveryLink middleware system is rooted in the relational database technology and supports a full SQL3 [22, 23] (see Chapter 11). The Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) provides a global ontology to facilitate queries across multiple data sources [24, 25] (see Chapter 7). The Stanford-IBM Manager of Multiple Information Sources (TSIMMIS) is a mediation system for information integration with its own data model, the Object-Exchange Model (OEM), and query language [26].
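The division of labor between wrappers and the mediator layer can be sketched in a few classes. This is not how Kleisli, K2, or any of the systems named above is implemented; the class names, the tab-delimited and XML inputs, and the simple (gene, attribute, value) common model are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


class TabularWrapper:
    """Wraps a tab-delimited source with 'gene' and 'level' columns (assumed)."""

    def __init__(self, text):
        self.text = text

    def fetch(self, gene):
        reader = csv.DictReader(io.StringIO(self.text), delimiter="\t")
        return [
            {"gene": row["gene"], "attribute": "expression",
             "value": float(row["level"])}
            for row in reader if row["gene"] == gene
        ]


class XmlWrapper:
    """Wraps an XML source of <entry gene=... pathway=.../> elements (assumed)."""

    def __init__(self, text):
        self.text = text

    def fetch(self, gene):
        root = ET.fromstring(self.text)
        return [
            {"gene": gene, "attribute": "pathway", "value": entry.get("pathway")}
            for entry in root.iter("entry") if entry.get("gene") == gene
        ]


class Mediator:
    """Merges facts from all wrappers into one integrated object per gene."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def query(self, gene):
        merged = {"gene": gene}
        for wrapper in self.wrappers:
            for fact in wrapper.fetch(gene):
                merged.setdefault(fact["attribute"], []).append(fact["value"])
        return merged


if __name__ == "__main__":
    expression = "gene\tlevel\nTP53\t2.5\nBRCA1\t0.8\n"
    pathways = ('<entries><entry gene="TP53" pathway="apoptosis"/>'
                '<entry gene="TP53" pathway="cell cycle"/></entries>')
    mediator = Mediator([TabularWrapper(expression), XmlWrapper(pathways)])
    print(mediator.query("TP53"))
```

Adding a new source means writing one more class with a `fetch` method; the mediator and the client code stay unchanged, which is the modularity argument made in this section.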

2.5.2 Meta-Data Specification


Meta-data is data describing data, that is, data that provides documentation on other data managed within an application or environment. In a structured database environment, the meta-data are formally included in the data schema and type definition. However, few of the biomedical databases use commercial, structured database management systems. The majority of biological data are stored and managed in collections of flat files in which the structure and meaning of the data are not well documented. Furthermore, most biological data are presented to the end users as loosely structured Web pages, even for those databases that have underlying structured database management systems (DBMS). Many biological data sources provide keyword-search querying interfaces with which a user can input specified Boolean combinations of search terms to access the underlying data. Formulating effective Boolean queries requires domain expertise and knowledge of the contents and structure of the databases. Without meta-data specification, users are likely to formulate queries that return no answers or return an excessively large number of irrelevant answers. In such unstructured or semi-structured data access environments, the introduction of meta-data in the databases across the Web would be important for information gathering and to enhance the user's ability to capture the relevant information independent of data formats.

The need for adequate meta-data specification for scientific analytical algorithms and software tools is particularly acute. Very little attention has been given to meta-data specification in existing programs, especially those available in the public domain from academic institutions. In general, they lack adequate documentation on algorithms, data formats, functionality, and constraints. This could lead to potential misunderstanding of computational tools by the end users.
For example, sequence comparison programs are the most commonly used tools to search for similar sequences in databases. There are many such programs in the public and private domains. The Basic Local Alignment Search Tool (BLAST) uses heuristic approximation algorithms to search for related sequences against the databases [27]. BLAST has the advantage of speed in searching very large databases and is a widely used tool. It is often overused in the molecular biology community. The BLAST program trades sensitivity for speed and may not be the best choice for all purposes. The Smith-Waterman dynamic programming algorithm, which strives for optimal local sequence alignment, is more sensitive in finding distantly related sequences [28]. However, it requires substantial computational power and searches much more slowly (50-fold or more). Recently, a number of other programs have been developed using hidden Markov models, Bayesian statistics, and neural networks for pattern matching [29].
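The cost of exhaustive dynamic programming can be made concrete with a minimal sketch of Smith-Waterman scoring. The function name and the scoring values (match = 2, mismatch = -1, linear gap = -2) are illustrative assumptions rather than values taken from [28], and the sketch returns only the best score, not the alignment itself. Every cell of a len(a) by len(b) table must be filled, which is why the method is slow against large databases.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumes a simple linear gap penalty; scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the best score of a local alignment that ends
    # at a[i-1] and b[j-1]; row 0 and column 0 stay at zero.
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = table[i - 1][j - 1] + pair   # align a[i-1] with b[j-1]
            up = table[i - 1][j] + gap              # gap in b
            left = table[i][j - 1] + gap            # gap in a
            # Clamping at zero lets an alignment restart anywhere,
            # which is what makes the alignment local.
            table[i][j] = max(0, diagonal, up, left)
            best = max(best, table[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Under these assumed scores, two sequences that share only the stretch ACG score 6 (three matches), regardless of the unrelated flanking residues.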
In addition to algorithmic differences, these programs vary in accuracy, statistical scoring system, sensitivity, and performance. Without an adequate meta-data specification, it would be a challenge for users to choose the most appropriate program for their application, let alone to use the optimal parameters, to interpret the results properly, and to evaluate the statistical significance of the search results.

In summary, with the current proliferation of biological data sources over the Internet and new data sources constantly springing up around the world, there is an urgent need for better meta-data specification to enhance our ability to find relevant information across the Web, to understand the semantics of scientific application tools, and to integrate information. Ultimately, the communication and sharing of biological data will follow the concept and development of the Semantic Web [30].² The Resource Description Framework (RDF) schema developed for the Semantic Web offers a general model for meta-data applications such that data sources on the Web can be linked and be understood by both humans and computers.³
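The idea of linkable meta-data can be illustrated with a minimal sketch of subject-predicate-object statements, the general shape of data that RDF describes. This is plain Python rather than an RDF library, and the namespace, resource names, and predicate names below are all hypothetical; they are not part of any published RDF vocabulary.

```python
from collections import defaultdict

EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (EX + "db/seqdb", EX + "terms/format", "flat file"),
    (EX + "db/seqdb", EX + "terms/contains", EX + "record/R1"),
    (EX + "record/R1", EX + "terms/generatedBy", EX + "tool/aligner"),
    (EX + "tool/aligner", EX + "terms/algorithm", "heuristic local alignment"),
]


def describe(subject, statements):
    """Collect every predicate and its objects for one subject."""
    description = defaultdict(list)
    for s, p, o in statements:
        if s == subject:
            description[p].append(o)
    return dict(description)


def follow(subject, path, statements):
    """Follow a chain of predicates from subject; return the objects reached."""
    current = {subject}
    for predicate in path:
        current = {o for s, p, o in statements if s in current and p == predicate}
    return current


if __name__ == "__main__":
    path = [EX + "terms/contains", EX + "terms/generatedBy", EX + "terms/algorithm"]
    print(follow(EX + "db/seqdb", path, TRIPLES))
```

Because every resource is named by a URI, a program can walk from a database to one of its records, to the tool that produced the record, and to that tool's algorithm without knowing the storage format of any of them.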

2.5.3 Data Provenance and Data Accuracy

As databases move to the next stage of development, more and more secondary databases with value-added annotations will be developed. Many of the data providers will also become data consumers. Data provenance and data accuracy become major issues as the boundaries among primary data generated experimentally, data generated through application of scientific analysis programs, and data derived from database searches will be blurred. When users find and examine a set of data from a given database, they will have to be concerned about where the data came from and how the data were generated.

2. See also http://www.w3.org/2001/sw.
3. The RDF Schema is given and discussed at http://www.w3.org/RDF/overview.html and http://www.w3.org/DesignIssues/Semantic.html.

One example of this type of difficulty can be seen with the genome annotation pipeline. The raw experimental output of DNA sequences needs to be characterized and analyzed to be turned into useful information. This may involve the application of sequence comparison programs or a sequence similarity search against existing sequence databases to find similar sequences that have been studied in other species to infer functions. For genes/sequences with unknown function, gene prediction programs can be used to identify open reading frames, to translate DNA sequences into protein sequences, and to characterize promoter and regulatory sequence motifs.
For genes/sequences that are known, database searches may be performed to retrieve relevant information from other databases for protein structure and protein family classification, genetic polymorphism and disease, literature references, and so on. The annotation process involves computational filtering, transforming, and manipulating of data, and it frequently requires human efforts in correction and curation.

Thus, most curated databases contain data that have been processed with specific scientific analysis programs or extracted from other databases. Describing the provenance of some piece of data is a complex issue. These annotated databases offer rich information and have enormous value, yet they often fail to keep an adequate description of the provenance of the data they contain [31].

With increasingly annotated content, databases become interdependent. Errors caused by data acquisition and handling in one database can be propagated quickly into other databases, or data updated in one database may not be immediately propagated to the other related databases. At the same time, differences in annotations of the same object may arise in different databases because of the application of different scientific algorithms or different interpretations of results.

Scientific analysis programs are well known to be extremely sensitive to input datasets and the parameters used in computation. For example, a common practice in annotation of an unknown sequence is to infer that similar sequences share a common biochemical function or a common ancestor in evolution. The use of different algorithms and different cut-off values for similarity could potentially yield different results for remotely related sequences. Other forms of evidence are required to resolve the inconsistency. This type of biological reasoning also points to another problem.
Biological conclusions derived by inference in one database will be propagated and may no longer be reliable after numerous transitive assertions.

Data provenance touches the issue of data accuracy and reliability. It is critical that databases provide meta-data specification on how the data are generated and derived. This has to be as rigorous as the traditional standards for experimental data for which the experimental methods, conditions, and material are provided. Similarly, computationally generated data should be documented with the computational conditions involved, including algorithms, input datasets, parameters, constraints, and so on.
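One way to picture such documentation is a small record that travels with each computed annotation. The sketch below is only an assumption about what a record could hold; the class name, field names, and the fingerprint scheme are hypothetical and do not come from any existing database or standard.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputationalProvenance:
    """Hypothetical record of the conditions under which data were computed."""
    algorithm: str
    algorithm_version: str
    input_datasets: tuple      # dataset identifiers, stored in sorted order
    parameters: tuple          # (name, value) pairs, stored in sorted order
    constraints: tuple = ()

    def fingerprint(self):
        """Return a stable hash so two runs can be compared for equivalence."""
        payload = json.dumps([
            self.algorithm,
            self.algorithm_version,
            list(self.input_datasets),
            [list(pair) for pair in self.parameters],
            list(self.constraints),
        ])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_record(algorithm, version, inputs, parameters, constraints=()):
    """Normalize ordering so equivalent runs yield identical records."""
    return ComputationalProvenance(
        algorithm=algorithm,
        algorithm_version=version,
        input_datasets=tuple(sorted(inputs)),
        parameters=tuple(sorted(parameters.items())),
        constraints=tuple(constraints),
    )


if __name__ == "__main__":
    run = make_record("local-alignment", "1.0", ["dbA", "dbB"], {"gap": -2, "match": 2})
    print(run.fingerprint())
```

Comparing fingerprints gives a quick check of whether two annotations of the same object were produced under the same computational conditions, which is exactly the information needed when annotations in different databases disagree.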

2.5.4 Ontology


On top of the syntactic heterogeneity of data sources, one of the major stumbling blocks in information integration is at the semantic level. In naming and terminology alone, there are inconsistencies across different databases and within the same database. In the major literature database MEDLINE, multiple aliases for genes are the norm, rather than the exception. There are cases in which the same name refers to different genes that share no relationship with each other. Even the term gene itself has different meanings in different databases, largely because it has different meanings in various scientific disciplines; the geneticists, the molecular biologists, and the ecologists have different concepts at some levels about genes.

The naming confusion partly stems from the isolated, widely disseminated nature of life science research work. At the height of molecular cloning of genes in the 1980s and 1990s, research groups that cloned a new gene had the privilege of naming the gene. Very often, laboratories working on very different organisms or biological systems independently cloned genes that turned out to encode the same protein. Consequently, various names for the same gene populate the published scientific literature and databases. Biological scientists have grown accustomed to the naming differences. This becomes an ontology issue when information and knowledge are represented in electronic form because of the necessity of communication between humans and computers and between computer and computer. For the biological sciences community, the idea and the use of the term ontology are relatively new, and they generate controversy and confusion in discussions.
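The two naming problems described above (several names for one gene, and one name for several genes) can be sketched as a lookup from names to canonical symbols. The gene symbols and aliases below are placeholders invented for illustration, and folding names to lower case is a simplifying assumption that real nomenclatures do not always permit.

```python
from collections import defaultdict


def build_alias_index(records):
    """Map every known name (lower-cased) to the set of canonical symbols.

    records is an iterable of (canonical_symbol, [aliases]) pairs.
    """
    index = defaultdict(set)
    for symbol, aliases in records:
        for name in [symbol, *aliases]:
            index[name.lower()].add(symbol)
    return index


def resolve(name, index):
    """Return the canonical symbols for a name.

    An empty list means the name is unknown; more than one entry means
    the name is ambiguous and needs other evidence to be resolved.
    """
    return sorted(index.get(name.lower(), ()))


# Placeholder records: GENE1 and GENE2 were both given the alias "xyz1".
RECORDS = [
    ("GENE1", ["abc", "xyz1"]),
    ("GENE2", ["xyz1"]),
]

if __name__ == "__main__":
    index = build_alias_index(RECORDS)
    for query in ("abc", "XYZ1", "unknown"):
        print(query, resolve(query, index))
```

A human reader resolves the ambiguous case from context; a program cannot, which is why the same inconsistencies that scientists tolerate become an obstacle once the data are exchanged between computers.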

What Is an Ontology?

The term ontology was originally a philosophical term that referred to "the subject of existence." The computer science community borrowed the term ontology to refer to a "specification of a conceptualization" for knowledge sharing in artificial intelligence [32]. An ontology is defined as a description of concepts and relationships that exist among the concepts for a particular domain of knowledge. In the world of structured information and databases, ontologies in life science provide controlled vocabularies for terminology as well as specifying object classes, relations, and functions. Ontologies are essential for knowledge sharing and communications across diverse scientific disciplines.
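The definition above (concepts plus the relationships among them) can be sketched as a small labeled graph. The terms and the two relation names, is_a and part_of, form a toy vocabulary chosen for illustration; they are not drawn from any published ontology.

```python
# Each key is (term, relation); each value lists the directly related terms.
# Toy vocabulary, for illustration only.
RELATIONS = {
    ("protein kinase", "is_a"): ["kinase"],
    ("kinase", "is_a"): ["enzyme"],
    ("nucleolus", "part_of"): ["nucleus"],
    ("nucleus", "part_of"): ["cell"],
}


def ancestors(term, relation, relations=RELATIONS):
    """Return every term reachable from term by repeatedly following relation."""
    found = []
    pending = list(relations.get((term, relation), []))
    while pending:
        parent = pending.pop()
        if parent not in found:          # guard against cycles and repeats
            found.append(parent)
            pending.extend(relations.get((parent, relation), []))
    return found


if __name__ == "__main__":
    print(ancestors("protein kinase", "is_a"))
    print(ancestors("nucleolus", "part_of"))
```

With explicit relationships, a query for a general concept can also retrieve records annotated with its more specific descendants, which a flat keyword list cannot do.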


Throughout the history of the field, the biology community has made a continuous effort to strive for consensus in classifications and nomenclatures. The Linnaean system for naming of species and organisms in taxonomy is one of the oldest ontologies. The nomenclature committee for the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry and Molecular Biology (IUBMB) make recommendations on organic, biochemical, and molecular biology nomenclature, symbols, and terminology. The National Library of Medicine Medical Subject Headings (MeSH) provides the most comprehensive controlled vocabularies for biomedical literature and clinical records. The Systematized Nomenclature of Medicine International, a division of the College of American Pathologists, oversees the development and maintenance of a comprehensive and multi-axial controlled terminology for medicine and clinical information known as SNOMED.
Development of standards is and always has been complex and contentious because getting agreement has been a long and slow process. The computer and IT communities dealt with software standards long before the life science community. Recently, the Object Management Group (OMG), an established organization in the IT community, established a life sciences research group (LSR) to improve communication and interoperability among computational resources in life sciences.⁴ LSR uses the OMG technology adoption process to standardize models and interfaces for software tools, services, frameworks, and components in life sciences research.

Because of the field's longer history and its diverse scientific disciplines and constituents, developing standards in the life science community is harder than doing so in the information technology community. Besides the great breadth of academic and research communities in the life sciences, some fields of biology are a century or more older than molecular biology. Thus, the problems are sociological and technological. Standardization further requires a certain amount of stability and certainty in the knowledge content of the field. In contrast, the level, extent, and nature of biological knowledge are still extensively, even profoundly, dynamic in content. The meaning attached to a term may change over time as new facts are discovered that are related to that term. So far, the attempts to standardize the gene names alone have met a tremendous amount of resistance across different biological communities. The Gene Nomenclature Committee (HGNC) led by the Human Genome Organization (HUGO) made tremendous progress in standardizing gene names for humans with the support of the mammalian genetics community [33]. However, the attempt to expand the naming standard across other species turned out to be more difficult [34]. Researchers working in different organisms or fields have their own established naming usages, and it takes effort to convert to a new set of standards.

4. This is discussed on the OMG Web site: http://lsr.omg.org. OMG is an open-membership, not-for-profit consortium that produces and maintains computer industry specifications for interoperable enterprise applications.

An ontology is domain-knowledge specific and context dependent. For example, the term vector differs (not surprisingly or problematically) in meaning between its usage in biology and in the physical sciences, as in a mathematical vector. However, within biology, the specific meaning of a term also can be quite different: Molecular biologists use vector to mean a vehicle, as in cloning vector, whereas parasitologists use vector to refer to an organism as an agent in transmission of disease.
Thus, the development of ontologies is a community effort, and the adoption of a successful ontology must have wide endorsement and participation of the users. The ecological and biodiversity communities have made major efforts in developing meta-data standards, common taxonomy, and structural vocabulary for their Web site with the help of the National Science Foundation and other government agencies [35].⁵ The molecular biology community encompasses a much more diverse collection of sub-disciplines, and for researchers in the molecular biology domain, reaching a community-wide consensus is much harder. To circumvent these issues, there is a flurry of grassroots movements to develop ontologies in specific areas of research such as sequence analysis, gene expression, protein pathways, and so on [36].⁶ These group or consortium efforts usually adopt a use case and open source approach for community input.
The The ontologies are not meant to be mandatory, but instead they serve as a reference ontologies are not meant to be mandatory, but instead they serve as a reference framework forward for development. For framework to to go go forward for further further development. For example, example, one one of of the the major major efforts efforts in in molecular molecular biology biology is is the the Gene Gene Ontology Ontology (GO) (GO) consortium, consortium, which which stems stems from human genome. genome. Its from the the annotation annotation projects projects for for the the fly fly genome genome and and the the human Its goal goal is to design a set of structured, controlled vocabularies to describe genes and gene is to design a set of structured, controlled vocabularies to describe genes and gene products products in in organisms organisms [37] [37].. Currently, Currently, the the GO GO consortium consortium is is focused focused on on building building three three ontologies ontologies for for molecular molecular function, function, biological biological process, process, and and cellular cellular compo components, nents, respectively. respectively. These These ontologies ontologies will will greatly greatly facilitate facilitate queries queries across across genetic genetic and bases. The consortium started and genome genome data databases. The GO GO consortium started with with the the core core group group from from the the genome data bases for genome databases for the the fruit fruit fly, fly, FlyBase; FlyBase; budding budding yeast, yeast, Saccharomyces Saccharomyces Genome Genome Database Database (SGD); (SGD); and and mouse mouse genome genome database database (MGD). (MGD). It It is is gaining gaining momentum momentum with with growing growing participants participants from from other other genome genome databases. databases. With With such such a a grassroots grassroots

5. See also http://www.nbii.gov/disciplines/systematics.html, a general systematics site, and http://www.fgdc.gov, for geographic data.
6. See the work by the gene expression ontology working group at http://www.mged.org.


approach, interactions between different domain ontologies are critical in future development. For example, a brain ontology will inevitably relate to ontologies of other anatomical structures or, at the molecular level, will share ontologies for genes and proteins [38]. A sample collection of ontology resources in life science is listed in the Appendix.

A consistent vocabulary is critical in querying across multiple data sources. However, given the diverse domains of knowledge and specialization of scientific disciplines, it is not foreseeable that in the near future a global, common ontology covering broad biological disciplines will be developed. Instead, in biomedical research alone, there will be multiple ontologies for genomes, gene expression, proteomes, and so on. Semantic interoperability is an active area of research in computer science [39]. Information integration across multiple biological disciplines and sub-disciplines would depend on the close collaboration of domain experts and IT professionals to develop algorithms and flexible approaches to bridge the gaps among multiple biological ontologies.
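
The role of a shared vocabulary in cross-source queries can be illustrated with a minimal Python sketch. The two sources, their local terms, the gene identifiers, and the shared term identifiers (`TERM:0001` and so on) are all hypothetical and are not taken from GO or any real ontology. The sketch only shows why records annotated with different local words become comparable once each source maps its terms to a common controlled vocabulary.

```python
# Hypothetical local vocabularies of two autonomous data sources, each mapped
# to identifiers in a shared controlled vocabulary (all identifiers invented).
SOURCE_A_TO_SHARED = {"DNA-binding": "TERM:0001", "kinase": "TERM:0002"}
SOURCE_B_TO_SHARED = {"binds DNA": "TERM:0001",
                      "protein kinase activity": "TERM:0002"}

# Hypothetical annotation records: (gene identifier, local term).
SOURCE_A = [("geneA1", "DNA-binding"), ("geneA2", "kinase")]
SOURCE_B = [("geneB1", "binds DNA"), ("geneB2", "protein kinase activity")]


def genes_with_term(shared_id):
    """Return genes from both sources annotated with the given shared term."""
    hits = []
    for records, mapping in ((SOURCE_A, SOURCE_A_TO_SHARED),
                             (SOURCE_B, SOURCE_B_TO_SHARED)):
        for gene, local_term in records:
            # Local terms missing from the mapping are skipped, not guessed.
            if mapping.get(local_term) == shared_id:
                hits.append(gene)
    return hits


dna_binding_genes = genes_with_term("TERM:0001")
```

A plain string match on "DNA-binding" would find only the record from the first source, whereas the query through the shared identifier finds the matching records in both.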

2.5.5

Web Presentations


Much of the biological data is delivered to end users via the Web. Currently, the biological Web sites resemble a collection of rival medieval city-states, each with its own design, accession methods, query interface, services, and data presentation format [40]. Many of the data retrieval efforts in information integration rely on brittle screen-scraping methods to parse and extract data from HTML files. In an attempt to reduce redundancy and share efforts, an open source movement in the bioinformatics community has begun to share various scripts for parsing HTML files from popular data sources such as GenBank reports [3], Swiss-Prot reports [41], and so forth.

Recently, the biological IT community has been picking up momentum to adopt the emerging XML technology for biological Web services and for exchange of data. Many online databases already make their data available in XML format.⁷ Semi-structured XML supports user-defined tags to hold data, and thus an XML document contains both data and meta-data. The ability for data sources to exchange information in an XML document strictly depends on their sharing a special document known as a Document Type Definition (DTD), which defines the terms (names for tags) and their data types in the XML document [42]. Therefore, a DTD serves as a data schema and can be viewed as a very primitive ontology in which the DTD defines a set of terms, but not the relationships between terms. XML will

7. See the Distributed Annotation System, http://www.biodas.org, and the Protein Information Resource, http://nbrfa.georgetown.edu/pir/databases/pir_xml.

ease some of the incompatibility problems of data sources, such as data formats. However, semantic interoperability and consistency remain a serious challenge. With the autonomous nature of life science Web sites, one can envision that the naming space of DTDs alone could easily create an alphabet soup of confusing terminology, as encountered in the naming of genes. Recently, there has been a proliferation of XML-based markup languages to represent models of biological objects and to facilitate information exchange within specific research areas, such as the microarray and gene expression markup language,⁸ the systems biology markup language,⁹ and the bio-polymer markup language.¹⁰ Many of these are available through the XML open standard organization.¹¹ However, we caution that development of such documents must be compatible with existing biological ontologies or viewed as a concerted community effort.
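
The observation that an XML document carries both data and meta-data, and that exchange depends on agreed tag names, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real exchange format: the tag names, accession values, and the `read_gene` helper are invented, and the standard-library `xml.etree.ElementTree` parser used here does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# A hypothetical record: tag names (meta-data) travel with the values (data).
RECORD = """<gene>
  <accession>X00000</accession>
  <description>hypothetical protein</description>
  <alias source="SourceB">B12345</alias>
</gene>"""

# A second hypothetical source that uses a different tag for the same concept.
OTHER = "<gene><acc_no>X00000</acc_no></gene>"


def read_gene(xml_text):
    """Extract fields by tag name; this works only for agreed-upon tags."""
    root = ET.fromstring(xml_text)
    return {
        "accession": root.findtext("accession"),      # None if tag is absent
        "description": root.findtext("description"),
        "aliases": [(a.get("source"), a.text) for a in root.findall("alias")],
    }


gene = read_gene(RECORD)   # all fields are found
other = read_gene(OTHER)   # the accession is missed: the tag name differs
```

Both documents are well-formed XML, yet the reader recovers the accession only from the first. The syntax is shared, but the meaning of `accession` versus `acc_no` is not, which is the semantic gap the text describes.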


CONCLUSION


IT professionals and biologists have to work together to address the level of challenges presented by the inherent complexity and vast scales of time and space covered by the life sciences. The opportunities for biological science research in the 21st century require a robust, comprehensive information integration infrastructure underlying all aspects of research. As discussed in the previous sections, substantial progress has been made for data integration at the technical and architectural level. However, data integration at the semantic level remains a major challenge. Before we will be able to seize any of these opportunities, the biology and bioinformatics communities have to overcome the current limitations in meta-data specification, maintenance of data provenance and data quality, consistent semantics and ontology, and Web presentations. Ultimately, the life science community must embrace the concept of the Semantic Web [30] as a web of data that is understandable by both computers and people. The bio-ontology efforts for the life sciences represent one important step toward this goal. The brave, early efforts to build computational solutions for biological information integration are discussed in subsequent chapters of this book.

8. The MicroArray and Gene Expression (MAGE) markup language is being developed by the Microarray Gene Expression Data Society (see http://www.mged.org/Workgroups/mage.html).
9. The Systems Biology Workbench (SBW) is a modular framework designed to facilitate data exchange by enabling different tools to interact with each other (see http://www.cds.caltech.edu/erato).
10. The Biopolymer Markup Language (BioML) is an XML encoding schema for the annotation of protein and nucleic acid sequence (see http://www.bioml.com).
11. OASIS is an international, not-for-profit consortium that designs and develops industry standard specifications for interoperability based on XML.


REFERENCES
[1] E. Pennisi. "Genome Data Shake Tree of Life." Science 280, no. 5364 (1998): 672-674.
[2] J. C. Wooley. "Trends in Computational Biology." Journal of Computational Biology 6, no. 3/4 (1999): 459-474.
[3] A. D. Baxevanis. "The Molecular Biology Database Collection: 2002 Update." Nucleic Acids Research 30, no. 1 (2002): 1-12.
[4] P. D. Karp. "Database Links are a Foundation for Interoperability." Trends in Biotechnology 14, no. 7 (1996): 273-279.
[5] D. L. Wheeler, D. M. Church, A. E. Lash, et al. "Database Resources of the National Center of Biotechnology Information: 2002 Update." Nucleic Acids Research 30, no. 1 (2002): 13-16.
[6] J. Thierry-Mieg and R. Durbin. "Syntactic Definitions for the ACeDB Data Base Manager." AceDB: A C. elegans Database, 1992, http://genome.cornell.edu/acedocs/syntax.html.
[7] T. Head-Gordon and J. C. Wooley. "Computational Challenges in Structural Genomics." IBM Systems Journal 40, no. 2 (2001): 265-296.
[8] J. D. Ullman and J. Widom. A First Course in Database Systems. Upper Saddle River, NJ: Prentice Hall, 1997.
[9] R. Resnick. "Simplified Data Mining." Drug Discovery and Development October (2000): 51-52.
[10] R. G. G. Cattell. Object Data Management: Object-Oriented and Extended Relational Database Systems, revised ed. Reading, MA: Addison-Wesley, 1994.
[11] K. Jungfer, G. Cameron, and T. Flores. "EBI: CORBA and the EBI Databases." In Bioinformatics: Databases and Systems, edited by S. Letovsky, 245-254. Norwell, MA: Kluwer Academic Publishers, 1999.
[12] A. C. Siepel, A. N. Tolopko, A. D. Farmer, et al. "An Integration Platform for Heterogeneous Bioinformatics Software Components." IBM Systems Journal 40, no. 2 (2001): 570-591.
[13] G. Wiederhold. "Mediators in the Architecture of Future Information Systems." IEEE Computer 25, no. 3 (1992): 38-49.
[14] G. Wiederhold and M. Genesereth. "The Conceptual Basis for Mediation Services." IEEE Expert, Intelligent Systems and Their Applications 12, no. 5 (1997): 38-47.
[15] S. Davidson, C. Overton, V. Tannen, et al. "BioKleisli: A Digital Library for Biomedical Researchers." International Journal of Digital Libraries 1, no. 1 (1997): 36-53.

[16] L. Wong. "Kleisli, A Functional Query System." Journal of Functional Programming 10, no. 1 (2000): 19-56.
[17] S. Y. Chung and L. Wong. "Kleisli: A New Tool for Data Integration in Biology." Trends in Biotechnology 17 (1999): 351-355.
[18] J. Crabtree, S. Harker, and V. Tannen. "The Information Integration System K2," 1998, http://db.cis.upenn.edu/K2/K2.doc.
[19] S. B. Davidson, J. Crabtree, B. P. Brunk, et al. "K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources." IBM Systems Journal 40, no. 2 (2001): 512-531.
[20] I-M. A. Chen and V. M. Markowitz. "An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools." Information Systems 20, no. 5 (1995): 393-418.
[21] I-M. A. Chen, A. S. Kosky, V. M. Markowitz, et al. "Constructing and Maintaining Scientific Database Views in the Framework of the Object-Protocol Model." In Proceedings of the Ninth International Conference on Scientific and Statistical Database Management, 237-248. New York: IEEE, 1997.
[22] L. M. Haas, P. M. Schwartz, P. Kodali, et al. "DiscoveryLink: A System for Integrated Access to Life Science Data Sources." IBM Systems Journal 40, no. 2 (2001): 489-511.
[23] L. M. Haas, R. J. Miller, B. Niswonger, et al. "Transforming Heterogeneous Data With Database Middleware: Beyond Integration." IEEE Data Engineering Bulletin 22, no. 1 (1999): 31-36.
[24] N. W. Paton, R. Stevens, P. Baker, et al. "Query Processing in the TAMBIS Bioinformatics Source Integration System." In Proceedings of the 11th International Conference on Scientific and Statistical Database Management, 138-147. New York: IEEE, 1999.
[25] R. Stevens, P. Baker, S. Bechhofer, et al. "TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources." Bioinformatics 16, no. 2 (2000): 184-186.
[26] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. "Object Exchange Across Heterogeneous Information Sources." In Proceedings of the IEEE Conference on Data Engineering, 251-260. New York: IEEE, 1995.
[27] S. F. Altschul, W. Gish, W. Miller, et al. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215, no. 3 (1990): 403-410.
[28] T. F. Smith and M. S. Waterman. "Identification of the Common Molecular Subsequences." Journal of Molecular Biology 147, no. 1 (1981): 195-197.
[29] D. W. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2001.


[30] T. Berners-Lee, J. Hendler, and O. Lassila. "The Semantic Web." Scientific American 278, no. 5 (May 2001): 35-43.
[31] P. Buneman, S. Khanna, and W-C. Tan. "Why and Where: A Characterization of Data Provenance." In J. Vander Bussche and V. Vianu. Proceedings of the Eighth International Conference on Database Theory (ICDT), 316-330. Heidelberg, Germany: Springer-Verlag, 2001.
[32] T. R. Gruber. "A Translation Approach to Portable Ontology Specification." Knowledge Acquisition 5, no. 2 (1993): 199-220.
[33] H. M. Wain, M. Lush, F. Ducluzeau, et al. "Genew: The Human Gene Nomenclature Database." Nucleic Acids Research 30, no. 1 (2002): 169-171.
[34] H. Pearson. "Biology's Name Game." Nature 411, no. 6838 (2001): 631-632.
[35] J. L. Edwards, M. A. Lane, and E. S. Nielsen. "Interoperability of Biodiversity Databases: Biodiversity Information on Every Desk." Science 289, no. 5488 (2000): 2312-2314.
[36] D. E. Oliver, D. L. Rubin, J. M. Stuart, et al. "Ontology Development for a Pharmacogenetics Knowledge Base." In Pacific Symposium on Biocomputing, 65-76. Singapore: World Scientific, 2002.
[37] M. Ashburner, C. A. Ball, J. A. Blake, et al. "Gene Ontology: Tool for the Unification of Biology." Nature Genetics 25, no. 1 (2000): 25-29.
[38] A. Gupta, B. Ludäscher, M. E. Martone. "Knowledge-Based Integration of Neuroscience Data Source." In Proceedings of the 12th International Conference on Scientific and Statistical Database Management (SSDBM), 39-52. New York: IEEE, 2000.
[39] P. Mitra, G. Wiederhold, and M. Kersten. "A Graph-Oriented Model for Articulation of Ontology Interdependencies." In Proceedings of the Conference on Extending Database Technology (EDBT), 86-100. Heidelberg, Germany: Springer-Verlag, 2000.
[40] L. Stein. "Creating a Bioinformatics Nation." Nature 417, no. 6885 (2002): 119-120.
[41] A. Bairoch and R. Apweiler. "The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000." Nucleic Acids Research 28, no. 1 (2000): 45-48.
[42] E. T. Ray. Learning XML: Guide to Creating Self-Describing Data. San Jose, CA: O'Reilly, 2001.

CHAPTER

3

A Practitioner's Guide to Data Management and Data Integration in Bioinformatics
Barbara A. Eckman

3.1

INTRODUCTION
Integration of a large and widely diverse set of data sources and analytical methods is needed to carry out bioinformatics investigations such as identifying and characterizing regions of functional interest in genomic sequence, inferring biological networks, and identifying patient sub-populations with specific beneficial or toxic reactions to therapeutic agents. A variety of integration tools are available, both in the academic and the commercial sectors, each with its own particular strengths and weaknesses. Choosing the right tools for the task is critical to the success of any data integration endeavor. But the wide variety of available data sources, integration approaches, and vendors makes it difficult for users to think clearly about their needs and to identify the best means of satisfying them. This chapter introduces use cases for biological data integration and translates them into technical challenges. It introduces terminology and provides an overview of the landscape of integration solutions, including many that are detailed in other chapters of this book, along with a means of categorizing and understanding individual approaches and their strengths and weaknesses.

This chapter is written from the point of view of a bioinformatician practicing database integration, with the hope that it will be useful for a wide variety of readers, from biologists who are unfamiliar with database concepts to more computationally experienced bioinformaticians. A basic familiarity with common biological data sources and analysis algorithms is assumed throughout.

The chapter is organized as follows. Section 3.2 introduces traditional database terms and concepts. Those already familiar with these concepts may want to


skim that section and begin reading at Section 3.3, which introduces multiple dimensions to integration and the intermediate terminology that goes with them. Section 3.4 presents various use cases for integration solutions. Strengths and weaknesses of integration approaches are given in Section 3.5. Section 3.6 is devoted to tough integration problems; computer scientists and information technologists may therefore benefit from the advanced problems raised there.

The goal of this chapter is to convey a basic understanding of the variety of data management problems and needs in bioinformatics; an understanding of the variety of integration strategies currently available, and their strengths and weaknesses; an appreciation of some difficult challenges in the integration field; and the ability to evaluate existing or new integration approaches according to six general categories or dimensions. Armed with this knowledge, practitioners will be well prepared to identify the tools that are best suited to meet their individual needs.

3.2

DATA MANAGEMENT IN BIOINFORMATICS


Data is arguably the most important commodity in science, and its management is of critical importance in bioinformatics. One introductory textbook defines bioinformatics as "the science of creating and managing biological databases to keep track of, and eventually simulate, the complexity of living organisms" [1]. If the central task of bioinformatics is the computational analysis of biological sequences, structures, and relationships, it is crucial that biological sequence and all associated data be accurately captured, annotated, and maintained, even in the face of rapid growth and frequent updates. It is also critical to be able to retrieve data of interest in a timely manner and to define and retrieve data of interest precisely enough to separate its signal effectively from the distracting noise of irrelevant or insignificant data.

3.2.1

Data Management Basics


To begin the discussion of data management in bioinformatics, basic terms and concepts will be introduced by means of use cases, that is, examples or scenarios of familiar data management activities. The term database will be used both as "a collection of data managed by a database management system" (DBMS) and, more generally, when concepts of data representation are presented, regardless of how the data is managed or stored. Otherwise, the term data collection or data source will be used for collections of data not managed by a DBMS. For a more detailed explanation of basic data management than is possible in this chapter, see Ullman and Widom's A First Course in Database Systems [2].


Use Case: A Simple Curated Gene Data Source

Consider a a simple simple collection collection of of data data about about known known and and predicted predicted human human genes genes in in Consider a chromosomal chromosomal region region that that has has been been identified identified as as likely likely to to be be related related to to a a genetic genetic a predisposition for for a a disease disease under under investigation. investigation. The The properties properties stored stored for for each each predisposition gene are are as as follows: follows: gene GenBank accession accession number number (accnum) (accnum) [3] [3] 9 GenBank Aliases in in other other data data sources sources (e.g., (e.g., Swiss-Prot Swiss-Prot accession accession number) number) [4] [4] 9 Aliases Description of of the gene 9 Description the gene Chromosomal location location 9 Chromosomal Protein families families database database (Pfam) (pfam) classification classification [5] [5] 9 Protein Coding sequence sequence (CDS) (CDS) 9 Coding Peptide sequence sequence 9 Peptide Gene Ontology Ontology (GO) (GO) annotation annotation [6] [6] 9 Gene Has expression expression results? results? (Are there expression expression results results for for this gene?) 9 Has (Are there this gene?) Single Nucleotide Nucleotide Polymorphisms Polymorphisms (SNPs) ? (Are (Are there there known 9 Has Has Single (SNPs)? known SNPs SNPs for for this gene?) gene? ) this 9 Date Date gene gene was was entered entered 9 Date Date gene gene entry entry was was last last modified modified The The complement complement of of properties properties stored stored in in a a database, database, along along with with the the relationships relationships among Individual properties, properties, GenBank GenBank among them, them, is is called called the the database's database's schema. Individual acc ess in on number, acce ss io number, are are attributes. 
Attributes can be single-valued, like peptide sequence, or multi-valued, like aliases. Attributes can be atomic, like peptide sequence, which is a simple character string, or nested, like aliases, which themselves have structure (data source + identifier).

Data accuracy is critically important in scientific data management. Single attributes or groups of attributes must satisfy certain rules or constraints for the data to be valid and useful. When entering data into the database, or populating it, care must be taken to ensure that these constraints are met. Examples of constraints in the simple gene data source are:

• The chromosomal location of each gene must lie within the original region of interest.
• The CDS and peptide sequences must contain only valid nucleotide and amino acid symbols, respectively.


3 A Practitioner's Guide to Data Management

• The CDS sequence must have no internal in-frame stop codons (which would terminate translation prematurely).
• The peptide sequence must be a valid translation of the CDS sequence.
• The Pfam classification must be a valid identifier in the Pfam data source.

This simple gene data collection is subject to continual curation, in which new data is inserted, old data is updated, and erroneous data is deleted. A user might make changes to existing entries as more information becomes known, such as more accurate sequence or exon boundaries of a predicted gene, refined GO classification, SNPs discovered, or expression results obtained. A user might also make changes to the source's schema, such as adding new attributes like mouse orthologues, or links to LocusLink, RefSeq [7], or KEGG pathways [8]. New linkage studies may result in a widening or narrowing of the chromosomal region of interest, requiring a re-evaluation of which genes are valid members of the collection and the addition or deletion of genes. Finally, multiple curators may be working on the data collection simultaneously.
Care must be taken that an individual curator's changes are completed before a second curator's changes are applied, lest inconsistencies result (e.g., if one curator changes the CDS sequence and the other changes the peptide sequence so they are no longer in the correct translation relationship to one another). The requirement for correctly handling multiple, simultaneous curators' activities is called multi-user concurrency.

Databases are only useful, of course, if data of interest can be retrieved from them when needed. In a small database, a user might simply need to retrieve all the attributes at once in a report. More often, however, users wish to retrieve subsets of a database by specifying conditions, or search predicates, that the data retrieved should meet.
Examples of queries from the curated gene data collection described previously are: "Retrieve the gene whose GenBank accession number is AA123456"; "retrieve only genes that have expression results"; "retrieve only genes that contain in their description the words 'serotonin receptor.'" Search predicates may be combined using logical AND and OR operators to produce more complex conditions; for example, in the query "Retrieve genes which were entered since 09/01/2002 and lie in a specified sub-region of the chromosomal region of interest," the conjunction of the two search predicates will be expressed by an AND operator.
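The example queries above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the design of any particular gene database: the table name, the column names, the sample rows, and the coordinates of the sub-region are all hypothetical, and dates are stored as ISO 8601 strings so that they compare correctly as text.

```python
import sqlite3

# Hypothetical, simplified gene table holding only the columns needed
# for the example queries. Names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE gene (
        accnum          TEXT PRIMARY KEY,  -- GenBank accession number
        description     TEXT,
        chrom_start     INTEGER,           -- start coordinate on the chromosome
        has_expression  INTEGER,           -- 1 = expression results exist
        date_entered    TEXT               -- ISO 8601 date, e.g. 2002-09-15
    )
    """
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?)",
    [
        ("AA123456", "serotonin receptor 2A", 1_200_000, 1, "2002-10-03"),
        ("AA222222", "protein kinase C", 3_500_000, 0, "2002-08-20"),
        ("AA333333", "putative serotonin receptor", 5_100_000, 1, "2002-09-15"),
    ],
)


def accnums(sql, params=()):
    """Run a query and return the matching accession numbers, sorted."""
    return sorted(row[0] for row in conn.execute(sql, params))


# Single search predicates.
by_accnum = accnums("SELECT accnum FROM gene WHERE accnum = ?", ("AA123456",))
with_expression = accnums("SELECT accnum FROM gene WHERE has_expression = 1")
serotonin = accnums(
    "SELECT accnum FROM gene WHERE description LIKE ?",
    ("%serotonin receptor%",),
)

# Two predicates combined with AND: entered since 2002-09-01 and located
# in a (made-up) sub-region of the chromosomal region of interest.
recent_in_subregion = accnums(
    "SELECT accnum FROM gene "
    "WHERE date_entered >= ? AND chrom_start BETWEEN ? AND ?",
    ("2002-09-01", 1_000_000, 2_000_000),
)
```

Each WHERE clause is one search predicate; the last query joins two of them with AND, so a gene is returned only if it satisfies both.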
Use Case: Retrieving Genes and Associated Expression Results

Along with the simple curated gene data collection, a user may wish to view expression data on the genes that have been gathered through microarray experiments. For example, an expression data source might permit the retrieval of genes that show a two-fold or greater difference in expression intensities between ribonucleic acid (RNA) isolated from normal and diseased tissues. To retrieve all genes with known SNPs with at least two-fold differential expression between normal and diseased tissue, search predicates would need to be applied to each of the two data sources. The result would be the genes that satisfy both of the conditions.

There are many different but equivalent methods of retrieving the genes that satisfy both of these predicates, and an important task of a database system is to identify and execute the most efficient of these alternate methods. For example, the system could first find all genes with SNPs from among the curated genes, and then check the expression values for each of them one by one in the expression data source.
Alternatively, the system could find all genes in the expression data source with two-fold expression in normal versus diseased tissue, then find all genes in the curated data source that have SNPs, and finally merge the two lists, retaining only the genes that appear in both lists. Typically, methods differ significantly in their speed due to such factors as the varying speeds of the two databases, the volume of data retrieved, the specificity of some predicates, the lack of specificity of others, and the order in which predicates are satisfied. They may also differ in their usage of computer system resources such as central processing unit (CPU) or disk. Depending on individual needs, the execution cost may be defined either as execution time or resource usage (see Chapter 13). The process of estimating costs of various alternative data retrieval strategies and identifying the lowest one among them is known as cost-based query optimization.
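The two retrieval strategies described above can be compared in a small Python sketch. The data, the function names, and the cost measures (number of per-gene lookups versus number of rows fetched) are assumptions made for illustration. A real optimizer estimates costs in advance from statistics about the data rather than running every plan, as this toy version does.

```python
# Hypothetical in-memory stand-ins for the two data sources.
curated = {            # accession number -> has known SNPs?
    "AA000001": True,
    "AA000002": False,
    "AA000003": True,
    "AA000004": True,
}
expression = {         # accession number -> (normal, diseased) intensity
    "AA000001": (100.0, 250.0),
    "AA000002": (80.0, 400.0),
    "AA000003": (120.0, 130.0),
    "AA000004": (300.0, 150.0),
}


def is_two_fold(normal, diseased):
    """True if the intensities differ by a factor of two or more, either way."""
    low, high = sorted((normal, diseased))
    return low > 0 and high / low >= 2.0


def plan_filter_then_probe():
    """Find genes with SNPs first, then probe the expression source per gene."""
    probes = 0
    hits = set()
    for accnum, has_snps in curated.items():
        if has_snps:
            probes += 1                      # one lookup in the other source
            if is_two_fold(*expression[accnum]):
                hits.add(accnum)
    return hits, probes


def plan_filter_both_then_merge():
    """Filter each source independently, then intersect the two lists."""
    with_snps = {a for a, has_snps in curated.items() if has_snps}
    two_fold = {a for a, pair in expression.items() if is_two_fold(*pair)}
    rows_fetched = len(with_snps) + len(two_fold)
    return with_snps & two_fold, rows_fetched


hits_a, cost_a = plan_filter_then_probe()
hits_b, cost_b = plan_filter_both_then_merge()

# Equivalent plans must return the same genes; only their cost differs.
assert hits_a == hits_b

# Pick the cheaper plan under this toy cost measure.
best_plan = min(
    [("filter-then-probe", cost_a), ("filter-both-then-merge", cost_b)],
    key=lambda plan: plan[1],
)
```

On this sample data the first plan is cheaper because only three genes have SNPs. With a different distribution, for example very few genes showing two-fold expression, the ranking could reverse, which is why the choice is made per query.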

3.2.2 Two Popular Data Management Strategies and Their Limitations
Two approaches that have commonly been used to manage and distribute data in bioinformatics are spreadsheets and semi-structured text files.
Spreadsheets

Spreadsheets are easy to use and handy for individual researchers to browse their data quickly, perform simple arithmetic operations, and distribute them to collaborators. The cell-based organization of a spreadsheet enables the structuring of data into separate items, by which the spreadsheet may then be sorted. The Microsoft Excel spreadsheet software [9] provides handy data entry features for replicating values in multiple cells, populating a sequence of rows with a sequence of integer identifiers, and entering values into a cell that have appeared in the same column previously.

A disadvantage of spreadsheets, at least as they are typically used, is that very little data validation is performed when data is entered. It is certainly possible, by programming in Microsoft Visual Basic or using advanced Excel features, to perform constraint checking such as verifying that data values have been taken from an approved list of values or controlled vocabulary, that numeric data fall in the correct range, or that a specific cell has not been left blank; but in practice this is not often done. Furthermore, while advanced features exist to address this problem, in practice spreadsheets typically include a great deal of repeated or redundant data. For example, a spreadsheet of gene expression data might include the following information, repeated for each tissue sample against which the gene was tested: GenBank accession number, gene name, gene description, LocusLink Locus ID, and UniGene Cluster ID. If an error should be found in any of these redundant fields, the change would have to be made in each row corresponding to the gene in question. If the change is not made in all relevant rows, an inconsistency arises in the data. In database circles, this inconsistency caused by unnecessary data duplication is called an update anomaly.

Another problem with spreadsheets is they are fundamentally single-user data sources. Only one user may enter data into a spreadsheet at a time. If multiple users must contribute data to a data source housed in a spreadsheet, a single curator must be designated. If multiple copies of a spreadsheet have been distributed, and each has been edited and added to by a different curator, it will be a substantial task to harmonize disagreements among the versions when a single canonical version is desired. The spreadsheet itself offers no help in this matter.

Finally, search methods over data stored in spreadsheets are limited to simple text searches over the entire spreadsheet; complex combinations of search conditions, such as "return serotonin receptors that have SNPs but do not have gene expression results" are not permitted. Additional limitations of text searches are presented in the next section.
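The update anomaly described in this section can be reproduced in a few lines of Python. The rows, gene names, and intensity values below are invented for illustration. The sketch first stores gene details redundantly, as a spreadsheet typically would, and then stores them once, keyed by accession number.

```python
# Hypothetical spreadsheet-style rows: gene details repeated per tissue sample.
flat_rows = [
    {"accnum": "AA123456", "gene_name": "geneA", "tissue": "brain", "intensity": 812.0},
    {"accnum": "AA123456", "gene_name": "geneA", "tissue": "liver", "intensity": 95.0},
    {"accnum": "AA123456", "gene_name": "geneA", "tissue": "kidney", "intensity": 130.0},
]

# Correcting the gene name in only one row leaves the copies disagreeing:
# this is the update anomaly.
flat_rows[0]["gene_name"] = "geneA-corrected"
names_in_sheet = {
    row["gene_name"] for row in flat_rows if row["accnum"] == "AA123456"
}
anomaly = len(names_in_sheet) > 1

# Alternative layout: gene details stored once, keyed by accession number;
# expression rows refer to the gene by that key only.
genes = {"AA123456": {"gene_name": "geneA"}}
expression = [
    {"accnum": "AA123456", "tissue": "brain", "intensity": 812.0},
    {"accnum": "AA123456", "tissue": "liver", "intensity": 95.0},
    {"accnum": "AA123456", "tissue": "kidney", "intensity": 130.0},
]

# The same correction is now a single change in a single place.
genes["AA123456"]["gene_name"] = "geneA-corrected"
names_after_fix = {genes[row["accnum"]]["gene_name"] for row in expression}
```

In the redundant layout the sheet ends up holding two different names for the same gene; in the second layout every expression row sees the corrected name because the name exists only once.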
Semi-Structured Text Files

Semi-structured text files, that is, text files containing a more or less regular series of labels and associated values, have data management limitations similar to spreadsheets. A prominent example is the GenBank sequence annotation flat file format [10]. It should be noted that the National Center for Biotechnology Information (NCBI) does not store its data in flat file format; rather, the GenBank flat file format is simply a report format based on the structured ASN.1 data representation [11].

An advantage of the semi-structured text format is that it permits more complex, hierarchical (tree-like) structures to be represented. A sequence has multiple references, each of which has multiple authors. Text files are also perhaps the most portable of formats: anyone with a text editor program can view and edit them (unless the file size exceeds the limit of the editor's capability). However, most text editors provide no data validation features. Like spreadsheets, they are not oriented toward use by multiple concurrent users and provide little help in merging or harmonizing multiple copies that have diverged from an original canonical version.

Without writing an indexing program, searching a text file is very inefficient because the entire file must be read sequentially, looking for a match to the user's input. Further, it is impossible to specify which part of the flat file entry is to be matched. If a user wants to find mammalian sequences, there is no way to limit the search to the organism section of the file to speed the search. As with spreadsheets, full-text searches over text files do not support complex combinations of search conditions. Full-text searches may also result in incorrect data retrieval. For example, consider a flat-file textual data source of human genes and their mouse orthologues, both of which have chromosomal locations.
Suppose the user wants to "find all human genes related to mouse orthologues on mouse chromosome 10"; simple text-searching permits no way of specifying that the match to chromosome 10 should refer to the mouse gene and not the human gene. Finally, text editors provide no easy means of retrieving associated data from two related text data sources at once, for example, a GenBank entry and its associated Swiss-Prot entry.

More sophisticated search capability over semi-structured, text-formatted data sources is provided by systems like LION Biosciences' Sequence Retrieval System (SRS) [12] (presented in Chapter 5); however, such read-only indexing systems do not provide tools for data validation during curation or solve the multi-user concurrency problem, and they have limited power to compensate for data irregularities in the underlying text files.
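The chromosome 10 example can be made concrete with a short Python sketch. The record layout and field labels below are invented for illustration and do not follow any real flat-file format. The sketch contrasts a search over the raw text of each record with a search restricted to one labelled field.

```python
# Hypothetical flat-file records; labels and layout are assumptions made
# for this illustration only.
FLAT_FILE = """\
HUMAN_GENE   geneA
HUMAN_CHROM  10
MOUSE_GENE   GeneA
MOUSE_CHROM  7
//
HUMAN_GENE   geneB
HUMAN_CHROM  3
MOUSE_GENE   GeneB
MOUSE_CHROM  10
//
"""


def parse_records(text):
    """Split the file on '//' and return (raw_text, label -> value) pairs."""
    records = []
    for chunk in text.split("//"):
        lines = chunk.strip().splitlines()
        fields = dict(line.split(None, 1) for line in lines)
        if fields:
            records.append((chunk, fields))
    return records


records = parse_records(FLAT_FILE)

# Full-text search: "10" anywhere in the record counts as a match, so a
# human gene on human chromosome 10 is returned by mistake.
full_text_hits = [f["HUMAN_GENE"] for raw, f in records if "10" in raw]

# Field-aware search: only the mouse chromosome field is examined.
field_hits = [f["HUMAN_GENE"] for _, f in records if f["MOUSE_CHROM"] == "10"]
```

The full-text search returns both genes, although only geneB has a mouse orthologue on mouse chromosome 10. Restricting the match to a named field requires parsing the file first, which is the kind of structure a plain text editor does not provide.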

3.2.3 Traditional Database Management


This discussion of the limitations of spreadsheets and flat files points toward the advantages of traditional data management approaches. The most mature of these, relational technology, was conceived in 1970 in a seminal paper by E. F. Codd [13]. In the succeeding 30 years, the technology has become very mature and robust, and a great deal of innovative thought has been put into making data retrieval faster and faster. For example, a great step forward was cost-based optimization, or planning a query based on minimizing the expense to execute it, invented in 1979 by Patricia Selinger [14]. Similarly, because relational technology was originally developed for business systems with a high volume of simultaneous inserts, updates and deletes, its ability to accommodate multiple concurrent users is highly advanced.
The Relational Model

A data model is the fundamental abstraction through which data is viewed. Although the terms are often confused, a data model is not the same as a schema, which represents the structure of a particular set of data. The basic element of the relational data model is a table (or relation) of rows (or tuples) and columns (or attributes). A representation of gene expression data in tabular fashion means the relational data model is being used. A particular relational schema might contain a gene table whose columns are GenBank accession number, Swiss-Prot accession number, description, chromosomal location, Pfam classification, CDS sequence, peptide sequence, GO annotation, gene expression results, SNPs, date_entered, and date_modified; and a gene expression table whose columns are GenBank_accession_number, tissue_ID, and intensity_value.

A number of basic operations are defined on relations, expressed by the relational algebra operators [2]:
• Projection (π) produces from a relation R a new relation (noted π R) that has only some of R's columns. In the example, the projection operator might return only the GenBank and Swiss-Prot accession numbers of the genes in the table.
• Selection (σ) produces from a relation R a new relation (noted σ R) with a subset of R's rows. For example, this could be the genes that have a Pfam protein kinase domain.
• The union (∪) of two relations R and S (noted R ∪ S) is the set of rows that are in R or S or both. (R and S must have identical sets of attributes.) For example, if there were 24 separate tables of genes, one for each human chromosome, the union operator could be used to yield a single table containing all the genes in the genome.
• The difference operation, noted R - S, of two relations R and S is the set of elements that are in R but not in S; for example, this could be the set of GenBank accession numbers that appear in the genes table but are not present in the gene expression table.
• The join (⋈) of two relations R and S (noted R ⋈ S) is a relation consisting of all the columns of R and S, with rows from R and S paired if they agree on particular attribute(s) common to R and S, called the join attribute(s). For example, a user might join the genes table and the expression table on GenBank accession number, pairing genes with their expression results.

The relational algebra operations are the building blocks that may be combined to form more complex expressions, or queries, that enable users to ask complex questions of scientific interest. For example, the following query involves projection, selection, union, and join: "Retrieve the GenBank accession numbers, peptide sequence, and tissue IDs" [projection] "for all genes on any chromosome" [union] "that have associated SNPs" [selection] "and show expression" [join] "in central nervous system tissue" [selection].

A key element of the relational approach is enabling users to describe the behavior they want to ensure or the results they want to retrieve, rather than requiring them to write a program that specifies, step by step, how to obtain the results or ensure the behavior. The Structured Query Language (SQL) [2], the language through which users pose questions to a relational database and specify constraints on relational data, is thus declarative rather than procedural. Through declarative statements, users can specify that a column value may not be null, that it must be unique in its table, that it must come from a predefined set or range of values, or that it must already be present in a corresponding column of another table.
For example, when adding an expression result, the gene used must already be registered in the gene table. Through declarative queries, users can ask complex questions of the data involving many different columns in the database at once, and because relational tables may be indexed on multiple columns, such searches are fast. Advanced search capabilities permit defining subsets of the database and then counting or averaging numeric values over the subset. An example would be listing all tissues sampled and the average expression value in each over a set of housekeeping genes. Performing such computations over subsets of tables is called aggregation, and functions like count, average, minimum, and maximum are aggregate functions.
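The constraints and the aggregation described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names, column names, and data values are hypothetical, and SQLite stands in for whatever relational DBMS is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request

# Hypothetical schema: every constraint is declared, not programmed.
conn.executescript("""
CREATE TABLE gene (
    accnum       TEXT NOT NULL PRIMARY KEY,                 -- not null, unique
    name         TEXT NOT NULL UNIQUE,
    housekeeping INTEGER NOT NULL CHECK (housekeeping IN (0, 1))  -- predefined set
);
CREATE TABLE expression (
    accnum TEXT NOT NULL REFERENCES gene (accnum),          -- gene must be registered
    tissue TEXT NOT NULL,
    level  REAL NOT NULL CHECK (level >= 0)                 -- allowed range
);
""")

conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
    ("G1", "actb", 1), ("G2", "gapdh", 1), ("G3", "kcna1", 0),
])
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", [
    ("G1", "brain", 10.0), ("G2", "brain", 20.0), ("G3", "brain", 99.0),
    ("G1", "liver", 4.0), ("G2", "liver", 8.0),
])

# Aggregation: average expression per tissue over the housekeeping genes,
# joining the two tables on accession number.
averages = conn.execute("""
    SELECT e.tissue, AVG(e.level)
    FROM expression e JOIN gene g ON g.accnum = e.accnum
    WHERE g.housekeeping = 1
    GROUP BY e.tissue
    ORDER BY e.tissue
""").fetchall()

# An expression result for an unregistered gene is rejected by the database.
try:
    conn.execute("INSERT INTO expression VALUES ('UNREGISTERED', 'brain', 1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

With this toy data, `averages` holds one row per tissue, and the final insert fails because no gene with that accession number exists.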
Finally, because it is easy to define multiple related tables in a relational database, a user may define separate tables for genes and their aliases, permitting fast searches over multiple aliases and eliminating the need for users to know what type of alias they are searching with, that is, where it comes from (Swiss-Prot, GenBank, etc.).

There are two main disadvantages of relational databases when compared to flat file data sources and spreadsheets: Specialized software is required to query the data, and free text searches of the entire entry are not supported in a traditional relational database.

A criticism sometimes made of the relational data model is that it is not natural to model complex, hierarchically structured biological objects as flat, relational tables. For example, an annotated sequence, as represented in GenBank, is a rich structure.
The systems in the BioKleisli family (see Chapter 6) address this issue by defining their basic operations on nested relations, that is, relations whose attributes can themselves be relations. Another approach to management of hierarchically structured data is to represent it in eXtensible Markup Language (XML) [15], a structured text data exchange format based on data values combined with tags that indicate the data's structure. Special-purpose XML query languages are in development that will enable users to pose complex queries against XML databases and specify the desired structure of the resulting data [16].
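As a small illustration of hierarchically structured data, the sketch below represents an annotated sequence as nested XML and retrieves parts of it with path expressions. The element names, attribute names, and accession value are invented for this example and do not follow any real GenBank or XML standard; Python's xml.etree.ElementTree supports only a limited subset of XPath and is used here as a stand-in for a full XML query language.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated-sequence entry: features nest inside features,
# something a single flat table cannot express directly.
ENTRY_XML = """
<entry accession="ACC0001">
  <sequence>ATGGCC</sequence>
  <feature type="gene" name="kcna1">
    <location start="1" end="6"/>
    <feature type="CDS">
      <location start="1" end="6"/>
    </feature>
  </feature>
</entry>
"""

root = ET.fromstring(ENTRY_XML)

# Top-level gene features of the entry.
gene_names = [f.get("name") for f in root.findall("./feature[@type='gene']")]

# CDS features at any depth of nesting.
cds_features = root.findall(".//feature[@type='CDS']")
cds_end = cds_features[0].find("location").get("end")
```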

Use Case: Transforming Database Structure
Often, transformation of database structures is necessary to enable effective querying and management of biological data. Many venerable data sources no longer represent biological objects optimally for the kinds of queries investigators typically want to pose. For example, it has often been noted that GenBank is sequence-centric, not gene-centric, so queries concerning the structure of individual genes are not easy to express. In contrast, Swiss-Prot is sequence-centric, not domain-centric, so it is rather awkward to ask for proteins with carbohydrate features in a certain domain because all these features are represented in terms of the sequence as a whole.

To illustrate one method of handling data transformations, consider a very simple gene table with attributes GenBank accession number, SwissProt accession number, and sequence.
It might be advantageous to enable users to retrieve sequences by accession number without knowing where the accession number originated (GenBank or Swiss-Prot). Creating a separate table for aliases is one solution, particularly if each gene has many different accession numbers, including multiple accession numbers from the same original data source. Another way to permit this search is to transform the database into the following schema: accession number and sequence. This transformation can be accomplished by retrieving all the GenBank accession numbers and their associated sequences, then retrieving all the SwissProt accession numbers and their associated sequences, and finally doing a union of those two sets. The formula or expression that defines this transformed relation is called a view.
This expression may be used to create a new table, called a materialized view, which exists separately from the original table, so that changes to the original are not applied to the new table. If the expression is not used to create a new table, but only to retrieve data from the original table and transform it on the fly, it is a non-materialized view, or simply a view.

Recall the critique that it is not natural to model complex, hierarchically structured biological objects as flat, relational tables. A user might choose relational database technology for storing and managing data due to its efficiency, maturity, and robustness but still wish to present a hierarchical view of the data to the user, one that more closely matches biological concepts. This (non-materialized) view may be accomplished by means of a conceptual schema layered on top of the relational database.
The biological object layers of Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) (see Chapter 7) and the Acero Genome Knowledge Platform [17] are efforts in this direction.
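The accession-number transformation in this use case can be sketched as follows, again with Python's sqlite3 module. The table and column names and the data are assumptions made for illustration. SQLite has no materialized views, so a plain table filled from the view stands in for one here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical version of the simple gene table from the text.
CREATE TABLE gene (
    genbank_acc   TEXT,
    swissprot_acc TEXT,
    sequence      TEXT NOT NULL
);
INSERT INTO gene VALUES ('GB001', 'SP001', 'MKT');
INSERT INTO gene VALUES ('GB002', 'SP002', 'MLS');

-- Non-materialized view: the union is recomputed from gene on every query.
CREATE VIEW accession_view AS
    SELECT genbank_acc AS accession, sequence FROM gene
    UNION
    SELECT swissprot_acc AS accession, sequence FROM gene;

-- Stand-in for a materialized view: a separate table filled once from the view.
CREATE TABLE accession_snapshot AS SELECT * FROM accession_view;
""")

def lookup(relation, accession):
    """Return the sequence for an accession number, whatever its origin."""
    row = conn.execute(
        f"SELECT sequence FROM {relation} WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None

before = (lookup("accession_view", "SP002"), lookup("accession_snapshot", "SP002"))

# A later change to the original table reaches the view but not the snapshot.
conn.execute("INSERT INTO gene VALUES ('GB003', 'SP003', 'MAA')")
after = (lookup("accession_view", "GB003"), lookup("accession_snapshot", "GB003"))
```

The user searches by accession number alone in both cases. Only the view reflects the gene added afterward, which is why a materialized copy has to be rebuilt when the original changes.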


3.3 DIMENSIONS DESCRIBING THE SPACE OF INTEGRATION SOLUTIONS
There is nearly universal agreement in the bioinformatics and genomics communities that scientific investigation requires an integrated view of all relevant data. A general discussion of the scope of biological data integration, as well as the obstacles that currently exist for integration efforts, is presented in Chapter 1 of this book. The typical bioinformatics practitioner encounters data in a wide variety of formats, as Chapter 2 presents, including relational databases, semi-structured flat files, and XML documents. In addition, the practicing bioinformatician must integrate the results of analytical applications performing such tasks as sequence comparison, domain identification, motif search, and phylogenetic classification. Finally, Internet sites are also critical due to the traditional importance of publicly funded, public domain data at academic and government Web sites, whether they are central resources or boutique data collections targeting specific research interests.
These Internet resources often provide specialized search functionality as well as data, such as the Basic Local Alignment Search Tool (BLAST) at NCBI [18] and the Simple Modular Architecture Research Tool (SMART) at the European Molecular Biology Laboratory (EMBL) [19]. A bioinformatics integration strategy must make sure this specialized search capability is retained.

3.3.1 A Motivating Use Case for Integration


To motivate the need for an integration solution, consider the following use case: "Retrieve sequences for all human expressed sequence tags (ESTs) that by BLAST are >60% identical over >50 amino acids to mouse channel genes expressed in central nervous system (CNS) tissue." For those less familiar with biological terms, a channel gene is a gene coding for a protein that is resident in the membrane of a cell and that controls the passage of ions (potassium, sodium, calcium) into and out of the cell. The channels open and close in response to appropriate signals and establish ion levels within the cell. This is particularly important for neural network cells.

The data sources used in this query are: the Mouse Genome Database (MGD) at the Jackson Laboratory in Bar Harbor, Maine [20]; the Swiss-Prot protein sequence data source at the Swiss Institute for Bioinformatics; and the BLAST search tool and the GenBank nucleotide sequence data collection at NCBI.
The data necessary to satisfy this query are split, or distributed, across multiple data sources at multiple sites. One way to integrate these data sources is to enable the user to access them as if they were all components of a single, large database with a single schema. This large global schema is an integrated view of all the local schemas of the individual data sources. Producing such a global schema is the task of schema integration.

This example illustrates six dimensions for categorizing integration solutions:

• Is data accessed via browsing or querying?
• Is access provided via declarative or procedural code?
• Is the access code generic (used for all similar data sources) or hard-coded for the particular source?
• Is the focus on overcoming semantic heterogeneity (heterogeneity of meaning) or syntactic heterogeneity (heterogeneity of format)?
• Is integration accomplished via a data warehouse or a federated approach?
• Is data represented in a relational or a non-relational data model?

As will become evident, some approaches will be better suited to addressing this particular use case than others; this is not intended to prejudice but to clarify the differences among the approaches. The rest of Section 3.3 discusses various alternative approaches to addressing this motivating use case.

3.3.2 Browsing vs. Querying


The relationship between browsing and querying is similar to the relationship in library research between browsing the stacks and conducting an online search. Both are valid approaches with distinct advantages. Browsing, like freely wandering in the stacks, permits relatively undirected exploration. It involves a great deal of leg work, but it is the method of choice when investigators want to explore the domain of interest to help sharpen their focus. It is also well suited to retrieval of a single Web page by its identifier or a book by its call number. On the other hand, querying, like online searching, permits the formulation of a complex search request as a single statement, and its results are returned as a single collated set. Both browsing and querying allow the user to select a set of documents from a large collection and retrieve them. However, browsing stops at retrieval, requiring manual navigation through the resulting documents and related material via static hyperlinks.
Querying goes further than retrieval: It accesses the content [21] of the resulting documents, extracts information and manipulates it, for example, dropping some items and performing computations on others. Querying thus makes very efficient use of human time and is the method of choice when an investigator's interests are already focused, especially if aggregations over subsets of data are involved.

While the motivating use case may be successfully addressed using the browsing approach, it is tedious, error-prone, and very cumbersome, involving an average of 70 BLAST result sets consisting of up to 500 EST hits each. In the browsing approach, the user searches for channel sequences expressed in CNS tissues using the MGD query form. The result is 14 genes from 17 assays. The user then visits each gene's MGD page. Assume that the user is only interested in Swiss-Prot sequences and that each gene has an average of five associated Swiss-Prot sequence entries. The user has to visit each sequence's Swiss-Prot page, from which a BLAST search against gbest (the EST portion of GenBank) is launched. Each BLAST result must be inspected to eliminate non-human sequence hits and alignments that do not meet the inclusion criteria (>60% identity over >50 amino acids) and to eliminate duplicate ESTs hit by multiple Swiss-Prot sequences. Finally, the full EST sequences for all the hits that survive must be retrieved from GenBank. If the browsing approach was used to satisfy this query, these steps would then be repeated for each of the 14 genes returned by the initial query (Figure 3.1).
[Diagram not reproduced; its recoverable labels are "MGD query", "MGD gene" (repeated), and "Result".]

FIGURE 3.1
Schematic diagram of the browsing approach to the motivating use case.


Querying Approach

"Show me all human EST sequences that are >60% identical over 50 AA to mouse channel genes expressed in CNS"

Simplified SQL:

    SELECT g.accnum, g.sequence
    FROM   genbank g, blast b, swissprot s, mgd m
    WHERE  m.exp = 'CNS'
    AND    m.defn LIKE '%channel%'
    AND    m.spid = s.id
    AND    s.seq = b.query
    AND    b.hit = g.accnum
    AND    b.percentid > 60
    AND    b.alignlen > 50

[Diagram not reproduced; its recoverable labels include the Mouse Genome database and the BLAST application.]

FIGURE 3.2
The querying approach to the motivating use case.

In a querying approach to this problem, a short SQL query is submitted to the query processor. The query processor visits MGD to identify channel genes expressed in CNS, and the Swiss-Prot Web site to retrieve their sequences. For each of these sequences, it launches a BLAST search against gbest, gathers the results, applies the stringency inclusion criteria, and finally retrieves the full-length EST sequences from GenBank (Figure 3.2).
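To make the simplified SQL of Figure 3.2 concrete, the following sketch runs it against toy tables in a single in-memory SQLite database. Several assumptions apply: the table and column names follow the figure, all data values are invented, the BLAST results are precomputed rows rather than run-time searches, and the human-only restriction is left out because the simplified query does not express it. Loading everything into one database is a shortcut taken for illustration, not the distributed query processing described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mgd       (spid TEXT, exp TEXT, defn TEXT);
CREATE TABLE swissprot (id TEXT, seq TEXT);
CREATE TABLE blast     (query TEXT, hit TEXT, percentid REAL, alignlen INTEGER);
CREATE TABLE genbank   (accnum TEXT, sequence TEXT);

INSERT INTO mgd VALUES ('SP1', 'CNS',   'potassium channel subunit');
INSERT INTO mgd VALUES ('SP2', 'liver', 'sodium channel');
INSERT INTO mgd VALUES ('SP3', 'CNS',   'kinase');

INSERT INTO swissprot VALUES ('SP1', 'MKTAAA');
INSERT INTO swissprot VALUES ('SP2', 'MLSBBB');
INSERT INTO swissprot VALUES ('SP3', 'MQQCCC');

INSERT INTO blast VALUES ('MKTAAA', 'EST1', 85.0, 120);  -- passes both thresholds
INSERT INTO blast VALUES ('MKTAAA', 'EST2', 55.0, 200);  -- identity too low
INSERT INTO blast VALUES ('MKTAAA', 'EST3', 90.0, 40);   -- alignment too short
INSERT INTO blast VALUES ('MLSBBB', 'EST4', 95.0, 300);  -- gene not expressed in CNS
INSERT INTO blast VALUES ('MQQCCC', 'EST5', 99.0, 300);  -- gene is not a channel

INSERT INTO genbank VALUES ('EST1', 'ACGTACGT');
INSERT INTO genbank VALUES ('EST2', 'TTTTAAAA');
INSERT INTO genbank VALUES ('EST3', 'GGGGCCCC');
INSERT INTO genbank VALUES ('EST4', 'ATATATAT');
INSERT INTO genbank VALUES ('EST5', 'CGCGCGCG');
""")

rows = conn.execute("""
    SELECT g.accnum, g.sequence
    FROM   genbank g, blast b, swissprot s, mgd m
    WHERE  m.exp = 'CNS'
    AND    m.defn LIKE '%channel%'
    AND    m.spid = s.id
    AND    s.seq = b.query
    AND    b.hit = g.accnum
    AND    b.percentid > 60
    AND    b.alignlen > 50
    ORDER BY g.accnum
""").fetchall()
```

In this toy data, only `EST1` satisfies every condition; each of the other hits fails exactly one of them, as the comments note.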

3.3.3 Syntactic vs. Semantic Integration
As stated previously, syntactic integration addresses heterogeneity of form. GenBank is a structured file, MGD is a Sybase (relational) database, and BLAST is an analytical application. These differences in form are overcome in the browsing strategy by providing a Web-based front end to the sources and in the querying strategy by providing SQL access to all the sources. Contrariwise, semantic integration addresses heterogeneity of meaning. In GenBank, a gene is an annotation on a sequence, while in MGD a gene is a locus conferring phenotype (e.g., black hair, blindness). Neither of the integration approaches in this example specifically focuses on resolving this heterogeneity of meaning. They rely instead on the user's knowledge of the underlying data sources to combine data from the sources in scientifically meaningful ways.

3.3.4 Warehouse vs. Federation


In a warehousing approach to integration, data is migrated from multiple sources into a single DBMS, typically a relational DBMS. As it is copied, the data may be cleansed or filtered, or its structure may be transformed to match the desired queries more closely. Because it is a copy of other data sources, a warehouse must be refreshed at specified times: hourly, daily, weekly, monthly, or quarterly. A data warehouse may contain multiple data marts, subset warehouses designed to support a specific activity or inquiry.

While a warehouse replicates data, a federated approach leaves data in its native format and accesses it by means of the native access methods. In the previous example, the querying approach is a federated approach: it accesses MGD as a Sybase database, Swiss-Prot and GenBank as Web sites, and BLAST via runtime searches, and it integrates their results using complex software known as middleware. An alternative demonstration of the querying approach could have imported GenBank, Swiss-Prot, MGD, and the results of BLAST searches into Sybase, Oracle, or IBM DB2 database systems and executed the retrievals and filtering there. This would have been an example of the warehousing approach.
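The copy, cleanse, and refresh cycle of a warehouse can be sketched as below. Everything here is hypothetical: the `refresh` function, the `est` table, and the record layout are invented for illustration, and Python lists stand in for the remote sources. A production warehouse would use incremental loading and scheduling that this sketch omits.

```python
import sqlite3

def refresh(warehouse, records):
    """Replace the warehouse copy with a cleansed snapshot of the source records."""
    cleaned = [
        (r["accnum"].strip(), r["sequence"].strip().upper())  # cleanse
        for r in records
        if r.get("accnum") and r.get("sequence")              # filter incomplete entries
    ]
    with warehouse:  # one transaction: the reload is all-or-nothing
        warehouse.execute("DELETE FROM est")
        warehouse.executemany("INSERT INTO est VALUES (?, ?)", cleaned)
    return len(cleaned)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE est (accnum TEXT PRIMARY KEY, sequence TEXT NOT NULL)")

# Stand-ins for the source at two points in time.
source_v1 = [
    {"accnum": " EST1 ", "sequence": "acgt"},
    {"accnum": "", "sequence": "tttt"},      # no accession number: dropped
]
source_v2 = source_v1 + [{"accnum": "EST2", "sequence": "GGCC"}]

loaded_first = refresh(warehouse, source_v1)
stale = warehouse.execute("SELECT accnum FROM est ORDER BY accnum").fetchall()

# Until the next scheduled refresh, the warehouse does not see EST2.
loaded_second = refresh(warehouse, source_v2)
fresh = warehouse.execute("SELECT accnum, sequence FROM est ORDER BY accnum").fetchall()
```

The gap between `stale` and `fresh` is the cost of replication: queries against a warehouse see the sources only as of the last refresh, whereas a federated query reads the sources at query time.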

3.3.5 Declarative vs. Procedural Access


As discussed previously, declarative access means stating what the user wants, while procedural access specifies how to get it. The typical distinction opposes the use of a query language (e.g., SQL) and writing access methods or sub-routines in Perl, Java, or other programming languages to access data. In the motivating use case in Section 3.3.1, the querying approach uses the SQL query language. Alternatively, Perl [22] sub-routines or object methods that extract data from MGD, Swiss-Prot, and GenBank and run the necessary BLAST searches could have performed the task.
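The contrast can be illustrated by applying the use case's inclusion criteria to the same toy BLAST hits in both styles. This is a sketch under assumptions: the hit tuples, the `organism` field, and the function names are invented, and Python stands in for the Perl or Java code mentioned above.

```python
import sqlite3

# Hypothetical BLAST hits: (EST accession, organism, percent identity, alignment length)
hits = [
    ("EST1", "human", 85.0, 120),
    ("EST1", "human", 72.0, 90),    # same EST hit by a second query sequence
    ("EST2", "mouse", 95.0, 300),   # not human
    ("EST3", "human", 55.0, 200),   # identity too low
    ("EST4", "human", 91.0, 40),    # alignment too short
    ("EST5", "human", 61.0, 51),
]

def procedural(hits):
    """Spell out HOW: loop, test each criterion, track duplicates by hand."""
    seen, result = set(), []
    for est, organism, percentid, alignlen in hits:
        if organism != "human":
            continue
        if percentid <= 60 or alignlen <= 50:
            continue
        if est in seen:
            continue
        seen.add(est)
        result.append(est)
    return sorted(result)

def declarative(hits):
    """State WHAT is wanted and leave the evaluation strategy to the DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE blast (hit TEXT, organism TEXT, percentid REAL, alignlen INTEGER)"
    )
    conn.executemany("INSERT INTO blast VALUES (?, ?, ?, ?)", hits)
    rows = conn.execute("""
        SELECT DISTINCT hit FROM blast
        WHERE organism = 'human' AND percentid > 60 AND alignlen > 50
        ORDER BY hit
    """).fetchall()
    return [row[0] for row in rows]

procedural_result = procedural(hits)
declarative_result = declarative(hits)
```

Both functions return the same accessions. The difference lies in who decides the order of the tests, the handling of duplicates, and the use of any indexes: the programmer in the first case, the query processor in the second.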

3.3.6 Generic vs. Hard-Coded


The federated approach in the previous example was generic; it assumed the use of a general-purpose query execution engine and general-purpose wrappers (software modules tailored to a particular family of data sources) for data access. An example of a hard-coded approach to the problem would be writing a special-purpose Perl script to retrieve just the information needed to answer this particular question. A generic system enables users to ask numerous queries supporting a variety of scientific tasks, while a hard-coded approach typically answers a single query and supports users in a single task. Generic approaches generally involve higher up-front development costs, but they can pay for themselves many times over in flexibility and ease of maintenance because they obviate the need for extensive programming every time a new research question arises.
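One way to picture the wrapper idea is sketched below. The `Wrapper` interface, both wrapper classes, the `select` function, and the flat-file layout are all hypothetical and are not taken from any system discussed in this book. The point is only that the generic `select` step never needs to know which family of source it is reading.

```python
import sqlite3
from typing import Iterable, Protocol

class Wrapper(Protocol):
    """Hypothetical common interface: every source yields records as dicts."""
    def records(self) -> Iterable[dict]: ...

class RelationalWrapper:
    """Wrapper for the family of relational sources (here, any SQLite table)."""
    def __init__(self, conn, table):
        self.conn, self.table = conn, table

    def records(self):
        cursor = self.conn.execute(f"SELECT * FROM {self.table}")
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))

class FlatFileWrapper:
    """Wrapper for flat files of 'KEY: value' lines, with '//' ending a record."""
    def __init__(self, text):
        self.text = text

    def records(self):
        record = {}
        for line in self.text.splitlines():
            line = line.strip()
            if line == "//":
                if record:
                    yield record
                record = {}
            elif ":" in line:
                key, value = line.split(":", 1)
                record[key.strip().lower()] = value.strip()
        if record:
            yield record

def select(wrapper, **criteria):
    """Generic engine step: the same code serves any wrapped source."""
    return [
        r for r in wrapper.records()
        if all(r.get(k) == v for k, v in criteria.items())
    ]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (id TEXT, tissue TEXT)")
conn.executemany("INSERT INTO gene VALUES (?, ?)", [("g1", "CNS"), ("g2", "liver")])
flat = "ID: s1\nTISSUE: CNS\n//\nID: s2\nTISSUE: heart\n//\n"

relational_hits = select(RelationalWrapper(conn, "gene"), tissue="CNS")
flat_hits = select(FlatFileWrapper(flat), tissue="CNS")
```

A hard-coded script would instead fold the parsing and the single question into one program. Under this design, supporting a new question means calling `select` with different criteria, and supporting a new source family means writing one more wrapper.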

3.3.7 Relational vs. Non-Relational Data Model

Recall that a data model is not a specific database schema, but rather something more abstract: the way in which data are conceptualized. For example, in the relational data model, the data are conceptualized as a set of tables with rows and columns. Oracle, Sybase, DB2, and MySQL are all DBMSs built on the relational model. In data management systems adhering to a non-relational data model, data may be conceptualized in many different ways, including hierarchical (tree-like) structures, ASCII text files, or Java or Common Object Request Broker Architecture (CORBA) [23] objects. In the motivating example, MGD is relational, and the other sources are non-relational.
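To illustrate the difference in conceptualization, the sketch below holds the same information twice: once as two flat tables linked by a key, and once as a nested (hierarchical) structure. The gene, its exon coordinates, and all table and field names are hypothetical.

```python
import sqlite3

# Relational conceptualization: two flat tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
conn.execute("CREATE TABLE exon (gene_id INTEGER, start INTEGER, stop INTEGER)")
conn.execute("INSERT INTO gene VALUES (1, 'geneA')")
conn.executemany("INSERT INTO exon VALUES (?, ?, ?)", [(1, 10, 50), (1, 90, 140)])

relational_exons = conn.execute(
    "SELECT e.start, e.stop FROM gene g JOIN exon e ON g.gene_id = e.gene_id "
    "WHERE g.symbol = 'geneA' ORDER BY e.start"
).fetchall()

# Hierarchical conceptualization: the exons nest inside the gene they belong to.
hierarchical = {
    "symbol": "geneA",
    "exons": [{"start": 10, "stop": 50}, {"start": 90, "stop": 140}],
}
hierarchical_exons = [(e["start"], e["stop"]) for e in hierarchical["exons"]]
```

The content is identical; what changes is whether the relationship between gene and exon is expressed through a join or through nesting.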

3.4 USE CASES OF INTEGRATION SOLUTIONS

The motivating use case in Section 3.3.1 permitted a brief outline of the six dimensions for categorizing integration solutions. To further elucidate these dimensions and demonstrate their use, this section describes each dimension in greater detail, presents a prototypical featured solution, and categorizes the featured integration solution on all six dimensions.

3.4.1 Browsing-Driven Solutions

As in the previous example, in a browsing approach users are provided with interactive access to data, allowing them to step sequentially through the exploratory process. A typical browsing session begins with a query form that supports a set of pre-defined, commonly posed queries. After the user has specified the parameters of interest and the query is executed, a summary screen is typically returned. From here the user may drill down, one by one, into the individual objects meeting the search criteria and from there view related objects by following embedded links, such as hypertext markup language (HTML) or XML hyperlinks. The data source(s) underlying a browsing application may be warehoused or federated, and relational or non-relational. Browsing applications are ubiquitous on the Internet; examples are Swiss-Prot and the other data collections on the Expert Protein Analysis System (ExPASy) server at the Swiss Institute of Bioinformatics [24], the FlyBase Web site for Drosophila genetics [25, 26], and the featured example, the Entrez Web site at NCBI [10].

Browsing Featured Example: NCBI Entrez

As an example of the browsing approach, consider the following query: "Find in PubMed articles published in 2002 that are about human metalloprotease genes and retrieve their associated GenBank accession numbers and sequences." The sequence of steps in answering this query is shown in Figures 3.3 through 3.7. First the user enters the Entrez Boolean search term "metalloprotease AND human AND 2002 [pdat]" in the PubMed online query form [7]. The result is a summary of qualifying hits; there were 1054 in December 2002 (Figure 3.3). From here, the user can visit individual PubMed entries (Figure 3.4), read their abstracts, check for a GenBank sequence identifier in the secondary source ID attribute (Figure 3.5), and visit the associated GenBank entry to retrieve the sequence.
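The same search term can also be submitted programmatically rather than through the query form. The sketch below only builds the request URL and makes no network call. It assumes the current public E-utilities ESearch endpoint and its `db`, `term`, and `retmax` parameters; the interface available when this chapter was written may have differed, and the helper name `build_esearch_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: the present-day public ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Encode an Entrez Boolean search term as an ESearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# The search term is the one used in the browsing example above.
url = build_esearch_url("metalloprotease AND human AND 2002 [pdat]")
query = parse_qs(urlparse(url).query)
print(url)
```

Fetching the URL would return a list of matching PubMed identifiers, which corresponds to the summary screen in the browsing session; drilling down into each entry remains a separate step.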
FIGURE 3.3 PubMed articles published in 2002 on human metalloprotease genes.


FIGURE 3.4 One of the qualifying PubMed abstracts.

Alternatively, the user can take advantage of the LinkOut option on the PubMed entry page (Figures 3.4 and 3.6), which enables access to the sequence information provided by LocusLink (Figure 3.7). Notice that there are many more navigation paths to follow via hyperlinks than are described here; a strength of the browsing approach is that it supports many different navigation paths through the data.

The categorization of Entrez based on the six dimensions is given in Table 3.1.

FIGURE 3.5 Checking for GenBank references in the PubMed entry.

3.4.2 Data Warehousing Solutions

In the data warehousing approach, data is integrated by means of replication and storage in a central repository. Often data is cleaned and/or transformed during the loading process. While a variety of data models are used for data warehouses, including XML and ASN.1, the relational data model is the most popular choice (e.g., Oracle, Sybase, DB2, MySQL). Examples of the integration solutions following the data warehousing approach include Gene Logic's GeneExpress Database (presented in Chapter 10) [27], the Genome Information Management System of the University of Manchester [28], the data source underlying the GeneCards Web site at the Weizmann Institute in Israel [29, 30], and AllGenes [31], which will serve as the featured example.

Warehousing Featured Example: AllGenes

A research project of the Computational Biology and Informatics Laboratory at the University of Pennsylvania, AllGenes is designed to provide access to a database integrating every known and predicted human and mouse gene, using only publicly available data. Predicted human and mouse genes are drawn from transcripts predicted by clustering and assembling EST and messenger RNA (mRNA) sequences. The focus is on integrating the various types of data (e.g., EST sequences, genomic sequence, expression data, functional annotation). Integration is performed in a structured manner using a relational database and controlled vocabularies and ontologies [32]. In addition to clustering and assembly, significant cleansing and transformation are done before data is loaded onto AllGenes, making data warehousing an excellent choice.
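The load-time cleansing and transformation that characterize a warehouse can be sketched as follows. This is an illustration only and not the AllGenes loading pipeline: the two sources, their field names, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the central repository.

```python
import sqlite3

# Hypothetical raw records from two sources with inconsistent conventions.
source_a = [{"id": "a1", "organism": "Mouse ", "seq": "acgt"}]
source_b = [{"acc": "b7", "species": "HUMAN", "sequence": "GG-CC"}]


def clean(record_id, organism, sequence):
    """Normalize case and whitespace and strip gap characters before loading."""
    return (record_id, organism.strip().lower(), sequence.replace("-", "").upper())


# The warehouse: one central relational store holding replicated, cleaned data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sequence (id TEXT PRIMARY KEY, organism TEXT, seq TEXT)")

# Transform each source into the common schema, then load.
rows = [clean(r["id"], r["organism"], r["seq"]) for r in source_a]
rows += [clean(r["acc"], r["species"], r["sequence"]) for r in source_b]
warehouse.executemany("INSERT INTO sequence VALUES (?, ?, ?)", rows)

loaded = warehouse.execute("SELECT id, organism, seq FROM sequence ORDER BY id").fetchall()
```

Once loaded, queries run against the local copy only, which is why cleaning is done once at load time rather than at every query.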


FIGURE 3.6 The LinkOuts page enables access to LocusLink.

A sample query for AllGenes is the following: "Show me the DNA repair genes that are known to be expressed in central nervous system tissue." The query is specified and run using a flexible query-builder interface (Figure 3.8), yielding a summary of qualifying assemblies (Figure 3.9). From the query result page the user can visit a summary page for each qualifying assembly, which includes such valuable information as predicted GO functions; hyperlinks to GeneCards, the Mouse Genome Database (MGD), GenBank, ProDom, and so on; Radiation Hybrid (RH) Map locations; the 10 best hits against the GenBank non-redundant protein database (nr); and the 10 best protein domain/motif hits.

The categorization of AllGenes based on the six dimensions is given in Table 3.2.
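A parameterized query builder of the kind described above can be sketched as a function that turns form selections into SQL with bound parameters. The schema, column names, and assembly identifiers are hypothetical; this is not the AllGenes schema or its interface code.

```python
import sqlite3

# Hypothetical columns that the form is allowed to filter on.
ALLOWED_FIELDS = {"go_function", "organ_system", "organism"}


def build_query(filters):
    """Turn {field: value} selections from a form into parameterized SQL."""
    clauses, params = [], []
    for field, value in sorted(filters.items()):
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unsupported field: {field}")
        clauses.append(f"{field} = ?")
        params.append(value)
    sql = "SELECT assembly_id FROM assembly"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY assembly_id", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assembly (assembly_id TEXT, organism TEXT, go_function TEXT, organ_system TEXT)"
)
conn.executemany(
    "INSERT INTO assembly VALUES (?, ?, ?, ?)",
    [
        ("asm1", "human", "DNA repair", "central nervous system"),
        ("asm2", "mouse", "DNA repair", "digestive system"),
        ("asm3", "mouse", "kinase", "central nervous system"),
    ],
)

sql, params = build_query(
    {"go_function": "DNA repair", "organ_system": "central nervous system"}
)
hits = [row[0] for row in conn.execute(sql, params)]
```

The user never writes SQL; the declarative query exists only under the covers, which limits what can be asked to the fields the form exposes.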

FIGURE 3.7 Sequences may be obtained from LocusLink entries corresponding to PubMed articles.

3.4.3 Federated Database Systems Approach

Recall that in a federated approach, data sources are not migrated from their native source formats, nor are they replicated to a central data warehouse. The data sources remain autonomous, data is integrated on the fly to support specific queries or applications, and access is typically through a declarative query language. Examples of federated systems and their data models include complex-relational systems, such as BioKleisli/K2 (Chapter 8) and its cousin GeneticXchange's K1 (see Chapter 6), object-relational systems (OPM/TINet) [33], and IBM's relational system DiscoveryLink, which is detailed in Chapter 10 and will serve as the featured example [34].

TABLE 3.1 Entrez categorization with respect to the six dimensions of integration.

Browsing: Interactive Web browser access to data
Querying: No querying capability
Semantic: No semantic integration
Syntactic: Provides access to nucleotide and protein sequence, annotation, MEDLINE abstracts, etc.
Warehouse: Provides access to data sources at NCBI
Federation: No federation
Declarative Access: No declarative access
Procedural Access: Access via Entrez Programming Utilities (E-utilities)
Generic: Not generic
Hard-Coded: Hard-coded for NCBI sources only; links are hard-coded indices
Relational Data Model: Relational data model not used
Non-Relational Data Model: Data stored in the ASN.1 complex-relational data model

Federated Featured Example: DiscoveryLink

The motivating use case of Section 3.3.1 is a good fit for a federated approach like DiscoveryLink's. DiscoveryLink provides transparency: The federation of diverse types of data from heterogeneous sources appears to the user or the application as a single large database, in this case a relational database. The SQL query language is supported over all the federated sources, even if the underlying sources' native search capabilities are less full-featured than SQL; a single federated query, as in the earlier motivating example, typically combines data from multiple sources. Similarly, specialized non-SQL search capabilities of the underlying sources are also available as DiscoveryLink functions.

The architecture of DiscoveryLink appears in Chapter 10 (Figure 10.1). At the far right are the data sources. To these sources, DiscoveryLink looks like an application; they are not changed or modified in any way. DiscoveryLink talks to the sources using wrappers, which use the data source's own client-server mechanism to interact with the sources in their native dialect. DiscoveryLink has a local catalog in which it stores information (meta-data) about the data accessible (both local data, if any, and data at the back end data sources). Applications of DiscoveryLink manipulate data using any supported SQL Application Programming Interface (API); for example, Open Database Connectivity (ODBC) or Java DataBase Connectivity (JDBC) are supported, as well as embedded SQL. Thus a DiscoveryLink application looks like any normal database application.
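The roles of wrappers and catalog can be illustrated with a toy federation. This is a sketch of the general idea only, not DiscoveryLink's architecture or API: the class names, the catalog layout, the join function, and all data are hypothetical, and the sources are in-memory stand-ins.

```python
from typing import Dict, Iterator, List

Row = Dict[str, str]


class FlatFileWrapper:
    """Wraps a tab-delimited text source and exposes it as rows."""

    def __init__(self, text: str, columns: List[str]):
        self.text, self.columns = text, columns

    def scan(self) -> Iterator[Row]:
        for line in self.text.strip().splitlines():
            yield dict(zip(self.columns, line.split("\t")))


class ObjectWrapper:
    """Wraps a source that already returns objects (here, plain dicts)."""

    def __init__(self, objects: List[Row]):
        self.objects = objects

    def scan(self) -> Iterator[Row]:
        yield from self.objects


# The catalog records which wrapper serves which logical table.
catalog = {
    "gene": FlatFileWrapper("geneA\tmouse\ngeneB\thuman", ["symbol", "organism"]),
    "protein": ObjectWrapper([{"accession": "P0001", "gene": "geneA"}]),
}


def federated_join(left: str, right: str, left_key: str, right_key: str) -> List[Row]:
    """Join two sources on the fly; nothing is copied into a central store."""
    index: Dict[str, List[Row]] = {}
    for row in catalog[right].scan():
        index.setdefault(row[right_key], []).append(row)
    return [
        {**lrow, **rrow}
        for lrow in catalog[left].scan()
        for rrow in index.get(lrow[left_key], [])
    ]


result = federated_join("gene", "protein", "symbol", "gene")
```

Each wrapper hides its source's native format behind the same `scan` method, so the join logic does not need to know whether it is reading a flat file or an object store. A real federated engine would additionally translate query fragments into each source's native dialect and push work down to the sources where possible.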


FIGURE 3.8 The AllGenes query builder.
FIGURE 3.9 Results of the AllGenes query.


TABLE 3.2 AllGenes categorization with respect to the six dimensions of integration.

Browsing: Interactive Web browser access to data
Querying: Limited querying capability via parameterized query builder
Semantic: Ontologies for semantic integration
Syntactic: Data warehousing for syntactic integration
Warehouse: Data stored in relational warehouse
Federation: Not a federation
Declarative Access: Under the covers; users use parameterized query builder
Procedural Access: No procedural access
Generic: Information not available
Hard-Coded: Information not available
Relational Data Model: Data stored in Oracle DBMS
Non-Relational Data Model: Not used

The categorization of DiscoveryLink based on the six dimensions is given in Table 3.3.

3.4.4 Semantic Data Integration

Recall that semantic data integration focuses on resolving heterogeneity of meaning, while syntactic data integration focuses on heterogeneity of form. In a volume on management of heterogeneous database systems, Kashyap and Sheth write:

In any approach to interoperability of database systems [database integration], the fundamental question is that of identifying objects in different databases that are semantically related and then resolving the schematic [schema-related] differences among semantically related objects. [35]

This This is is the the fundamental fundamental problem problem of of semantic semantic data data integration. integration. The The same same protein protein se sequence quence is is known known by by different different names names or or accession accession numbers numbers (synonyms) (synonyms) in in GenBank GenBank and and Swiss-Prot. Swiss-Prot. The The same same mouse mouse gene gene may may be be represented represented as as a a genetic genetic map map locus locus in in MGD, MGD, the the aggregation aggregation of of multiple multiple individual individual exon exon entries entries in in GenBank, GenBank, and and a a set set of product may of EST EST sequences sequences in in UniGene; UniGene; in in addition, addition, its its protein protein product may be be an an entry entry in in

Use Cases Cases of of Integration I ntegration Solutions Sol utions 3.4 Use 3.4

59
Querying Querying

Browsing Browsing No browsing browsing capability capability No

Full ad ad hoc SQL query query language


Syntactic Syntactic

Semantic
No semantic integration integration No

Maps heterogeneous sources into 9Maps

relational model

Maps SQL into native query languages 9Maps of sources of

Warehouse
Not available, though though warehouses may may be Not members of a DiscoveryLink federation federation members of

Federation
Integrates heterogeneous sources Integrates through wrappers and through wrappers and middleware

Declarative Access
SQL query language

Procedural Access
No procedural access No

Generic
Query processor, most wrappers wrappers Relational Data Model Model top of DB2 Built on top

Hard-Coded
Some access wrappers wrappers (e.g., BLAST) Non-Relational Model Non-Relational Data Model Not used Not

3.3 3.3

DiscoveryLink categorization with respect to the six dimensions of integration.

TABLE TABLE
Swiss-Prot and its human orthologues may be represented as a disease-associated locus in Online Mendelian Inheritance in Man (OMIM) [36]. Semantic integration also deals with how different data sources are to be linked together. For example, according to documentation at the Jackson Lab Web site [37], MGD links to Swiss-Prot through its marker concept, to RatMap [38] through orthologues, to PubMed through references, and to GenBank through either markers (for genes) or molecular probes and segments (for anonymous DNA segments). Finally, a schema element with the same name in two different data sources can have different semantics and therefore different data values. For example, retrieving orthologues to the human BRCA1 gene in model organisms from several commonly used Web sites yields varying results: GeneCards returns the BRCA1 gene in mouse and C. elegans; MGD returns the mouse, rat, and dog genes; the Genome DataBase (GDB) [39, 40] returns the mouse and Drosophila genes; and LocusLink returns only the mouse gene.

Approaches to semantic integration in the database community generally center on schema integration: understanding, classifying, and representing schema differences between two disparate databases. For example, in capturing the semantics of the relationships between objects in multiple databases, Kashyap and Sheth describe work on understanding the context of the comparison, the abstraction relating the domains of the two objects, and the uncertainty in the relationship [35]. Bioinformatics efforts at semantic integration have largely followed the approach of the artificial intelligence community. Examples of such semantic integration efforts are the Encyclopedia of Escherichia coli genes and metabolism (EcoCyc) [41], GO, and TAMBIS, the featured example and the subject of Chapter 7 of this book.
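The BRCA1 orthologue example above can be made concrete with a small set comparison. The following is only an illustrative sketch: the species sets are copied from the text, while the dictionary and function names are assumptions introduced here and do not correspond to any of the systems discussed.

```python
# Species for which each site returns an orthologue of human BRCA1,
# as listed in the text. The variable and function names are hypothetical.
brca1_orthologues = {
    "GeneCards": {"mouse", "C. elegans"},
    "MGD": {"mouse", "rat", "dog"},
    "GDB": {"mouse", "drosophila"},
    "LocusLink": {"mouse"},
}


def agreed_species(results):
    """Species that every source reports."""
    return set.intersection(*results.values())


def disputed_species(results):
    """Map each species not reported by all sources to the sources reporting it."""
    agreed = agreed_species(results)
    disputed = {}
    for source, species_set in results.items():
        for species in species_set - agreed:
            disputed.setdefault(species, []).append(source)
    return disputed


consensus = agreed_species(brca1_orthologues)
disputed = disputed_species(brca1_orthologues)
print("All sources agree on:", consensus)
print("Reported by only some sources:", disputed)
```

Only the mouse gene is common to all four answers, which is why an integration layer cannot treat "orthologue" as meaning the same thing in every source.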
Semantic Integration Featured Example: TAMBIS

The TAMBIS system is the result of a research collaboration between the departments of computer science and biological sciences at the University of Manchester in England. Its chief components are an ontology of biological and bioinformatics terms managed by a terminology server and a wrapper service that, as in DiscoveryLink, handles access to external data sources. An ontology is a rigorous formal specification of the conceptualization of a domain. The TAMBIS ontology (TaO) [42] describes the biologist's knowledge in a manner independent of individual data sources, links concepts to their real equivalents in the data sources, mediates between (near) equivalent concepts in the sources, and guides the user to form appropriate biological queries. The TaO contains approximately 1800 asserted biological concepts and their relationships and is capable of inferring many more. Coverage currently includes proteins and nucleic acids, protein structure and structural classification, biological processes and functions, and taxonomic classification. The categorization of TAMBIS based on the six dimensions is given in Table 3.4.
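To give a feel for how a small number of asserted relationships can yield additional inferred ones, here is a toy sketch. It is not the TaO and does not reproduce the reasoning TAMBIS actually performs; the concepts, the single "is a kind of" relationship, and the function names are all assumptions chosen for illustration.

```python
# Hypothetical asserted "is a kind of" relationships (child -> parent).
# These concepts are illustrative and are not taken from the TaO.
asserted_is_a = {
    "kinase": "enzyme",
    "enzyme": "protein",
    "protein": "macromolecule",
    "nucleic acid": "macromolecule",
}


def ancestors(concept, is_a):
    """Return every concept that `concept` is a kind of, nearest first."""
    found = []
    while concept in is_a:
        concept = is_a[concept]
        found.append(concept)
    return found


def all_pairs(is_a):
    """All (child, ancestor) pairs, whether asserted or inferred by transitivity."""
    return {(child, anc) for child in is_a for anc in ancestors(child, is_a)}


asserted = set(asserted_is_a.items())
inferred_only = all_pairs(asserted_is_a) - asserted
print(sorted(inferred_only))
```

Four asserted relationships produce three further inferred ones here; a real ontology with richer relationship types would infer considerably more.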

3.5 STRENGTHS AND WEAKNESSES OF THE VARIOUS APPROACHES TO INTEGRATION
This chapter has described multiple approaches to database integration in the bioinformatics domain and provided examples of each. Each of these approaches has strengths and weaknesses and is best suited to a particular set of integration needs.


Browsing: Interactive browser
Querying: Limited querying capability via parameterized query builder
Semantic: According to TAMBIS' authors, its "big win" lies in the ontology
Syntactic: Integrates via its wrapper service
Warehouse: Not used
Federation: Uses BioKleisli for federated integration
Declarative Access: Uses the CPL query language, but users see only the parameterized query builder
Procedural Access: No procedural access
Generic: Information not available
Hard-Coded: Information not available
Relational Data Model: Relational data model not used
Non-Relational Data Model: Object/complex-relational data model

TABLE 3.4 TAMBIS categorization with respect to the six dimensions of integration.

3.5.1 Browsing and Querying: Strengths and Weaknesses
The strengths of a browsing approach are many. As noted previously, its interactive nature makes it especially well suited to exploring the data landscape when an investigator has not yet formulated a specific question. It is also well suited to retrieving information about single objects, optionally drilling down to greater levels of detail, and following hyperlinks to related objects. The ubiquity of the Internet makes Web browsers familiar to even the most inexperienced user.

The weaknesses of a browsing approach are the flip side of its strengths. Because it is fundamentally based on visiting single pages containing data on a single object, it is not well suited to handling large data sets or to performing a large, multi-step workflow including significant processing of interim results. Its flexibility is also limited, as the user is confined to the query forms and navigation paths the application provides.

The strengths of a querying approach are the natural opposite of those of the browsing approach. Because it is based on specifying attributes of result sets via a query language, often with quite complex search conditions, the querying approach is well suited to multi-step workflows resulting in large result sets. This approach is also flexible, allowing the user to specify inclusion and exclusion criteria precisely and to state which attributes to include in the final result set. Contrariwise, the querying approach is not as well suited to the exploration or manual inspection of interim results, and the need to specify desired results using query language syntax requires more computational sophistication than many potential users possess.
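As a rough illustration of what "specifying inclusion and exclusion criteria" and "which attributes to include" look like in practice, the sketch below runs one SQL query with Python's standard sqlite3 module. The gene table, its columns, and its rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

# Hypothetical table and rows, used only to make the query runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (symbol TEXT, organism TEXT, chromosome TEXT, length_bp INTEGER)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?)",
    [
        ("geneA", "human", "17", 81000),
        ("geneB", "human", "13", 84000),
        ("geneC", "mouse", "11", 60000),
        ("geneD", "human", "17", 1200),
    ],
)

rows = conn.execute(
    """
    SELECT symbol, length_bp          -- attributes to include in the result set
    FROM gene
    WHERE organism = 'human'          -- inclusion criterion
      AND chromosome <> '13'          -- exclusion criterion
      AND length_bp > 5000            -- further search condition
    ORDER BY symbol
    """
).fetchall()
print(rows)
```

The same request made through a browsing interface would require visiting each gene page in turn, which is the contrast the text draws.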

3.5.2 Warehousing and Federation: Strengths and Weaknesses
A major strength of a data warehousing approach is that it permits cleansing and filtering of data because an independent copy of the data is being maintained. If the original data source is not structured optimally to support the most commonly desired queries, a warehousing approach may transform the data to a more amenable structure. Copying remote data to a local warehouse can yield excellent query performance on the warehouse, all other things being equal. Warehousing exerts a load on the remote sources only at data refresh times, and changes in the remote sources do not directly affect the warehouse's availability.

The primary weakness of the data warehousing approach is the heavy maintenance burden incurred by maintaining a cleansed, filtered, transformed copy of remote data sources. The warehouse must be refreshed frequently to ensure users' access to up-to-date data; the warehousing approach is probably not the method of choice for integrating large data sources that change on a daily basis. Adding a data source to a warehouse requires significant development, loading, and maintenance overhead; therefore this approach is unlikely to scale well beyond a handful of data sources. Warehousing data may lose the specialized search capability of the native data sources; an example would be specialized text searching over documents or sub-structure searching over chemical compound data collections.

A major strength of the federated approach is that the user always enjoys access to the most up-to-date data possible. While connectivity to remote sources requires some maintenance, the burden of adding and maintaining a new data source is considerably less than in the warehousing case. The federated approach scales well, even to very large numbers of data sources, and it readily permits new sources to be added to the system on a prototype or trial basis to evaluate their potential utility to users. In a fast-paced, ever-changing field like bioinformatics, this nimbleness is invaluable. The federated approach meshes well with a landscape of many individual, autonomous data sources, which the bioinformatics community currently boasts. Finally, a federated system can provide access to data that cannot be easily copied into a warehouse, such as data only available via a Web site.

Any data cleansing must be done on the fly, for a federation accesses remote data sources in their native form. The members of the federation must be able to handle the increased load put on them by federated queries, and if network bandwidth is insufficient, performance will suffer.
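The trade-off between freshness and load described above can be sketched in a few lines. This is a deliberately simplified model, not an implementation of any real warehouse or federation product: the class names, the in-memory "remote source", and the record contents are all assumptions made for illustration.

```python
class RemoteSource:
    """Stand-in for an autonomous remote database (hypothetical)."""

    def __init__(self, records):
        self.records = dict(records)
        self.requests = 0  # counts the load placed on the source

    def fetch_all(self):
        self.requests += 1
        return dict(self.records)

    def fetch(self, key):
        self.requests += 1
        return self.records.get(key)


class Warehouse:
    """Keeps a cleansed local copy; touches the source only at refresh time."""

    def __init__(self, source, cleanse):
        self.source = source
        self.cleanse = cleanse
        self.copy = {}

    def refresh(self):
        self.copy = {k: self.cleanse(v) for k, v in self.source.fetch_all().items()}

    def query(self, key):
        return self.copy.get(key)


class Federation:
    """Holds no copy; every query goes to the source and is cleansed on the fly."""

    def __init__(self, source, cleanse):
        self.source = source
        self.cleanse = cleanse

    def query(self, key):
        value = self.source.fetch(key)
        return None if value is None else self.cleanse(value)


source = RemoteSource({"P1": "  kinase "})
warehouse = Warehouse(source, str.strip)
federation = Federation(source, str.strip)

warehouse.refresh()
source.records["P1"] = " phosphatase "  # the remote source changes after the refresh

stale = warehouse.query("P1")   # answered from the local copy
fresh = federation.query("P1")  # answered from the live source
print(stale, fresh, source.requests)
```

The warehouse answers without contacting the source but returns the pre-update value until its next refresh; the federation returns the current value at the cost of one remote request per query.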

3.5.3 Procedural Code and Declarative Query Language: Strengths and Weaknesses
Procedural code may be tuned very precisely for a specific task. There are virtually no limitations on its expressive power; however, this very strength can make it difficult to optimize. Ad hoc inquiries can be difficult to support, and extending the system to handle additional sources or additional queries can be difficult.

Declarative languages are flexible and permit virtually unlimited ad hoc querying. Queries expressed in a declarative language are relatively easy to program and maintain due to their small size and economy of expression. Sometimes, however, their simplicity is misleading; for example, it is easy to write a syntactically correct SQL query, but the results returned may not be what was intended because the query was written using the wrong constructs for the desired meaning. Finally, some programming tasks are much more easily written in a procedural language than a declarative one; the classic example is recursive processing over tree-like structures.
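Two of the points above lend themselves to a short demonstration. The sketch below, using Python's standard sqlite3 module, first shows a syntactically valid SQL query that silently answers a different question than the one intended (an inner join where an outer join was needed), and then shows a recursive traversal of a small tree in procedural code. The tables, rows, and function name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (symbol TEXT PRIMARY KEY);
    CREATE TABLE citation (symbol TEXT, ref_label TEXT);
    INSERT INTO gene VALUES ('geneA'), ('geneB'), ('geneC');
    INSERT INTO citation VALUES ('geneA', 'ref1'), ('geneA', 'ref2'), ('geneB', 'ref3');
    """
)

# Intended meaning: every gene with its citation count, including genes with none.
# The inner join is valid SQL but drops genes that have no citations.
inner = conn.execute(
    "SELECT g.symbol, COUNT(c.ref_label) FROM gene g "
    "JOIN citation c ON c.symbol = g.symbol "
    "GROUP BY g.symbol ORDER BY g.symbol"
).fetchall()

# The left outer join expresses the intended meaning.
left = conn.execute(
    "SELECT g.symbol, COUNT(c.ref_label) FROM gene g "
    "LEFT JOIN citation c ON c.symbol = g.symbol "
    "GROUP BY g.symbol ORDER BY g.symbol"
).fetchall()

# Recursive processing over a tree-like structure in procedural code.
taxonomy = {
    "name": "Mammalia",
    "children": [
        {"name": "Rodentia", "children": [
            {"name": "mouse", "children": []},
            {"name": "rat", "children": []},
        ]},
        {"name": "Carnivora", "children": [
            {"name": "dog", "children": []},
        ]},
    ],
}


def leaves(node):
    """Collect the names of all leaf nodes beneath (and including) `node`."""
    if not node["children"]:
        return [node["name"]]
    result = []
    for child in node["children"]:
        result.extend(leaves(child))
    return result


print(inner)
print(left)
print(leaves(taxonomy))
```

Both queries run without error; only the second includes the gene that has no citations, which is the kind of discrepancy that is easy to miss when reading results.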

3.5.4 Generic and Hard-Coded Approaches: Strengths and Weaknesses
Generic coding is generally acknowledged to be desirable, where practicable, due to its extensibility and maintainability and because it facilitates code re-use. It does, however, incur a greater up-front cost than hard-coding for a specific task, and sometimes schedules do not permit this up-front expenditure. If the instances being generalized are not sufficiently similar, the complexity of generic code can be prohibitive.

Hard-coding permits an application to be finely tuned to optimize for a specific critical case, potentially yielding very fast response times; this approach may be the preferred strategy when only a limited set of queries involving large data sets is required. In the absence of an already existing generic system, it is generally quicker to prototype rapidly by hard-coding. On the other hand, code with many system-specific assumptions or references can be difficult to maintain and extend. Adding a new data source or even a new query often means starting from scratch.
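The contrast can be illustrated with a small, admittedly artificial example: parsing one record from a flat-file source. The record layouts, field names, and function names below are assumptions invented for this sketch.

```python
def parse_source_x_hardcoded(line):
    """Hard-coded: assumes source X's exact tab-separated layout."""
    fields = line.split("\t")
    return {"id": fields[0], "name": fields[2]}


def make_parser(delimiter, columns):
    """Generic: each source's layout is supplied as data rather than written as code."""

    def parse(line):
        fields = line.split(delimiter)
        return {name: fields[index] for name, index in columns.items()}

    return parse


# Supporting a second source with the generic parser needs only a new description.
parse_x = make_parser("\t", {"id": 0, "name": 2})
parse_y = make_parser("|", {"id": 1, "name": 0})

record_x = parse_x("X1\tignored\talpha")
record_y = parse_y("beta|Y7")
print(record_x, record_y)
```

The hard-coded function is shorter and quicker to write, but supporting source Y would mean writing a second function, whereas the generic version costs more up front and absorbs the new source as configuration.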


3.5.5 Relational and Non-Relational Data Models: Strengths and Weaknesses
The relational data model is based on a well-understood, theoretically rock-solid foundation. Relational technology has been maturing for the past 30 years and can provide truly industrial-strength robustness and constant availability. Relational databases prevent anomalies while multiple users are reading and writing concurrently, thus safeguarding data integrity. Optimization of queries over relational databases has been developed and honed for decades. The SQL query language is powerful and widely used, so SQL programmers are relatively easy to find. However, the relational model is based on tables of rows and columns, and several individual tables are typically required to represent a single complex biological object.

Hierarchical non-relational data models seem to be a more natural fit for complex scientific objects. However, this technology is still quite immature, and standard database desiderata such as cost-based query optimization, data integrity, and multi-user concurrency have been hard to attain because of the increased complexity of the non-relational systems.
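The remark that several tables are typically needed to represent one complex biological object can be illustrated as follows. This is a minimal sketch using Python's standard sqlite3 module; the gene record, the three-table schema, and the helper functions are hypothetical and far simpler than a real schema would be.

```python
import sqlite3

# One "biological object" as a hierarchical record (all values hypothetical).
gene_object = {
    "symbol": "geneA",
    "transcripts": [
        {"name": "t1", "exons": [1, 2, 3]},
        {"name": "t2", "exons": [1, 3]},
    ],
}

# The same object in relational form needs one table per level of nesting.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (symbol TEXT PRIMARY KEY);
    CREATE TABLE transcript (name TEXT PRIMARY KEY,
                             gene_symbol TEXT REFERENCES gene(symbol));
    CREATE TABLE exon (transcript_name TEXT REFERENCES transcript(name),
                       exon_number INTEGER);
    """
)


def store(conn, obj):
    """Split the nested object across the three tables."""
    conn.execute("INSERT INTO gene VALUES (?)", (obj["symbol"],))
    for t in obj["transcripts"]:
        conn.execute("INSERT INTO transcript VALUES (?, ?)", (t["name"], obj["symbol"]))
        conn.executemany(
            "INSERT INTO exon VALUES (?, ?)", [(t["name"], n) for n in t["exons"]]
        )


def load(conn, symbol):
    """Reassemble the nested object with a join."""
    rows = conn.execute(
        "SELECT t.name, e.exon_number FROM transcript t "
        "JOIN exon e ON e.transcript_name = t.name "
        "WHERE t.gene_symbol = ? ORDER BY t.name, e.exon_number",
        (symbol,),
    ).fetchall()
    transcripts = {}
    for name, number in rows:
        transcripts.setdefault(name, []).append(number)
    return {
        "symbol": symbol,
        "transcripts": [{"name": n, "exons": ex} for n, ex in transcripts.items()],
    }


store(conn, gene_object)
round_trip = load(conn, "geneA")
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
print(table_count, round_trip == gene_object)
```

The nested record is a single value in the hierarchical form, while the relational form needs three tables and a join to reassemble it. A middle software layer of the kind mentioned later in this chapter would hide that reassembly from users.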

3.5.6 Conclusion: A Hybrid Approach to Integration Is Ideal
Considering the variety of integration needs in a typical organization, a hybrid approach to database integration is generally the best strategy. For data that it is critical to clean, transform, or hand curate, and for which only the best query performance is adequate, data warehousing is probably the best approach. If the warehouse is derived from data outside the organization, it is best if the original data source changes infrequently, so the maintenance burden in merging updates is not too onerous. Otherwise, the federated model is an excellent choice because of its relatively low maintenance cost and its extensibility and scalability. Federations allow easy prototyping and swapping of new data sources for old in evaluation mode, and they permit integration of external data that is not accessible for duplicating internally, such as data only available via Web sites. They also permit the integration of special-purpose search algorithms such as sequence comparison, secondary structure prediction, text mining, clustering, chemical structure searching, and so forth.

Wherever possible, strategies should be generic, except for one-time, one-use programs or where hard-coding is needed to fine-tune a limited set of operations over a limited set of data.

Both browsing and querying interfaces are important for different levels of users and different needs. For access to data in batch mode, the most common queries can be pre-written and parameterized and offered to users via a Web form-based interface. Both semantic and syntactic data integration are needed, although semantic integration is just beginning to be explored and understood.

Due to the maturity of the technology and its industrial strength, the relational data model is currently the method of choice for large integration efforts, both warehousing and federation. A middle software layer may be provided to expose biological objects to users, as mentioned previously. But based on the current state of the industry, the underlying data curation, storage, query planning, and optimization are arguably best done in relational databases.
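The idea of pre-written, parameterized queries offered through a form can be sketched as follows, again with Python's standard sqlite3 module. The query catalogue, table, rows, and function name are assumptions for illustration; a real deployment would sit behind a Web form rather than a function call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (symbol TEXT, organism TEXT, chromosome TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [("geneA", "human", "17"), ("geneB", "human", "13"), ("geneC", "mouse", "11")],
)

# Pre-written queries, keyed by the name a form would expose (all hypothetical).
CANNED_QUERIES = {
    "genes_by_organism": (
        "SELECT symbol FROM gene WHERE organism = :organism ORDER BY symbol"
    ),
    "genes_on_chromosome": (
        "SELECT symbol FROM gene "
        "WHERE organism = :organism AND chromosome = :chromosome ORDER BY symbol"
    ),
}


def run_canned(conn, name, **params):
    """Run a pre-written query; users supply parameter values, never SQL text."""
    sql = CANNED_QUERIES[name]
    return [row[0] for row in conn.execute(sql, params)]


human_genes = run_canned(conn, "genes_by_organism", organism="human")
chr17_genes = run_canned(conn, "genes_on_chromosome", organism="human", chromosome="17")
print(human_genes, chr17_genes)
```

Binding form values as named parameters keeps user input out of the SQL text itself, so the set of questions users can ask is exactly the set the catalogue provides.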

3.6 TOUGH PROBLEMS IN BIOINFORMATICS INTEGRATION


In spite of the variety of techniques and approaches to data integration in bioinformatics, many tough integration problems remain. These include query processing in a federated system when some members of the federation are inaccessible; universally accepted standards of representation for central biological concepts such as gene, protein, transcript, sequence, polymorphism, and pathway; and representing and querying protein and DNA interaction networks. This section discusses two additional examples of tough problems in bioinformatics integration: semantic query planning and schema management.

3.6.1 Semantic Query Planning Over Web Data Sources
While the TAMBIS and GO projects have made an excellent start in tackling the semantic integration problem, more remains to be done. TAMBIS and GO have focused on building ontologies and controlled vocabularies for biological concepts. Another fruitful area of investigation in semantic integration is using knowledge of the semantics of data sources to generate a variety of alternative methods of answering a question of scientific interest, thus freeing the user from the need to understand every data source in detail [43].

Recall that in accessing multiple data sources there are usually multiple ways of executing a single query, or multiple query execution plans. Each may have a different execution cost, as discussed in Section 3.2.1. Similarly, there may be multiple data sources that can be used to arrive at an answer to the same general question, though the semantics of the result may differ slightly. A semantic query planner considers not only the cost of different execution plans but also their semantics and generates alternate paths through the network of interconnected data sources. The goal is to help the user obtain the best possible answers to questions of scientific interest.

Web sources are ubiquitous in bioinformatics, and they are connected to each other in a complex tangle of relationships. Links between sources can be either explicit hypertext links or constructed calls in which an identifier for a remote data source may be extracted from a Web document and used to construct a Uniform Resource Identifier (URI) to access the remote source.

Not all inter-data source links are semantically equivalent. For example, there are two ways of navigating from PubMed to GenBank: through explicit occurrences of GenBank accession numbers within the secondary source identifier (SI) attribute of the MEDLINE formatted entry and through the Entrez Nucleotide Link display option. Following these two navigation paths does not always produce the same set of GenBank entries. For example, for the PubMed entry with ID 8552191, there are four embedded GenBank accession numbers, while the Nucleotide Links option yields 10 sequence entries (the four embedded entries plus related RefSeq entries) [43].

To generate alternate plans, a semantic query planner requires knowledge of certain characteristics of Web sources, including their query and search capabilities, the links between sources, and overlaps between the contents of data sources. An example of modeling a subset of the search capabilities of PubMed is as follows:
1. Search by key (PubMedID or MedlineID), returning a single entry. Single or multiple bindings for PubMedID or MedlineID are accepted.

2. Search by phrase, returning multiple entries. For example, the search term gene expression performs an untyped text search; science [journal] returns all articles from Science; and 2001/06:2002 [pdat] returns articles published from June 2001 through 2002.

To further illustrate semantic query planning, consider the following query: "Given a Human Genome Organization (HUGO) name, retrieve all associated PubMed citations." There are at least three plans for this query:
1. Search PubMed directly for the HUGO name using the second search capability above.

2. Find the GeneCards entry for the HUGO name and follow its link to PubMed publications.


3. From the GeneCards entry for the HUGO name, follow the links to the Entrez RefSeq entry and extract the relevant PubMed identifiers.

These three plans all return different answers. For example, given the HUGO name BIRC1 (neuronal apoptosis inhibitory protein), plan 1 returns no answer, plan 2 returns two answers (PubMed identifiers 7813013 and 9503025), and plan 3 returns five answers (including the two entries returned by plan 2) [43].

In summary, a query planner that took advantage of semantic knowledge of Web data sources and their search capabilities would first identify that there are multiple alternate sources and capabilities to answer a query. Then, semantic knowledge would be used to determine if the results of each alternate plan would be identical. Finally, such a planner might suggest these alternate plans to a user, whose expert judgment would determine which plan was the most suitable to the scientific task. The user would be freed to focus on science instead of on navigating the often treacherous waters of data source space. Semantic query planning will be addressed in more detail in Section 4.4.2 in Chapter 4.
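
The three plans for the HUGO query can be viewed as alternate paths through a graph whose nodes are data sources and whose edges are links or search capabilities. The following Python sketch enumerates such paths. It is an illustration under stated assumptions, not the planner described in [43]: the LINKS catalog, the edge labels, and the helper functions build_graph and enumerate_plans are hypothetical names introduced here, and only the source names come from the text.

```python
from collections import defaultdict

# Hypothetical link catalog: (source, target, kind of link).
# The node names follow the text; the catalog structure is an assumption.
LINKS = [
    ("HUGO name", "PubMed", "phrase search"),
    ("HUGO name", "GeneCards", "key lookup"),
    ("GeneCards", "PubMed", "explicit hypertext link"),
    ("GeneCards", "Entrez RefSeq", "explicit hypertext link"),
    ("Entrez RefSeq", "PubMed", "extracted identifiers"),
]


def build_graph(links):
    """Index outgoing links by source name."""
    graph = defaultdict(list)
    for source, target, kind in links:
        graph[source].append((target, kind))
    return graph


def enumerate_plans(graph, start, goal, path=None):
    """Yield every cycle-free path from start to goal as a list of node names."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for target, _kind in graph.get(start, []):
        if target not in path:  # never revisit a source within one plan
            yield from enumerate_plans(graph, target, goal, path)


plans = list(enumerate_plans(build_graph(LINKS), "HUGO name", "PubMed"))
for number, plan in enumerate(plans, start=1):
    print(f"Plan {number}: " + " -> ".join(plan))
```

The sketch only discovers that alternate plans exist. Deciding whether two plans return identical results, as the BIRC1 example shows they may not, requires semantic knowledge about the sources that this sketch does not model.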

3.6.2 Schema Management


A schema management system supports databases and information systems as they deal with a multitude of schemas in different versions, structure, semantics, and format. Schema management is required whenever data is transformed from one structure to another, such as publishing relational data as XML on a Web site, restructuring relational data in hierarchical form for a biological object conceptual view layer, and integrating overlapping data sets with different structures, as needed in a merger of two large pharmaceutical companies. The system developed by the Clio research project [44] is an example of a basic schema management system with plans for development in the direction to be described.

The six building blocks of a schema management system are listed below. They will be defined and illustrated through three use cases.
• Schema association/schema extraction
• Schema versioning/schema evolution
• Schema mapping/query decomposition
• View building/view composition
• Data transformation
• Schema integration


Use Case: Data Warehousing

As described earlier, data warehousing is often used as an integration approach when the data must be extensively cleaned, transformed, or hand-curated. A warehouse may be built from a variety of data sources in different native formats. Schema association determines if these heterogeneous documents match a schema already stored in the schema manager. If no existing schema is found, schema extraction determines a new schema based on the data, for example, an XML document, and adds it to the schema manager. Schema integration helps develop a warehouse global schema accommodating all relevant data sources. As the warehouse evolves, schema mapping determines how to map between the schemas of newly discovered data sources and the warehouse's global schema. Finally, data transformation discovers and executes the complex operations needed to clean and transform the source data into the global warehouse schema. The transformations generated would be specified in the XML query language (XQuery) or Extensible Stylesheet Language Transformations (XSLT) for XML data, and SQL for relational data.
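
Schema association and schema extraction can be illustrated with a deliberately simplified Python sketch. It assumes that a schema is nothing more than the set of element paths found in an XML document, which is far cruder than what a system such as Clio would use. The SchemaManager class, its associate and register methods, and the sample documents are hypothetical and introduced only for illustration.

```python
import xml.etree.ElementTree as ET


def extract_schema(xml_text):
    """Derive a crude schema: the set of element paths found in the document."""
    paths = set()

    def walk(element, prefix):
        path = f"{prefix}/{element.tag}"
        paths.add(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return frozenset(paths)


class SchemaManager:
    """Toy registry mapping schema names to extracted path sets (hypothetical)."""

    def __init__(self):
        self.schemas = {}

    def associate(self, xml_text):
        """Schema association: name of a stored schema covering the document."""
        candidate = extract_schema(xml_text)
        for name, schema in self.schemas.items():
            if candidate <= schema:  # every path in the document is known
                return name
        return None

    def register(self, name, xml_text):
        """Schema extraction: store a new schema derived from the data."""
        self.schemas[name] = extract_schema(xml_text)


manager = SchemaManager()
# Illustrative documents; the element names are assumptions.
gene_doc = "<gene><symbol>BIRC1</symbol><organism>human</organism></gene>"
citation_doc = "<citation><pmid>7813013</pmid></citation>"

manager.register("gene_v1", gene_doc)
print(manager.associate("<gene><symbol>BIRC1</symbol></gene>"))
print(manager.associate(citation_doc))
```

In this sketch a document matches a stored schema when all of its element paths are already known. A document that matches nothing, such as the citation above, would be passed to register so that its schema is added to the manager.
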
Use Case: Query and Combine Old and New Data

Because bioinformatics is a young, research-oriented field, database schemas to hold lab notebook data change frequently as new experimental techniques are developed. Industry standards are still emerging, and they evolve and change rapidly. Suppose two Web sites at a large pharmaceutical company were using two different versions of the same database schema, but scientists wanted to query the old-version and new-version data sources in uniform fashion, without worrying about the schema versions. Schema evolution keeps track of schema versions and their differences. Query decomposition allows the user to query old and new documents in a single query, as if they all conformed to the latest schema version, using knowledge of the differences between versions.
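
One way to picture schema evolution and query decomposition is the following Python sketch, which uses an in-memory SQLite database. It assumes that the only difference between schema versions is a set of renamed columns, a much simpler situation than real schema evolution. The table names assay_v1 and assay_v2, their columns, the COLUMN_MAP dictionary, the decompose function, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical version history: logical (latest) column -> physical column.
COLUMN_MAP = {
    "assay_v1": {"sample_id": "sample", "intensity": "signal"},
    "assay_v2": {"sample_id": "sample_id", "intensity": "intensity"},
}


def decompose(columns, where_column, table_map=COLUMN_MAP):
    """Rewrite one query on the latest schema into one sub-query per version."""
    parts = []
    for table, mapping in table_map.items():
        select_list = ", ".join(f"{mapping[c]} AS {c}" for c in columns)
        parts.append(
            f"SELECT {select_list} FROM {table} WHERE {mapping[where_column]} > ?"
        )
    return " UNION ALL ".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE assay_v1 (sample TEXT, signal REAL)")
conn.execute("CREATE TABLE assay_v2 (sample_id TEXT, intensity REAL)")
# Made-up rows for illustration only.
conn.execute("INSERT INTO assay_v1 VALUES ('s1', 0.4), ('s2', 2.5)")
conn.execute("INSERT INTO assay_v2 VALUES ('s3', 3.1), ('s4', 0.2)")

sql = decompose(["sample_id", "intensity"], "intensity")
threshold = 1.0
# One bound parameter is needed per generated sub-query.
rows = conn.execute(sql, (threshold,) * len(COLUMN_MAP)).fetchall()
print(sorted(rows))
```

The user writes the query once against the latest column names. The recorded differences between versions are what allow it to be rewritten for the old table and combined with the result from the new one.
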
Use Case: Data Federation

Assume a federated database system integrates relational (e.g., MGD, GO, GeneLynx [45]) and XML data sources (e.g., PubMed and the output of bioinformatics algorithms such as BLAST) and provides integrated SQL access to them. View building allows users to build customized views on top of relational and XML schemas using a graphical interface. Schema mapping provides knowledge about correspondences among the different sources. To respond efficiently to queries against these sources, view composition and query decomposition must use the correspondences gained through schema mapping and view building to issue the right queries to the right sub-systems. Finally, knowledge about the data sources' capabilities and global query optimization allow the processor to push expensive operations to local sources as appropriate.
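
The idea of pushing operations to local sources according to their capabilities can be sketched in Python as follows. This is a minimal illustration, assuming each source declares a single capability (whether it can evaluate a filter itself); real federated systems model far richer capabilities and costs. The Source class, the federated_select function, the source names, and the sample rows are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A federated data source with one declared capability (hypothetical)."""
    name: str
    rows: list
    can_filter: bool  # True if the source can evaluate predicates locally
    log: list = field(default_factory=list)

    def scan(self, predicate=None):
        """Return rows, applying the predicate locally when one is pushed down."""
        if predicate is not None:
            self.log.append("filtered locally")
            return [row for row in self.rows if predicate(row)]
        self.log.append("full scan")
        return list(self.rows)


def federated_select(sources, predicate):
    """Push the predicate to capable sources; filter the rest in the mediator."""
    results = []
    for source in sources:
        if source.can_filter:
            results.extend(source.scan(predicate))
        else:
            results.extend(row for row in source.scan() if predicate(row))
    return results


# Made-up rows; "geneX" and the scores are placeholders, not real data.
relational = Source(
    "relational_source",
    [{"gene": "BIRC1", "score": 0.9}, {"gene": "geneX", "score": 0.1}],
    can_filter=True,
)
xml_files = Source("xml_source", [{"gene": "BIRC1", "score": 0.7}], can_filter=False)

hits = federated_select([relational, xml_files], lambda r: r["gene"] == "BIRC1")
print(hits)
print(relational.log, xml_files.log)
```

The logs show that the capable source filtered its own rows, while the other source had to be scanned in full and filtered by the mediator. Avoiding that second, more expensive pattern wherever possible is the purpose of capability-aware optimization.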

3.7 SUMMARY
Effective data management and integration are critical to the success of bioinformatics, and this chapter has introduced key concepts in these technical areas. While the wide and varied landscape of integration approaches can seem overwhelming to the beginner, this chapter has offered six dimensions by which to characterize current and new integration efforts: browsing/querying, declarative/procedural code, generic/hard-coded code, semantic/syntactic integration, data warehousing/federation, and relational/non-relational data model. Basic definitions and the relative strengths and weaknesses of a variety of approaches were explored through a series of use cases, which are summarized in Table 3.5. The optimal strategy for a given organization or research project will vary with its individual needs and constraints, but it will likely be a hybrid strategy, based on a careful consideration of the relative strengths and weaknesses of the various approaches.

While many areas of data integration are solved or nearly so, tough, largely unsolved problems still remain. The chapter concluded by highlighting two of them: semantic query planning and schema management.

Use Case                                              Preferred Approach
3.2.1.1 Simple curated gene data source               Relational technology
3.2.1.2 Retrieving genes and associated               Relational technology
  expression results
3.2.3.2 Transforming database structure               Data warehousing, views
3.3.1 Multi-source heterogeneous integration          Federation, querying (DiscoveryLink, 3.4.3)
3.4.1.1 Exploring sequences associated with           Browsing (Entrez)
  recent articles about metalloproteases
3.4.2.1 Database of known and predicted human         Warehouse (AllGenes)
  and mouse genes and transcripts
3.4.4 Querying through unified biological concepts    Semantic integration (TAMBIS)

TABLE 3.5 Summary of use cases and approaches.


ACKNOWLEDGMENTS

Warm thanks are offered to my colleagues at IBM for many stimulating discussions and collegial support: Laura Haas, Peter Schwarz, Julia Rice, and Felix Naumann. Thanks also to Carole Goble and Robert Stevens of the University of Manchester for kindly providing materials on the TAMBIS project; Howard Ho and the IBM Clio team for their generous contributions to the schema management section; and Bill Swope for help on data warehousing. Finally, I thank the editors and reviewers for their patience and helpful suggestions.


REFERENCES
[1] H. H. Rashidi and L. K. Buehler. Bioinformatics Basics: Applications in Biological Science and Medicine. Boca Raton, FL: CRC Press, 2000.
[2] J. D. Ullman and J. Widom. A First Course in Database Systems. Upper Saddle River, NJ: Prentice Hall, 1997.
[3] D. Benson, I. Karsch-Mizrachi, D. Lipman, et al. "GenBank." Nucleic Acids Research 31, no. 1 (2003): 23-27, http://www.ncbi.nlm.nih.gov/Genbank.
[4] B. Boeckmann, A. Bairoch, R. Apweiler, et al. "The SWISS-PROT Protein Knowledgebase and Its Supplement TrEMBL in 2003." Nucleic Acids Research 31, no. 1 (2003): 365-370.
[5] A. Bateman, E. Birney, L. Cerruti, et al. "The Pfam Protein Families Database." Nucleic Acids Research 30, no. 1 (2002): 276-280.
[6] M. Ashburner, C. A. Ball, J. A. Blake, et al. "Gene Ontology: Tool for the Unification of Biology. The Gene Ontology Consortium." Nature Genetics 25, no. 1 (2000): 25-29.
[7] D. L. Wheeler, D. M. Church, A. E. Lash, et al. "Database Resources of the National Center for Biotechnology Information: 2002 Update." Nucleic Acids Research 30, no. 1 (2002): 13-16.
[8] M. Kanehisa, S. Goto, S. Kawashima, et al. "The KEGG Databases at GenomeNet." Nucleic Acids Research 30, no. 1 (2002): 42-46.
[9] Microsoft Corporation. Microsoft Excel 2000, 1999.
[10] National Center for Biotechnology Information. The Entrez Search and Retrieval System. http://www.ncbi.nlm.nih.gov/Entrez, 2002.
[11] J. M. Ostell, S. J. Wheelan, and J. A. Kans. "The NCBI Data Model." Methods of Biochemical Analysis 43 (2001): 19-43.
[12] T. Etzold and P. Argos. "SRS: An Indexing and Retrieval Tool for Flat File Data Libraries." Computer Applications in the Biosciences 9, no. 1 (1993): 49-57.
[13] E. F. Codd. "A Relational Model of Data for Large Shared Data Banks." Communications of the ACM 13, no. 6 (1970): 377-387.
[14] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, et al. "Access Path Selection in a Relational Database Management System." Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, MA, May 30-June 1. ACM (1979): 23-34.
[15] T. Bray, J. Paoli, C. M. Sperberg-McQueen, et al. Extensible Markup Language (XML). World Wide Web Consortium (W3C) Recommendation, 2nd edition, October 6, 2000, http://www.w3.org/TR/REC-xml/html.
[16] http://www.w3c.org/xme/query.
[17] The Acero Genome Knowledge Platform, http://www.acero.com.
[18] S. F. Altschul, W. Gish, W. Miller, et al. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215, no. 3 (1990): 403-410.
[19] I. Letunic, L. Goodstadt, N. J. Dickens, et al. "Recent Improvements to the SMART Domain-Based Sequence Annotation Resource." Nucleic Acids Research 30, no. 1 (2002): 242-244.
[20] J. A. Blake, J. E. Richardson, C. J. Bult, et al. "The Mouse Genome Database (MGD): The Model Organism Database for the Laboratory Mouse." Nucleic Acids Research 30, no. 1 (2002): 113-115.
[21] Z. Lacroix, A. Sahuguet, and R. Chandrasekar. "Information Extraction and Database Techniques: A User-Oriented Approach to Querying the Web." Proceedings of the 1998 10th International Conference on Advanced Information Systems Engineering (CAiSE '98), Pisa, Italy, June 8-12, 1998: 289-304.
[22] L. Wall, T. Christiansen, and R. Schwartz. Programming Perl, 2nd edition. Sebastopol, CA: O'Reilly and Associates, 1996.
[23] T. J. Mowbray and R. Zahavi. The Essential CORBA: Systems Integration Using Distributed Objects. New York: Wiley, 1995.
[24] The Expert Protein Analysis System (ExPASy) server at the Swiss Institute of Bioinformatics, http://www.expasy.ch.
[25] The FlyBase Web site for Drosophila genetics, http://flybase.bio.indiana.edu.
[26] W. M. Gelbart, M. Crosby, B. Matthews, et al. "FlyBase: A Drosophila Database. The FlyBase Consortium." Nucleic Acids Research 25, no. 1 (1997): 63-66.
[27] Gene Logic's GeneExpress Database, http://www.genelogic.com/genexpress.cfm.
[28] M. Cornell, N. W. Paton, S. Wu, et al. "GIMS-A Data Warehouse for Storage and Analysis of Genome Sequence and Functional Data." In Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering (BIBE). Rockville, MD: IEEE Press, 2001, 15-22.
[29] GeneCards Web site at the Weizmann Institute in Israel, http://bioinfo.weizmann.ac.il/cards.
[30] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, et al. "GeneCards: A Novel Functional Genomics Compendium with Automated Data Mining and Query Reformulation Support." Bioinformatics 14, no. 8 (1998): 656-664.
[31] The Computational Biology and Informatics Laboratory. "AllGenes: A Web Site Providing Access to an Integrated Database of Known and Predicted Human and Mouse Genes." (version 5.0, 2002). Center for Bioinformatics, University of Pennsylvania. http://www.allgenes.org.
[32] S. B. Davidson, J. Crabtree, B. P. Brunk, et al. "K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources." IBM Systems Journal 40, no. 2 (2001): 512-531.
[33] B. A. Eckman, A. S. Kosky, and L. A. Laroco Jr. "Extending Traditional Query-Based Integration Approaches for Functional Characterization of Post-Genomic Data." Bioinformatics 17, no. 7 (2001): 587-601.
[34] L. M. Haas, P. M. Schwarz, P. Kodali, et al. "DiscoveryLink: A System for Integrating Life Sciences Data." IBM Systems Journal 40, no. 2 (2001): 489-511.
[35] V. Kashyap and A. Sheth. "Semantic Similarities Between Objects in Multiple Databases." In Management of Heterogeneous and Autonomous Database Systems, 3rd edition, by A. Elmagarmid, M. Rusinkiewicz, and A. Sheth, 57-89. San Francisco: Morgan Kaufmann, 1999.
[36] A. Hamosh, A. F. Scott, J. Amberger, et al. "Online Mendelian Inheritance in Man (OMIM), A Knowledge Base of Human Genes and Genetic Disorders." Nucleic Acids Research 30, no. 1 (2002): 52-55.
[37] The Jackson Lab Web site, http://www.informatics.jax.org/mgihome/overview.shtml.
[38] RatMap, http://ratmap.gen.gu.se/.
[39] Genome DataBase (GDB), http://www.gdb.org.
[40] C. C. Talbot Jr. and A. J. Cuticchia. "Human Mapping Databases." In Current Protocols in Human Genetics, 1.13.1-1.13.12. New York: Wiley, 1999.
[41] A. Pellegrini-Toole, C. Bonavides, and S. Gama-Castro. "The EcoCyc Database." Nucleic Acids Research 30, no. 1 (2002): 56-58.
[42] P. G. Baker, C. A. Goble, S. Bechhofer, et al. "An Ontology for Bioinformatics Applications." Bioinformatics 15, no. 6 (1999): 510-520.
[43] B. A. Eckman, Z. Lacroix, and L. Raschid. "Optimized, Seamless Integration of Biomolecular Data." In Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering (BIBE). Rockville, MD: IEEE, 2001, 23-32.
[44] R. J. Miller, L. M. Haas, L. Yan, et al. "The Clio Project: Managing Heterogeneity." ACM SIGMOD Record 30, no. 1 (2001): 78-83.
[45] B. Lenhard, W. S. Hayes, and W. W. Wasserman. "GeneLynx: A Gene-Centric Portal to the Human Genome." Genome Research 11, no. 12 (2001): 2151-2157.


CHAPTER 4

Issues to Address While Designing a Biological Information System

Zoé Lacroix

Life science has experienced a fundamental revolution from traditional in vivo discovery methods (understanding genes, metabolic pathways, and cellular mechanisms) to electronic scientific discovery consisting of collecting measurement data through a variety of technologies and annotating and exploring the resulting electronic data sets. To cope with this dramatic revolution, life scientists need tools that enable them to access, integrate, mine, analyze, interpret, simulate, and visualize the wealth of complex and diverse electronic biological data. The development of adequate technology faces a variety of challenges. First, there exist thousands of biomedical data sources: There are 323 relevant public resources in molecular biology alone [1]. The number of biological resources increases at a great pace. Previous lists of key public resources in molecular biology contained 203 data sources in 1999 [2], 226 in 2000 [3], and 277 in 2001 [4]. Access to these data repositories is fundamental to scientific discovery. The second challenge comes from the multiple software tools and interfaces that support electronic-based scientific discovery. An early report from the 1999 U.S. Department of Energy Genome Program meeting [5] held in Oakland, California identified these challenges with the following statement:
Genome-sequencing projects are producing data at a rate exceeding current analytical and data-management capabilities. Additionally, some current computing problems are expected to scale up exponentially as the data increase. [5]

The situation has worsened since, while the need for technology to support scientific discovery and bioengineering has significantly increased. Chapter 1 covers the reasons why, ultimately, all of these resources must be combined to form a comprehensive picture. Chapter 2 claims that this challenge may well constitute the backbone of 21st century life science research. In the past, the specific research and development of geographical and spatial data management systems led to the emergence of an important and very active geographic information systems (GIS) community. Likewise, the field of biological information systems (BIS) aiming to support life scientists is now emerging.

To develop biological information systems, computer scientists must address the specific needs of life scientists. The identification of the specifications of computer-aided biology is often impeded by difficulty of communication between life scientists and computer scientists. Two main reasons can be identified. First, life and computer scientists have radically different perspectives in their development activities. These discrepancies can be explained by comparing the design process in engineering and in experimental sciences. A second reason for misunderstanding results from their orthogonal objectives. A computer scientist aims to build a system, whereas a life scientist aims to corroborate a hypothesis. These viewpoints are illustrated in the following.
Engineering vs. Experimental Science

Software development has an approach similar to engineering. First, the specifications (or requirements) of the system to be developed are identified. Then, the development relies on a long initial design phase when most of the cases, if not all, are identified and offered a solution. Only then is a prototype implemented. Later, iterations of the loop design → implementation aim to correct the implementation's failure to meet the requirements effectively and to extend, significantly, the implementation to new requirements. These iterations are typically expressed through codified versioning of the prototype. In practice, initial design phases are typically shortened because of drastic budget cuts and a hurry to market the product. However, short design phases often cause costly revisions that could have been avoided with appropriate design effort.
Bioinformatics aims to support life scientists in the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. Scientific discovery is experimental and follows a progress track blazed by experiments designed to corroborate or refute hypotheses. Each experiment provides the theory with additional material and knowledge that builds the entire picture. An hypothesis can be seen as a design step, whereas an experiment is an implementation of the hypothesis. Learning thus results from multiple iterations of design → implementation, where each refinement of an hypothesis is motivated by the failure of the previous implementation.

These two approaches seem very similar, but they vary by the number of iterations of design → implementation. When computer scientists are in the design phase of developing a new system for life scientists, they often have difficulties in collecting use cases and identifying the specifications of the system prior to implementation. Indeed, life scientists are likely able to provide just enough information to build a prototype, which they expect to evaluate to express more requirements for a better prototype, and so on. This attitude led a company proposing to build a biological data management system to offer to "build a little, test a little" to ensure meeting the system requirements. Somehow the prototype corresponds to an experiment for a life scientist. Understanding these two dramatically different approaches to design is mandatory to develop useful technology to support life scientists.
Generic System vs. Query-Driven Approach

Computer scientists aim to build systems. A system is the implementation of an approach that is generic to many applications having similar characteristics. When provided with use cases or requirements for a new system, computer scientists typically abstract them as much as possible to identify the intrinsic characteristics and therefore design the most generic approach that will perform the requirements in various similar applications.

Life scientists, in their discovery process, are motivated by an hypothesis they wish to validate. The validation process typically involves some data sets extracted from identified data sources and follows a pre-defined manipulation of the collected data. In a nutshell, a validation approach corresponds to a complex query asked against multiple and often heterogeneous data sources. Life scientists have a query-driven approach. These two approaches are orthogonal but not contradictory.
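The idea of a validation step as one query over heterogeneous sources can be made concrete with a small sketch. Everything in it is hypothetical and chosen only for illustration: the tab-separated flat file, the annotation dictionary standing in for a Web resource, the accession numbers, and the function names are not taken from any real data source.

```python
import csv
import io

# Hypothetical flat-file source: tab-separated sequence records.
FLAT_FILE = (
    "accession\torganism\tlength\n"
    "X0001\tS. cerevisiae\t1200\n"
    "X0002\tH. sapiens\t850\n"
)

# Hypothetical second source, standing in for a Web resource:
# annotations keyed by accession number.
ANNOTATIONS = {
    "X0001": {"function": "kinase"},
    "X0002": {"function": "transporter"},
}


def read_flat_file(text):
    """Parse the tab-separated flat file into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def validation_query(flat_text, annotations, organism, min_length):
    """Extract records for one organism, filter them, and attach annotations.

    The fixed sequence (extract, filter, combine) mirrors the pre-defined
    manipulation of collected data described in the text.
    """
    results = []
    for record in read_flat_file(flat_text):
        if record["organism"] != organism:
            continue
        length = int(record["length"])  # flat files carry no types
        if length < min_length:
            continue
        annotation = annotations.get(record["accession"], {})
        results.append(
            {
                "accession": record["accession"],
                "length": length,
                "function": annotation.get("function"),
            }
        )
    return results


hits = validation_query(FLAT_FILE, ANNOTATIONS, "S. cerevisiae", 1000)
```

The function answers exactly one question. A generic system, in the sense used above, would instead expose the sources and let such questions be expressed without writing new code for each one.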
Computer scientists present the value of their approach by illustrating the various queries the system will answer, whereas life scientists value their approach by the quality of the data set obtained and the final validation of the hypothesis. This orthogonality also explains the legacy in bioinformatics implementations, which mostly consist of hard-coded queries that do not offer the flexibility of a system as explained and illustrated in Chapters 1 and 3. This legacy problem is addressed in Section 4.1.2.

This chapter does not aim to present or compare the systems that will be described in the later chapters of this book. Instead it is devoted to issues specific to data management that need to be addressed when designing systems to support life science. As with any technology, data management assumes an ideal world upon which most systems are designed. They appear to suit the needs of large corporate usage such as banking; however, traditional technology fails to adjust properly to many new technological challenges such as Web data management and scientific data management. The following sections introduce these issues.


Section 4.1 presents some of the characteristics of available scientific data and technology. Section 4.2 is devoted to the first issue that traditional data management technology needs to address: changes. Section 4.3 addresses issues related to biological queries, whereas Section 4.4 focuses on query processing. Finally, data visualization is addressed in Section 4.5. Bioinformaticians should find a variety of illustrations of the reasons why BIS need innovative solutions.

4.1 LEGACY

Scientific data has been collected in electronic form for many years. While new data management approaches are designed to provide the basis for future homogeneous collection, integration, and analysis of scientific data, they also need to integrate existing large data repositories and a variety of applications developed to analyze them. Legacy data and tools may raise various difficulties for their integration that may affect the design of BIS.

4.1.1 Biological Data


Scientific data are disseminated in myriad different data sources across disparate laboratories, available in a wide variety of formats, annotated, and stored in flat files and relational or object-oriented databases. Access to heterogeneous biological data sources is mandatory to scientists. A single query may involve flat files (stored locally or remotely) such as GenBank [6] or Swiss-Prot [7], Web resources such as the Saccharomyces Genome Database [8], GeneCards [9], or the references data source PubMed [10]. A list of useful biological data sources is given in the Appendix. These sources are mostly textual and of restricted access facilities. Their structure varies from ASN.1 data exchange format to poorly structured hypertext markup language (HTML) and extensible markup language (XML) formats. This variety of repositories justifies the need to evoke data sources rather than databases. This chapter only refers to a database when the underlying system is a database management system (a relational database system, for example). A system based on flat files is not a database.

Unlike data hosted in a database system, scientific data is maintained by life scientists through user-friendly interfaces that offer great flexibility to add and revise data in these data sources. However, this flexibility often affects data quality dramatically. First, data sources maintained by a large community such as GenBank contain large quantities of data that need to be curated. This explains the numerous overlapping data sources sometimes found aiming to complete or correct data from existing sources. As importantly, data organization also suffers from this wide and flexible access: Data fields are often completed with different goals and objectives, fields are missing, and so on. This flexibility is typically provided when the underlying data representation is a formatted file with no types or constraints checking. Unfortunately this variety of data representation makes it difficult to use traditional approaches as explained in Section 4.2.3.

The quantity of data sources to be exploited by life scientists is overwhelming. Each year the number of publicly available data sources increases significantly: It rose 43% between 1999 and 2002 for the key molecular data sources [1, 2]. In addition to this proliferation of sources, the quantity of data contained in each data source is significantly large and also increasing. For example, as of January 1, 2001, GenBank [6] contained 11,101,066,288 bases in 10,106,023 sequences, and its growth continues to be exponential, doubling every 14 months [11]. While the number of distinct human genes appears to be smaller than expected, in the range of 30,000-40,000 [12, 13], the distinct human proteins in the proteome are expected to number in the millions due to the apparent frequency of alternative splicing, ribonucleic acid (RNA) editing, and post-translational modification [14, 15]. As of May 4, 2001, Swiss-Prot contained 95,674 entries, whereas PubMed contained more than 11 million citations. Managing these large data sets efficiently will be critical in the future. Issues of efficient query processing are addressed in Section 4.4.

Future collaborations between computer and life scientists may improve the collection and storage of data to facilitate the exploitation of new scientific data. However, it is likely that scientific data management technology will need to address issues related to the characteristics of the large existing data sets that constitute part of the legacy of bioinformatics.
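To give a feel for what the doubling time quoted above for GenBank means in practice, the following sketch extrapolates the January 2001 figure under the assumption that the 14-month doubling simply continues. The function name and the constants' names are illustrative, and the result is a naive extrapolation of the numbers in the text, not a forecast or a measurement.

```python
# Figures quoted in the text for GenBank as of January 1, 2001.
GENBANK_BASES_JAN_2001 = 11_101_066_288
DOUBLING_TIME_MONTHS = 14


def projected_bases(months_elapsed,
                    initial=GENBANK_BASES_JAN_2001,
                    doubling_time=DOUBLING_TIME_MONTHS):
    """Size after `months_elapsed` months of steady exponential growth.

    Uses N(t) = N0 * 2 ** (t / T), where T is the doubling time.
    Assumes the growth rate stays constant, which is an idealization.
    """
    return initial * 2 ** (months_elapsed / doubling_time)


# Under this assumption, five years of growth multiplies the volume
# by 2 ** (60 / 14), that is, by a factor of roughly 19.5.
five_year_factor = projected_bases(60) / GENBANK_BASES_JAN_2001
```

A system sized for the data volume at design time would, under this assumption, face roughly twenty times as much data five years later, which is why efficient query processing is treated as a design issue rather than a tuning detail.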

4.1.2 Biological Tools and Workflows


Scientific resources include a variety of tools that assist life scientists in searching, mining, and analyzing the proliferation of data. Biological tools include basic biosequence analyses such as FASTA, BLAST, Clustal, Mfold, Phylip, PAUP, CAP, and MEGA. A data management system not integrating these useful tools would offer little support to life scientists. Most of these tools can be used freely by loading their code onto a computer from a Web site. A list of 160 free applications supporting biomolecular biology is provided in Misener and Krawetz's Bioinformatics: Methods and Protocols [16]. The first problem is the various platforms used by life scientists: In 1998, 30-50% of biologists used Macintosh computers, 40-70% used a PC running any version of Microsoft Windows, and less than 10% used UNIX or LINUX [16]. Out of the 160 applications listed in Misener and Krawetz's book, 107 run on a PC, 88 on a Macintosh, and 42 on other systems such as UNIX. Although some applications (27 listed in the previously mentioned book) are made available for all computer systems, most of them only run on a single system. The need for integration of applications, despite the system for which they may be designed, motivated the idea of the grid as explained in Section 4.3.4.

Legacy tools also include a variety of hard-coded scripts in languages such as Perl or Python that implement specific queries, link data repositories, and perform a pre-defined sequence of data manipulation. Scripting languages were used extensively to build early bioinformatics tools. However, they do not offer expected flexibility for re-use and integration with other functions. Most legacy integration approaches were developed using workflows. Workflows are used in business applications to assess, analyze, model, define, and implement the core business processes of an organization (or other entity). A workflow approach automates the business procedures where documents, information, or tasks are passed between participants according to a defined set of rules to achieve, or contribute to, an overall business goal. In the context of scientific applications, a workflow approach may address overall collaborative issues among scientists, as well as the physical integration of scientific data and tools. The procedural support a workflow approach provides follows the query-driven design of scientific problems presented in the Introduction. In such an approach, the data integration problem follows step-by-step the single user's query execution, including all necessary "business rules" such as security and semantics. A presentation of workflows and their model is provided online by the Workflow Management Coalition [17].

The integration of these tools and query pipelines into a BIS poses problems that are beyond traditional database management as explained in Section 4.3.4.
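The step-by-step character of a workflow, where the output of one task is passed to the next according to a fixed order, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Workflow Management Coalition model: the `Step` and `Workflow` classes and the three toy sequence-processing steps are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One task in the workflow: a name and a function to apply."""
    name: str
    run: Callable[[object], object]


@dataclass
class Workflow:
    """Runs steps in a fixed order, passing each result to the next step."""
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def execute(self, data):
        for step in self.steps:
            data = step.run(data)
            self.log.append(step.name)  # record which steps were performed
        return data


# Three toy steps standing in for real tools.
def clean(sequences):
    return [s.strip().upper() for s in sequences]


def keep_long(sequences):
    return [s for s in sequences if len(s) >= 4]


def gc_content(sequences):
    return {s: (s.count("G") + s.count("C")) / len(s) for s in sequences}


pipeline = Workflow(
    [Step("clean", clean), Step("keep_long", keep_long), Step("gc_content", gc_content)]
)
result = pipeline.execute([" acgt ", "gg", "ggcc"])
```

Unlike a hard-coded script, the sequence of steps here is data: steps can be reordered, replaced, or reused in another pipeline without rewriting the code that executes them.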

4.2 A DOMAIN IN CONSTANT EVOLUTION


A BIS must be designed to handle a constantly changing domain while managing legacy data and technology. Traditional data management approaches are not suitable to address constant changes (see Section 4.2.1). Two problems are critical to address for scientific data management: changes in data representation (see Section 4.2.2) and data identification (see Section 4.2.4). The approach presented in Chapter 10 addresses specifically these problems with gene expression data.

4.2.1 Traditional Database Management and Changes

Traditional data management approaches assume a pre-defined, unchangeable system of data organization. Traditional database management systems are of three kinds: relational, object-relational, or object-oriented. Relational database systems represent data in relations (tables). Object-relational systems provide a more user-friendly data representation through classes, but they rely on an underlying relational representation. Object-oriented databases organize data through classes. For the sake of simplicity, only relational databases are considered in this section because most of the databases currently used by life scientists are relational, and similar remarks could be made for all traditional systems.

Data organization includes the relations and attributes that constitute a relational database schema. When the database schema is defined, it can be populated by data to create an instance of the schema, in other words, a database. Each row of a relation is called a tuple. When a database has been defined, transactions may be performed to update the data contained in the database. They consist of insertions (adding new tuples in relations), deletions (removing tuples from relations), and updates (transforming one or more components of tuples in relations). All traditional database systems are designed to support transactions on their data; however, they support few changes in the data organization. Changes in the data organization include renaming relations or attributes, removing or adding relations or attributes, merging or splitting relations or attributes, and so on. Some transformations such as renaming are rather simple, and others are complex. Traditional database systems are not designed to support complex schema transactions. Typically, a change in the data organization of a database is performed by defining a new schema and loading the data from the database to an instance of the new schema, thus creating a new database. Clearly this process is tedious and not acceptable when such changes have to be addressed often.
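The difference between transactions on data and a change in data organization can be sketched with Python's standard `sqlite3` module. SQLite is used here only because it ships with Python; the `gene` relation, its attributes, and its rows are hypothetical. The first block performs the three kinds of transactions described above. The second block splits one relation into two by defining a new schema and reloading the data, which is the tedious procedure the text refers to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: one relation with three attributes.
conn.execute("CREATE TABLE gene (symbol TEXT, organism TEXT, length INTEGER)")

# Transactions on the data: insertions, an update, and a deletion.
with conn:
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [("GENE_A", "yeast", 1200), ("GENE_B", "human", 850), ("GENE_C", "yeast", 300)],
    )
    conn.execute("UPDATE gene SET length = ? WHERE symbol = ?", (900, "GENE_B"))
    conn.execute("DELETE FROM gene WHERE symbol = ?", ("GENE_C",))

# Change in data organization: split `gene` into `organism` and `gene_v2`.
# This is done by defining a new schema and loading the existing data into it.
with conn:
    conn.execute(
        "CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE gene_v2 ("
        "symbol TEXT, "
        "organism_id INTEGER REFERENCES organism(organism_id), "
        "length INTEGER)"
    )
    conn.execute("INSERT INTO organism (name) SELECT DISTINCT organism FROM gene")
    conn.execute(
        "INSERT INTO gene_v2 "
        "SELECT g.symbol, o.organism_id, g.length "
        "FROM gene AS g JOIN organism AS o ON o.name = g.organism"
    )
    conn.execute("DROP TABLE gene")

rows = conn.execute(
    "SELECT g.symbol, o.name, g.length "
    "FROM gene_v2 AS g JOIN organism AS o USING (organism_id) "
    "ORDER BY g.symbol"
).fetchall()
```

Every query, script, and interface written against the old `gene` relation has to be revised after the second block, which is why frequent reorganization is costly even when the reload itself is automated.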
Another approach to the problem of restructuring is to use a view mechanism that offers a new schema to users as a virtual schema when the underlying database and schema have not changed. All user interfaces provide access to the data as they are defined in the view and no longer as they are defined in the database. This approach offers several advantages, including the possibility of providing customized views of databases. A view may be limited to part of the data for security reasons, for example. However, this approach is rather limited as the transactions available through the view may be restricted.
Another aspect of traditional database systems relies on a pre-defined identity. Objects stored in a relational database can be identified by a set of attributes that, together, characterize the object. For example, the three attributes (first name, middle name, and last name) can characterize a person. The set of characterizing attributes is called a primary key. The concept of primary key relies on a characterization of identity that will not change over time. No traditional database system is designed to address changes in identification, such as tracking objects that may have changed identity over time.
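The two limitations just described, restricted transactions through a view and an identity tied to a fixed primary key, can be illustrated with a minimal `sqlite3` sketch. The `person` relation, the `person_public` view, and the sample values are assumptions made for this example. SQLite is used only because it ships with Python; in SQLite a view without an `INSTEAD OF` trigger is read-only, and other systems may permit some updates through views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical relation whose primary key is the three name attributes.
conn.execute(
    "CREATE TABLE person ("
    " first_name TEXT, middle_name TEXT, last_name TEXT, salary REAL,"
    " PRIMARY KEY (first_name, middle_name, last_name))"
)
conn.execute("INSERT INTO person VALUES ('Ada', 'M', 'Smith', 100.0)")

# A view limited to part of the data (the salary is hidden).
conn.execute(
    "CREATE VIEW person_public AS SELECT first_name, last_name FROM person"
)
visible = conn.execute("SELECT * FROM person_public").fetchall()

# Transactions through the view are restricted: this insertion is rejected.
try:
    conn.execute("INSERT INTO person_public VALUES ('Bob', 'Jones')")
    view_is_writable = True
except sqlite3.OperationalError:
    view_is_writable = False

# A change of identity: the key itself is updated. Nothing in the database
# records that the old and new keys denote the same person.
conn.execute("UPDATE person SET last_name = 'Jones' WHERE last_name = 'Smith'")
old_key_hits = conn.execute(
    "SELECT COUNT(*) FROM person WHERE last_name = 'Smith'"
).fetchone()[0]

print(visible, view_is_writable, old_key_hits)
```

After the update, a lookup by the former key finds nothing, so any external reference that still uses the old name is silently broken.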


4.2.2 Data Fusion


Data fusion corresponds to the need to integrate information acquired from multiple sources (sensors, databases, information gathered by humans, etc.). The term was first used by the military to qualify events, activities, and movements to be correlated and analyzed as they occurred in time and space to determine the location, identity, and status of individual objects (equipment and units), assess the situation, qualitatively or quantitatively determine threats, and detect patterns in activity that would reveal intent or capability.
Scientific data may be collected through a variety of instruments and robots performing microarrays, mass spectrometry, flow cytometry, and other procedures. Each instrument needs to be calibrated, and the calibration parameters may affect the data significantly. Different instruments may be used to perform similar tasks and collect data to be integrated in a single data set for analysis. The analysis is performed over time upon data sets that differ in the context of their collection. The analysis must be tempered by parameters that directly affect the quality of the data. A similar problem is presented in Section 10.4.3 in the context of probe arrays and gene expression. A traditional database approach requires the complete collection of measurement data and all parameters to allow the expression of the complex queries that enable the analysis of the disparate data set. Should any information be missing (NULL in a table), the system ignores the corresponding data, an unacceptable situation for a life scientist. In addition, the use of any new instrument that requires the definition of new parameters may affect the data organization, as well as make the fusion process more complex. The situation is made even more complex by the constant evolution of the protocols. Their new specifications often change the overall data organization: Attributes may be added, split, merged, removed, or renamed. Traditional database systems' difficulty with these issues of data fusion explains the current use of Microsoft Excel spreadsheets and manual computation to perform the integration prior to analysis. The database system is typically used as a storage device.
Can a traditional database system be adjusted to handle these constant and complex changes in the data organization? It is unlikely. Indeed, all traditional approaches rely strongly on a pre-defined and stable data organization. A BIS should offer great flexibility in the data organization to meet the needs of life scientists. New approaches must be designed to enable scientific data fusion. A solution is to relax the constraint on the data representation, as presented in the following section.
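The remark about missing information can be made concrete with a small `sqlite3` sketch. The `reading` relation, the instrument names, and the `gain` calibration parameter are hypothetical. The sketch shows that a query normalizing each intensity by its calibration parameter silently drops the row whose parameter is NULL, because a comparison involving NULL is never true in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reading (sample TEXT, instrument TEXT, intensity REAL, gain REAL)"
)
conn.executemany(
    "INSERT INTO reading VALUES (?, ?, ?, ?)",
    [
        ("s1", "A", 10.0, 2.0),
        ("s2", "A", 12.0, 2.0),
        ("s3", "B", 30.0, None),  # calibration parameter was not recorded
    ],
)

total = conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]

# Normalized intensities: the row with a NULL gain yields NULL, the
# comparison is not true, and the row disappears from the result.
kept = conn.execute(
    "SELECT sample, intensity / gain FROM reading "
    "WHERE intensity / gain > 0 ORDER BY sample"
).fetchall()

# The missing data must be asked for explicitly to be noticed at all.
missing = conn.execute("SELECT sample FROM reading WHERE gain IS NULL").fetchall()

print(total, kept, missing)
```

No error or warning is raised for sample `s3`; unless the scientist thinks to query for NULL values, the measurement simply does not take part in the analysis.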

4.2.3 Fully Structured vs. Semi-Structured


Traditional database approaches are too structured: When the schema is defined, it is difficult to change it, and they do not support the integration of similar, but disparate, data sets. A solution to this need for adherence to a structure is offered by the semi-structured approach. In the semi-structured approach, the data organization allows changes such as new attributes and missing attributes. Semi-structured data is usually represented as an edge-labeled, rooted, directed graph [18-20]. Therefore, a system handling semi-structured data does not assume a given, pre-defined data representation: A new attribute name is a new labeled edge, a new attribute value is a new edge in the graph, and so on. Such a system should offer greater flexibility than traditional database systems. An example of a representation of semi-structured data is XML, the upcoming standard for data exchange on the Web designed by the World Wide Web Consortium (W3C). XML extends the basic tree-based data representation of the semi-structured model by ordering elements and providing various levels of representation such as XML Schema [21-23]. These additional characteristics make XML data representation significantly less flexible than the original semi-structured data model. Fully structured data representation, semi-structured data representation, and XML are presented in Data on the Web [24].
There are currently two categories of XML management systems: XML-enabled and native XML. The first group includes traditional database systems extended with an XML interface for collection and publication. However, the underlying representation is typically with tables. Examples of XML-enabled systems are Oracle9i¹ and SQL Server 2000.² These systems were mostly designed to handle Business-to-Business (B2B) and Business-to-Customer (B2C) business tasks on the Web. They have not yet proven useful in scientific contexts. Native XML systems such as Tamino,³ ToX,⁴ and Galax⁵ rely on a real semi-structured approach and should provide a flexibility that is interesting in the context of scientific data management.
Because of XML's promising characteristics, and because it is going to be the lingua franca for the Web, new development for BIS should take advantage of this new technology. A system such as KIND, presented in Chapter 12, already exploits XML format. However, the need for semantic data integration in addition to syntactic data integration (as illustrated in Section 4.2.5) limits the use of XML and its query language in favor of approaches such as description logics.
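The edge-labeled, rooted, directed graph used for semi-structured data can be sketched in a few lines of Python. The `Node` class and the `follow` helper are illustrative names, not part of any system cited here; the gene symbols and the OMIM number are those quoted in Section 4.2.4. The sketch shows two records under the same label carrying different attributes, with no schema declared beforehand.

```python
class Node:
    """An internal node; each outgoing edge carries a label."""

    def __init__(self):
        # List of (label, target); a target is a Node or an atomic value.
        self.edges = []

    def add(self, label, target):
        self.edges.append((label, target))
        return target


def follow(root, path):
    """Return every target reached from root along the given label path."""
    frontier = [root]
    for label in path:
        frontier = [
            target
            for node in frontier
            if isinstance(node, Node)
            for (edge_label, target) in node.edges
            if edge_label == label
        ]
    return frontier


root = Node()

gene1 = root.add("gene", Node())
gene1.add("symbol", "TP53")
gene1.add("alias", "P53")

gene2 = root.add("gene", Node())
gene2.add("symbol", "SIRT1")
gene2.add("omim_number", "604479")  # a new attribute is just a new labeled edge

symbols = follow(root, ["gene", "symbol"])
aliases = follow(root, ["gene", "alias"])
omim_numbers = follow(root, ["gene", "omim_number"])
print(symbols, aliases, omim_numbers)
```

A record lacking an attribute contributes nothing to a path query instead of requiring a NULL placeholder, which is the flexibility the semi-structured model is meant to provide.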

1. Oracle9i was developed by the Oracle Corporation (see http://www.oracle.com).
2. SQL Server 2000 is a product of the Microsoft Corporation (see http://www.microsoft.com).
3. Tamino XML server is a commercial XML management system from SoftwareAG.
4. ToX is an academic XML management system being developed at the University of Toronto.
5. Galax was developed at the Bell Laboratory of Lucent Technology (see http://db.bell-labs.com/galax).


4.2.4 Scientific Object Identity


Scientific objects change identification over time. Data stored in data sources can typically be accessed with knowledge of their identification or other unique characterization initially entered into the data bank. Usually, each object of interest has a name or an identifier that characterizes it. However, a major problem arises when a given scientific object, such as a gene, may possess as many identifiers as there are data sources that contain information about it. The challenge is to manage these semantic heterogeneities at data access, as the following example illustrates.
Gene names change over time. For example, the Human Gene Nomenclature Database (HUGO) [25] contains 13,594 active gene symbols, 9635 literature aliases, and 2739 withdrawn symbols. In HUGO, SIR2L1 (withdrawn) is a synonym to SIRT1 (the current approved HUGO symbol) and sir2-like 1. P53 is a withdrawn HUGO symbol and an alias for TP53 (current approved HUGO symbol).
When a HUGO name is removed, not all data sources containing the name are updated. Some information, such as the content of PubMed, will actually never be updated.
Table 4.1 illustrates the discrepancies found when querying biological data sources with equivalent (but withdrawn or approved) HUGO names in November 2001. The Genome DataBase (GDB) is the official central repository for genomic mapping data resulting from the Human Genome Initiative [26]. GenAtlas [27, 28] provides information relevant to the mapping of genes, diseases, and markers. Online Mendelian Inheritance in Man (OMIM) [29, 30] is a catalog of human genes and genetic disorders. GeneCards [9] is a data source of human genes, their products, and their involvement in diseases. LocusLink [31, 32] provides curated sequence and descriptive information about genetic loci.

HUGO name   GDB        GenAtlas   OMIM   GeneCards   LocusLink
TP53        1          33         52     22          13
P53         1 (same)   17         188    69          63
SIRT1       1          0          5      1           2
SIR2L1      0          0          1      1 (same)    2 (same)

TABLE 4.1 Number of entries retrieved with HUGO names from GDB, GenAtlas, OMIM, GeneCards, and LocusLink.

Querying GDB with TP53 or its alias P53 does not affect the result of the query. However, SIRT1 returns a single entry, whereas its alias does not return any entry. GenAtlas returns more entries for the approved symbol TP53 than for the withdrawn symbol P53. The question, then, is to determine if the entries corresponding to a withdrawn symbol are always contained in the set of entries returned for an approved symbol. This property does not hold in OMIM. OMIM returns many more entries for the withdrawn symbol P53 than for the approved symbol TP53. However, it shows the opposite behavior with SIRT1 and its alias. This demonstrates that even with the best understood and most commonly accepted characteristic of a gene (its identification), alternate identifier values need to be used when querying multiple data sources to get complete and consistent results. This problem requires significant domain expertise to resolve but is critical to the task of obtaining a successful, integrated biological information system.
The problem of gene identity is made more complex when the full name, alternative titles, and description of a gene are considered.
Depending on the number of data sources that describe the gene, it may have that many equivalent source identifiers. For example, SIRT1 is equivalent to the full name (from HUGO) of sirtuin (silent mating type information regulation 2, S. cerevisiae, homolog) 1. This is also its description in LocusLink, but it has the following alternative title in OMIM: SIR2, S. CEREVISIAE, HOMOLOG-LIKE 1. These varying qualifications can often be easily discerned by humans, but they prove to be very difficult when automated. Here, too, extensive domain expertise is needed to determine that these descriptions each represent the very same gene, SIRT1. Although this is difficult, it does not describe the entire problem. There are as many identifiers to SIRT1 as there are data sources describing the gene. For example, SIRT1 also corresponds to 604479 (OMIM number), AF083106 (GenBank accession number), and 9956524 (GDB ID). Even UniGene clusters may have corresponding aliases. For example, Hs.1846 (the UniGene cluster for P53) [33] is an alias for Hs.103997 (primary cluster for TP53).
Existing traditional approaches do not address the complex issues of scientific object identity. However, recent work on ontologies may provide solutions to these issues.

4.2.5 Concepts and Ontologies


An ontology is a collection of vocabulary words that define a community's understanding of a domain. Terms are labels for concepts, which reside in a lattice of relationships between concepts. There have been significant contributions to the specification of standards and ontologies for the life science community as detailed in Chapter 2. In addition, some BIS were designed to provide users access to data as close as possible to their understanding. The system [34] based on the Object Protocol Model (OPM), developed at Lawrence Berkeley Laboratory and later extended and maintained at Gene Logic, provides data organization through classes and relationships to the user (see Chapter 10). The most successful such approach is TAMBIS (see Chapter 7), which was developed to allow users to access and query their data sets through an ontology. Such approaches are friendly because they allow life scientists to visualize the data sets through their understanding of the overall organization (concepts and relationships) as opposed to an arbitrary and often complex database representation with tables or a long list of tags of a flat file.
A solution to the problem of capturing equivalent representations of objects consists in using concepts. For example, a concept gene can have a primary identity (its approved HUGO symbol) and equivalent representations (withdrawn HUGO names, aliases, etc.). There are many ways to construct these equivalence classes.
One way consists in collecting these multiple identities and materializing them within a new data source. This approach was partially completed in LENS, which was developed at the University of Pennsylvania, and GeneCards. This first approach captures the expertise of life scientists, and these data sources are usually well curated. To make this task scalable to all the scientific objects of interest, specific tools need to be developed to enhance and assist scientists in the task of managing the identity of scientific objects. While this expertise could be materialized within a new data source, it is critical that it is used by a BIS to alert the biologist when it recognizes alternate identifiers that could lead to incomplete or inconsistent results. This approach could use entity matching tools [35, 36] that capture similarities in retrieved objects and are appropriate for matching many of the functional attributes such as description or alternate titles.
Recent work in the context of the Semantic Web activity of the W3C may develop more advanced technology to provide users a sound ontology layer to integrate their underlying biological resources. However, these approaches do not yet provide a solution to the problem of capturing equivalent representations of scientific objects, as presented in Section 4.2.4.
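To give a feel for what matching on descriptions or alternate titles involves, the following sketch scores two titles by the overlap of their word sets (Jaccard similarity). This is a deliberately simple stand-in chosen for illustration, not the method of the entity matching tools cited above. The two SIRT1 titles are those quoted in Section 4.2.4; the contrast string and the function names are assumptions of this example.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of distinct tokens that two strings have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Full name from HUGO and alternative title from OMIM, as quoted in the text.
hugo_name = (
    "sirtuin (silent mating type information regulation 2, "
    "S. cerevisiae, homolog) 1"
)
omim_title = "SIR2, S. CEREVISIAE, HOMOLOG-LIKE 1"

# A contrast string that is not taken from the text.
unrelated = "tumor protein p53"

same_gene_score = jaccard(hugo_name, omim_title)
other_gene_score = jaccard(hugo_name, unrelated)
print(same_gene_score, other_gene_score)
```

The two SIRT1 titles share only four of their thirteen distinct tokens, so a score of this kind can rank candidate matches for a curator to review but is far from sufficient to decide equivalence automatically.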

4.3 BIOLOGICAL QUERIES

The design of a BIS strongly depends on how it is going to be used. Section 1.4 of Chapter 1 presents the successive design steps, and Chapter 3 illustrates various design requirements with use cases. Traditional database approaches assume that the relational algebra, or query languages such as the Structured Query Language (SQL) or the Object Query Language (OQL), enable users to express all their queries. Life science shows otherwise. Similar to geographical information systems that aim to let users express complex geometric, topological, or algebraic queries, BIS should enable scientists to express a variety of queries that go beyond the relational algebra. The functionalities required by scientists include sophisticated search mechanisms (see Section 4.3.1) and navigation (see Section 4.3.2), in addition to standard data manipulation. In traditional databases, the semantics of queries are usually bi-valued: true or false. Practice shows that scientists wish to access their data sets through different semantic layers and would benefit from the use of probabilistic or other logical methods to evaluate their queries, as explained in Section 4.3.3. Finally, the complexity of scientific use cases and the applications that support them may drive the design of a BIS to middleware as opposed to a traditional data-driven database approach.

4.3.1 Searching and Mining


Searching typically consists in retrieving entries similar to a given string of characters (phrase, keyword, wildcard, DNA sequence, etc.). In most cases, the data source contains textual documents, and when searching the data source, users expect to retrieve documents that are similar to a given phrase or keyword. Search engines, such as Glimpse,⁶ which is used to provide search capabilities to GeneCards, use an index to retrieve documents containing the keywords and a ranking system to display the retrieved entries in order. Searching against a sequence data source is performed by honed sequence similarity search engines such as FASTA [37], BLAST [38], and LASSAP [39]. To search sequences, the input string is a sequence, and the ranking of retrieved sequences is customized with a variety of parameters.
Data mining aims to capture patterns out of large data sets with various statistical algorithms. Data mining can be used to discover new knowledge about a data set or to validate a hypothesis.
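The index-plus-ranking arrangement described for keyword search can be sketched as an inverted index with a naive term-count score. This is an assumption-laden toy, not the algorithm used by Glimpse or any other engine named here; the documents, identifiers, and function names are invented for the example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to a counter of the documents that contain it."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word][doc_id] += 1
    return index


def search(index, query):
    """Return matching document ids, highest keyword count first."""
    scores = Counter()
    for word in tokenize(query):
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Sort by descending score, then by id so the order is deterministic.
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Invented documents standing in for textual entries of a data source.
docs = {
    "d1": "tumor suppressor gene",
    "d2": "gene expression of the tumor suppressor gene",
    "d3": "flow cytometry protocol",
}

index = build_index(docs)
ranked = search(index, "tumor gene")
print(ranked)
```

Real engines replace the raw count with more careful weighting, and sequence search tools such as those cited above score alignments instead of keyword matches, but the split between an index for retrieval and a score for ordering is the same.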
Mining Mining algorithms algorithms are are often often combined combined with with association rules, neural neural networks, association rules, networks, or or genetic genetic algorithms. algorithms. Unlike Unlike searching, searching, the the data mining approach approach is by a expressed through data mining is not not driven driven by a user's user's input input expressed through a a phrase phrase nor does it apply apply to to a a particular particular data nor does it data format. format. Mining Mining a a database database distinguishes distinguishes itself a database database by by the itself from from querying querying a the fact fact that that a a database database query query is is expressed expressed in in a language language such as SQL SQL and and therefore in a such as therefore only only captures captures information information organized organized in the In contrast, contrast, a mining tool tool may the schema. schema. In a mining may exploit exploit information information contained contained in in the the database that was was not not organized in the database that organized in the schema schema and and therefore therefore not not accessible accessible by by a a traditional database query. traditional database query. Most of the the query capabilities expected expected by by scientists Most of query capabilities scientists fall fall under under searching searching and and mining. confirmed by results of of a a survey survey of of biologists biologists in in academia academia mining. This This was was confirmed by the the results
engine was developed developed and is maintained 6. The Glimpse Glimpse search search engine maintained at the University University of Arizona Arizona and is available available at http://glimpse.cs.arizona.edu/. http.//glimpse.cs.arizona.edu/.

88

88

4 4

Issues ress While igning a ical IInformation nformation System Issues to to Add Address While Des Designing a Biolog Biological System
"::h <,":'7":':',,;;;';; ;:" < ; ' :':: ='::::= :;; ':;,:::;,'==

and 1 5 tasks and industry industry in in 2000 2000 [40], [40], where where 3 315 tasks and and queries queries were were collected collected from from the the answers to the following questions:
1 . What 1. What tasks tasks do do you you most most perform? perform?
2. What What tasks tasks do do you you commonly commonly perform, perform, that that should should be be easy, easy, but but you you feel feel are are

too difficult?
3. 3. What What questions questions do do you you commonly commonly ask ask of of information information sources sources and and analysis analysis

tools?

4. What What questions questions would would you you like like to to be be able able to to ask, ask, given given that that appropriate appropriate

sources sources and and tools tools existed, existed, that that may may not not currently currently exist? exist? Interestingly, 54% of Interestingly, 54% of the the collected collected tasks tasks could could be be organized organized into into three three categories: categories: (2) multiple 3) (1) similarity search, search, (2) multiple pattern pattern and and functional functional motif motif search, search, and and ((3) ( 1 ) similarity sequence f the f interest sequence retrieval. retrieval. Therefore, Therefore, more more than than half half o of the identified identified queries queries o of interest to biologists involve to biologists involve searching searching or or mining mining capabilities. capabilities. Traditional database systems provide provide SQL as a query language, based on Traditional the relational algebra composed of ), projection n ) , Cartesian prod the relational algebra composed of selection selection (a (or), projection ((Jr), Cartesian product x ), join join (J><l), uct ((x), (>~), union union ((u), and intersection intersection (n). (n). These These operators operators perform perform data data U ) , and manipulation ]. manipulation and and provide provide semantics semantics equivalent equivalent to to that that of of first first order order logic logic [41 [41]. I n addition, In addition, SQL SQL includes includes all all arithmetic arithmetic operations, operations, predicates predicates for for comparison comparison and and existential existential quantifiers, summary operations operations and string string matching, matching, universal universal and quantifiers, summary for maxlmin count/sum, and GROUP and HAVING clauses to to partition partition tata for max/min or or c o u n t / s u m , and GROUP BY and HAVING clauses bles by Commercial database database systems the query query capabilities capabilities bles by groups groups [42]. [42]. 
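
To make the relational operators listed above concrete, the following is a minimal sketch in Python that models a relation as a list of dictionaries. The function names, the toy relations, and the identifiers such as prot_1 are illustrative assumptions, not part of any database system discussed in this chapter.

```python
# Minimal sketch: relations are lists of dicts (toy, hypothetical data).

def select(rel, pred):
    """Selection (sigma): keep the rows that satisfy the predicate."""
    return [row for row in rel if pred(row)]

def project(rel, attrs):
    """Projection (pi): keep the named attributes and drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def product(r, s):
    """Cartesian product; attribute names are assumed to be disjoint."""
    return [{**a, **b} for a in r for b in s]

def join(r, s):
    """Natural join on the attribute names the two relations share."""
    shared = set(r[0]) & set(s[0]) if r and s else set()
    return [{**a, **b} for a in r for b in s
            if all(a[k] == b[k] for k in shared)]

def union(r, s):
    """Union of two relations with the same attributes."""
    out = list(r)
    out.extend(row for row in s if row not in out)
    return out

def intersection(r, s):
    """Rows present in both relations."""
    return [row for row in r if row in s]

genes = [
    {"gene": "gene_1", "organism": "human"},
    {"gene": "gene_2", "organism": "mouse"},
    {"gene": "gene_3", "organism": "human"},
]
proteins = [
    {"gene": "gene_1", "protein": "prot_1"},
    {"gene": "gene_3", "protein": "prot_2"},
]

# "Proteins of human genes": select, then join, then project.
human = select(genes, lambda row: row["organism"] == "human")
human_proteins = project(join(human, proteins), ["gene", "protein"])
```

Each operator takes relations and returns a relation, which is what allows them to be composed freely. None of them can express a similarity search or a motif search, which is the limitation the survey results point to.
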
Commercial database systems extended the query capabilities to a variety of functionalities, such as manipulation of complex datatypes (such as numeric, string, date, time, and interval), OLAP, and limited navigation. However, none of these capabilities can perform the complex tasks specified by biologists in the 2000 survey [40].

Other approaches provide search capabilities only, and while failing to support standard data manipulation, they are useful for handling large data sets. They are made available to life scientists as Web interfaces that provide textual search facilities, such as GeneCards [9], which uses the powerful Glimpse textual search engine [43], the Sequence Retrieval Service (SRS) [44], and the Entrez interface [45]. GeneCards provides textual search facilities for curated data warehoused in files. SRS, described in detail in Chapter 5, integrates data sources by indexing attributes. It enables queries composed of combinations of textual keywords on most attributes available at integrated databases. Entrez proposes an interesting approach to integrating resources through their similarities. It uses a variety of similarity search tools to index the data sources and facilitate their access through search queries. For instance, the neighbors of a sequence are its homologs, as identified by a similarity score using the BLAST algorithm [38]. On the other hand, the neighbors of a PubMed citation are the articles that use similar terms in their title and abstract [46].

These approaches are limited because they do not allow customized access to the sources. A user looking for PubMed references that have direct protein links will not be able to express the query through Entrez because the interface is designed to retrieve the protein linked from a given citation, not to retrieve all citations linked to a protein. This example, as well as others collected in a 1998 Access article by L. Wong [47], illustrates the weakness of these approaches. A real query language allows customization, whereas a fixed selection of capabilities significantly limits the range of queries biologists are able to ask.
Biologists use a variety of search and mining tools in their queries and employ traditional data manipulation operators to support them. In fact, they often wish to combine them all within a single query. For example, a typical query could start with searching PubMed and only retrieve the references that have direct protein links [47]. Most systems do not support this variety of functionalities yet. Systems such as Kleisli and DiscoveryLink, presented in Chapters 6 and 11, specifically address this issue.
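
The kind of combined query described above (a keyword search followed by a data manipulation step) can be sketched as follows. This is a toy illustration only: the citation records, the naive term-frequency ranking, and all names are assumptions, and it does not reproduce how Glimpse, Entrez, Kleisli, or DiscoveryLink work.

```python
from collections import Counter, defaultdict

# Hypothetical toy citations; "proteins" holds direct protein links.
CITATIONS = {
    1: {"text": "calcium channel structure in neurons", "proteins": ["prot_A"]},
    2: {"text": "calcium signaling review", "proteins": []},
    3: {"text": "sodium channel and calcium channel gating", "proteins": ["prot_B"]},
}

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, doc in docs.items():
        for term in doc["text"].lower().split():
            index[term][doc_id] += 1
    return index

def search(index, query):
    """Return doc ids ranked by summed term frequency (a deliberately naive score)."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked]

INDEX = build_index(CITATIONS)

# Step 1 (search): rank citations matching the keywords.
hits = search(INDEX, "calcium channel")

# Step 2 (selection): keep only citations that have direct protein links.
with_links = [doc_id for doc_id in hits if CITATIONS[doc_id]["proteins"]]
```

The point of the sketch is the composition: the ranked output of the search step feeds an ordinary selection, which a search-only interface cannot express.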

4.3.2

Browsing
Biologists aim to perform complex queries as described in Section 4.3.1, but they also need to browse and navigate the data sets. Systems such as OPM and TAMBIS (see Chapter 7) are designed to provide a user-friendly interface that allows queries through ontologies or object classes, as presented in Section 4.2.5. But they do not provide navigational capabilities that enable access to other scientific objects through a variety of hyperlinks. Web interfaces such as GeneCards offer a large variety of hyperlinks that enable users to navigate directly to other resources such as GenBank, PubMed, and European Molecular Biology Laboratory (EMBL) [34, 48, 76]. Entrez, the Web interface to PubMed, GenBank, and an increasing number of resources hosted at the National Center for Biotechnology Information (NCBI), offers the most sophisticated navigational capabilities. All 15 available resources (as of July 2002) are linked together. For example, a citation in PubMed is linked to related citations in PubMed via the Related Articles link, and to relevant sequences or proteins via the Nucleotide Link or the Protein Link, respectively. The links are completed by a variety of hyperlinks available in the display of retrieved entries. These navigational capabilities complement the query capabilities and assist biologists in fulfilling their needs.


The recent development of XML and its navigational capabilities make XPath [49], the language designed to handle navigational queries, and XQuery [50], its extension to traditional data management queries as well as to document queries, good candidates for query languages to manipulate scientific data. There are additional motivations for choosing XML technology to handle scientific data. XML is designed as the standard for data exchange on the Web, and life scientists publish and collect large amounts of data on the Web. In addition, the need for a flexible data representation already evoked the choice of XML in Section 4.2.3. Scientific data providers such as NCBI already offer data in XML format. Although clearly needed, the development of a navigational foreground for biological data raises complex issues of semantics, as will be presented in Section 4.5.2.
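
As a small illustration of navigational querying over XML, the sketch below uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath (and no XQuery). The document structure, element names, and identifiers are invented for the example and do not correspond to any NCBI format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: citations with typed links to other objects.
DOC = """
<citations>
  <citation pmid="101">
    <title>Calcium channel gating</title>
    <link type="protein" target="prot_A"/>
    <link type="nucleotide" target="nuc_7"/>
  </citation>
  <citation pmid="102">
    <title>Signaling review</title>
  </citation>
</citations>
"""

root = ET.fromstring(DOC)

# Navigate from citations down to their protein links.
protein_targets = [
    link.get("target")
    for link in root.findall("./citation/link[@type='protein']")
]

# Select the citations that have at least one link child.
linked_citations = [c.get("pmid") for c in root.findall("./citation[link]")]
```

The path expressions follow the document structure rather than a fixed schema, which is the flexibility the text attributes to XML-based query languages.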

4.3.3

Semantics of Queries


Traditional database approaches use bi-valued semantics: true or false. When a query is evaluated, should any data be missing, the output is NULL. Such semantics are not appropriate for many biological tasks. Indeed, biologists often attempt to collect data with exploring queries, despite missing information. An attribute NULL does not always mean that the value is null but rather that the information is not available yet or is available elsewhere. The rigid semantics of traditional databases may be frustrating for biologists who aim to express queries with different layers of semantics.

Knowledge-based approaches [51] may be used to provide more flexibility. In knowledge bases, reasoning about the possible courses of action replaces the typical database evaluation of a query. Knowledge bases rely on large amounts of expertise expressed through statements, rules, and their associated semantics. Extending BIS with knowledge-based reasoning provides users with customized semantics of queries.
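
Returning to the NULL behavior described at the start of this section, the following sketch uses Python's built-in sqlite3 module to show how a row with a missing value drops out of both a condition and its negation. The table, column names, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (name TEXT, function TEXT)")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("g1", "kinase"), ("g2", "transporter"), ("g3", None)],  # g3: not yet annotated
)

def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

is_kinase = names("SELECT name FROM gene WHERE function = 'kinase' ORDER BY name")
not_kinase = names("SELECT name FROM gene WHERE function <> 'kinase' ORDER BY name")
unknown = names("SELECT name FROM gene WHERE function IS NULL ORDER BY name")
```

Gene g3 is returned by neither of the first two queries, although a biologist would likely read its missing annotation as "not known yet" rather than as "not a kinase".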
BIS can be enhanced by the use of temporal logic, which assumes the world to be ordered by time intervals and allows users to reason about time (e.g., "Retrieve all symptoms that occurred before event A"), or fuzzy logic, which allows degrees of truth to be attached to statements. Therefore, a solution consists in providing users with a hybrid query language that allows them to express various dependency information, or lack thereof, between events and build a logical reasoning framework on top of such statements of probability. Such an approach has been evaluated for temporal databases [52] and object databases [53]. BIS also could benefit from approaches that would cover the need for addressing object identity as presented in Section 4.2.4, as well as semantic issues such as those to be addressed in Sections 4.4.2 and 4.5.2.
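
The two extensions mentioned above can be sketched in a few lines. The interval-based "before" relation and the min/max fuzzy operators below are common textbook choices used here as assumptions; they are not taken from the systems cited in [52] or [53], and the events and truth degrees are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    name: str
    start: int  # time units are arbitrary in this sketch
    end: int

def before(x, y):
    """Temporal relation: interval x ends strictly before interval y starts."""
    return x.end < y.start

def fuzzy_and(a, b):
    """Conjunction of two degrees of truth (min operator, one common choice)."""
    return min(a, b)

def fuzzy_or(a, b):
    """Disjunction of two degrees of truth (max operator)."""
    return max(a, b)

def fuzzy_not(a):
    """Negation of a degree of truth."""
    return 1.0 - a

event_a = Event("event A", 10, 12)
symptoms = [Event("fever", 1, 3), Event("rash", 8, 11), Event("cough", 13, 15)]

# "Retrieve all symptoms that occurred before event A"
earlier = [s.name for s in symptoms if before(s, event_a)]

# Hypothetical degrees of truth attached to two statements about a gene.
is_homolog, is_expressed = 0.8, 0.6
both = fuzzy_and(is_homolog, is_expressed)
either = fuzzy_or(is_homolog, is_expressed)
```

Here the rash overlaps event A and is therefore not reported as occurring before it; a different temporal relation (for example, "starts before") would give a different answer, which is why the choice of semantics has to be explicit.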


4.3.4

Tool-Driven vs. Data-Driven Integration


Most existing BIS are data-driven: They focus on the access and manipulation of data. But should a BIS really be data-driven? It is not that clear. A traditional database system does not provide any flexibility in the use of additional functionalities. The query language is fixed and does not change. Public or commercial platforms aim to offer integrated software; however, their approach does not provide the ability to integrate new software easily and freely as it becomes available or improves. Commercial integrated platforms also are expensive to use. For scientists with limited budgets, free software is often the only solution. Some systems provide APIs to use external programs, but the system is no longer the central query processing system; it only processes SQL queries, and an external program executes the whole request. The problem with this approach is that the system, which uses a database system as a component, no longer benefits from the database technology, including efficient query processing (as will be presented in Section 4.4).

Distributed object technology has been developed to cope with the heterogeneous and distributed computing environment that often forces information to be moved from one machine to another, disks to be cross-mounted so different programs can be run on multiple systems, and programs to be re-written in a different programming language to be compiled and executed on another architecture. The variety of scientific technology presented in Section 4.1.2 often generates significant waste of time and resources. Distributed object technology includes Common Object Request Broker Architecture (CORBA), Microsoft Distributed Component Object Model (DCOM), and Java Remote Method Invocation (RMI). This technology is tools-driven and favors a computational architecture that interoperates efficiently and robustly. Unlike traditional databases, it allows flexible access to computational resources with easy registration and removal of tools. For these reasons, many developers of BIS are currently using this technology. For the sake of efficiency, a new version of TAMBIS no longer uses CPL but provides a user-friendly ontology of biological data sources using CORBA clients to retrieve information from these sources [54] (see Chapter 5). CORBA appears to be suitable for creating wrappers via client code generation from interface definition language (IDL) definitions. The European Bioinformatics Institute (EBI) is leading the effort to make its data sources CORBA compliant [48, 55]. Unfortunately, most data providers do not agree with this effort. Concurrently with the CORBA effort, NCBI, the American institute, and EBI provide their data sources in XML format.
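
The "easy registration and removal of tools" that distinguishes a tool-driven architecture can be illustrated with a small in-process registry. This is only an analogy written in plain Python: it involves no CORBA, DCOM, or RMI machinery, and the class, method, and tool names are all assumptions made for the example.

```python
from typing import Callable, Dict, List

class ToolRegistry:
    """Toy registry in which tools can be added and removed at run time."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, tool: Callable[..., object]) -> None:
        self._tools[name] = tool

    def unregister(self, name: str) -> None:
        self._tools.pop(name, None)

    def available(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise KeyError(f"no tool registered under {name!r}")
        return self._tools[name](*args, **kwargs)

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

registry = ToolRegistry()
registry.register("gc_content", gc_content)
registry.register("reverse_complement", reverse_complement)

gc = registry.call("gc_content", "GGCA")
rc = registry.call("reverse_complement", "ATGC")

# A tool can be withdrawn without touching the rest of the system.
registry.unregister("gc_content")
remaining = registry.available()
```

In a real distributed object system the registry would be a naming or trading service and the tools would be remote objects described by IDL; the sketch only conveys that the set of tools is open, in contrast to a fixed query language.
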
A lot of interest has been given to grid architectures. A grid architecture aims to enable computing as well as data resources to be delivered and accessed seamlessly, transparently, and dynamically, when needed, on the Internet. The name grid was inspired by the electric power grid. A biologist should be able to plug into the grid like an appliance is plugged into an outlet and use resources available on the grid transparently. The grid is an approach to a new generation of BIS. Examples of grids include the Open Grid Services Architecture (OGSA) [56] and the open source problem-solving environment Cactus [57]. TeraGrid [58] and DataGrid [59] are international efforts to build grids. These tool-driven proposals do not yet solve the many problems of resource selection, query planning, optimization, and other semantics issues as will be presented in Sections 4.4 and 4.5.

4.4

QUERY PROCESSING


In the past, query processing often received less attention from designers of BIS. Indeed, BIS developers devoted most of their effort to meeting the needs for integration of data sets and applications and providing a user-friendly interface for scientists. However, as the data sets get larger, the applications more time-consuming, and the queries more complex, the specification for fast query processing becomes critical.

4.4.1

Biological Resources


A BIS must adequately capture and exploit the diverse, and often complex, query processing, or other computational capabilities of biological resources by specifying them in a catalog and using them at both query formulation and query evaluation. The W3C Semantic Web Activity [60, 61] aims to provide a meta-data layer to permit people and applications to share data on the Web. Recent efforts within the bioinformatics community address the use of OIL [62] to capture alternative representations of data to extend biomolecular ontologies [63]. Such efforts focusing on data representation of the contents of the sources must be extended to capture meta-data along several dimensions, including (1) the coverage of the information sources, (2) the capabilities, links, and statistical patterns, (3) the data delivery patterns of the resources, and (4) data representation and organization at the source.
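
One way to picture such a catalog is as a list of records, one per source, with a field for each meta-data dimension listed above. The sketch below is a hypothetical layout: the field names, the example sources, and the simple ranking rule are assumptions for illustration and do not describe an existing BIS catalog.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class SourceEntry:
    name: str
    coverage: float             # (1) estimated fraction of relevant entries held, 0..1
    capabilities: Set[str]      # (2) query types supported, e.g. {"keyword", "similarity"}
    first_block_seconds: float  # (3) delivery pattern: time to the first block of answers
    ranked_output: bool         # (3) delivery pattern: are answers returned best-first?
    data_format: str            # (4) representation and organization at the source

def relevant_sources(catalog: List[SourceEntry], needed: Set[str],
                     min_coverage: float = 0.0) -> List[SourceEntry]:
    """Keep sources that support every needed capability and meet the coverage
    threshold, ordered by coverage (highest first), then by response time."""
    candidates = [s for s in catalog
                  if needed <= s.capabilities and s.coverage >= min_coverage]
    return sorted(candidates, key=lambda s: (-s.coverage, s.first_block_seconds))

# Hypothetical catalog entries.
catalog = [
    SourceEntry("source_A", 0.90, {"keyword", "similarity"}, 2.0, True, "flat file"),
    SourceEntry("source_B", 0.75, {"keyword"}, 0.5, False, "XML"),
    SourceEntry("source_C", 0.90, {"similarity"}, 1.0, True, "relational"),
]

for_similarity = [s.name for s in relevant_sources(catalog, {"similarity"})]
for_keyword = [s.name for s in relevant_sources(catalog, {"keyword"}, min_coverage=0.8)]
```

The same records can be consulted when a query is formulated (which capabilities exist at all) and when it is evaluated (which source to contact first).
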
The coverage of of information information sources sources is is useful useful in in solving solving the the so-called so-called source source The coverage relevance problem, relevance problem, which which involves involves deciding deciding which which of of the the myriad myriad sources sources are are rel relevant evant for for the the user user and and to to evaluate evaluate the the submitted submitted query. query. Directions Directions to to characterize characterize and exploit coverage include local, closed world and exploit coverage of of information information sources sources include local, closed world assump assumptions, tions, which which state state that that the the source source is is complete complete for for a a specific specific part part of of the the database database

4.4 Quer.yProce..~..ssing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4


~ . . , ~ , ~ , ~ , ~ , ~ , , , ~ / ~ , ~ ~ . . . ~ ~ , : ~ , ~ . . ~ . ~ , ~ . ~ , ~ , ~ ~ , ~ , ~ , ~ . ~ . ~ , ~ ~ , ~ ~ ~ : ~ ~ ~ . 9 ... . . . . , ~

93 93

[64, 65]; quantifications of coverage (e.g., the database contains at least 90% of the sequences), or intersource overlaps (e.g., the EMBL Nucleotide Sequence Database has a 75% likely overlap with DDBJ for sequences annotated with "calcium channel") [66-68]. Characterizing coverage enables the exploitation of coverage positioning of data sources from complement to partial or complete overlap (mirror sites).

Source capabilities capture the types of queries supported by the sources, the access pattern limitations, the ability to handle limited disjunction, and so on. Although previous research has addressed capabilities [69-74], it has not addressed the diverse and complex capabilities of biological sources. Recent work aims to identify the properties of sophisticated source capabilities including text search engines, similarity sequence search engines such as BLAST, and multiple sequence alignment tools such as Clustal [75, 76]. Their characteristics are significantly more complex than the capabilities addressed up to now, and their use is dramatically costly in terms of processing time. In addition, many of the tools are closely coupled to the underlying source, which requires the simultaneous identification of capabilities and coverage.

Statistical patterns include the description of information clusters and the selectivities of all or some of the data access mechanisms and capabilities. The use of simple statistical patterns has been studied [65, 66, 77]. However, a BIS should exploit statistical patterns of real data sources that are large, complex, and constantly evolving (as opposed to their simplified simulations).

Delivery patterns include the response time, that is, the units of time needed to receive the first block of answers, the size of these blocks, and so on. Delivery patterns may affect the query evaluation process significantly. Depending on the availability of the proper indices, a source may either return answers in decreasing order of matching (from best to worst) or in an arbitrary (unordered) manner. Other delivery profiles include whether information can be provided in a sorted manner for certain attributes or not. These types of profiles will be essential in identifying sources to get the first, best answers. This is useful when a user expects to get the answers to a query sorted in a pre-defined relevance order (the most relevant answers are returned first). Delivery patterns can also be exploited to provide users with a faster, relevant, but maybe incomplete answer to a query. A BIS should exploit delivery patterns in conjunction with the actual capabilities supported by the sources.

The access and exploitation of the previously mentioned meta-knowledge of biological resources offers several advantages. First, it enables the comparison of diverse ways to evaluate a query, as explained in the next section. Further, it can characterize the most efficient way to evaluate a query, as will be presented in Section 4.4.3.
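The four kinds of meta-knowledge described above (coverage, capabilities, statistical patterns, and delivery patterns) could be recorded per source in a small profile structure. The following Python sketch is only an illustration under stated assumptions: the class `SourceProfile`, its field names, the helper `sources_for_first_best_answers`, and all numeric values are hypothetical and do not come from any system discussed in this book.

```python
from dataclasses import dataclass, field


@dataclass
class SourceProfile:
    """Hypothetical meta-data record describing one data source."""
    name: str
    coverage: float                                   # fraction of the domain held, e.g. 0.90
    capabilities: set = field(default_factory=set)    # supported query types
    selectivity: dict = field(default_factory=dict)   # capability -> fraction of entries returned
    first_block_seconds: float = 0.0                  # delivery: time to the first block of answers
    block_size: int = 0                               # delivery: answers per block
    ranked: bool = False                              # delivery: are answers returned best-first?


def sources_for_first_best_answers(profiles, capability):
    """Return sources that support `capability` and rank their answers,
    ordered by how quickly they deliver the first block."""
    candidates = [p for p in profiles if capability in p.capabilities and p.ranked]
    return sorted(candidates, key=lambda p: p.first_block_seconds)


# Invented example values, for illustration only.
profiles = [
    SourceProfile("source_a", 0.90, {"keyword_search"}, {"keyword_search": 0.02}, 4.0, 50, True),
    SourceProfile("source_b", 0.75, {"keyword_search", "similarity_search"}, {}, 1.5, 20, True),
    SourceProfile("source_c", 0.60, {"keyword_search"}, {}, 0.5, 100, False),
]
best_first = [p.name for p in sources_for_first_best_answers(profiles, "keyword_search")]
print(best_first)
```

In this toy setting, `source_c` answers fastest but is excluded because it does not return ranked answers, which mirrors the point that delivery patterns must be considered together with the capabilities a source actually supports.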


4.4.2 Query Planning


Query planning consists in considering the many potential combinations of accesses to evaluate a query. Each combination is a query evaluation plan. Consider the query (Q) defined as follows:

(Q) "Return accession numbers and definitions of GenBank EST sequences that are similar (60% identical over 50AA) to 'calcium channel' sequences in Swiss-Prot that have references published since 1995 and mention 'brain.'" [78]

There exist many plans to evaluate the query. One possible plan for this query is illustrated in Figure 4.1 and described as follows: (1) access PubMed and retrieve all references published since 1995 that mention brain; (2) extract from all these references the Swiss-Prot identifiers; (3) obtain the corresponding sequences from Swiss-Prot whose function is calcium channel; and (4) execute a BLAST search using a wrapped BLAST application to retrieve similar sequences from GenBank (gbest sequences).

Figure 4.2 presents an alternative approach that first accesses Swiss-Prot and retrieves sequences whose function is calcium channel. In parallel, it retrieves the citations from PubMed that mention brain and are published since 1995 and extracts sequences from them. Then it determines which sequences are in common. Finally, it executes a BLAST search to retrieve similar sequences from GenBank (gbest sequences).

Scientific resources overlap significantly.
The variety of capabilities, as well as the coverages and statistical patterns presented in Section 4.4.1, offer many alternative evaluation plans for a query. The number of evaluation plans is exponential to the size of similar resources. Therefore not all plans should be evaluated to answer a query. To select the plan to evaluate a given query, first the semantics of the plan should be captured accurately. Indeed, two plans may be similar and yet not semantically equivalent. For example, suppose a user is interested in retrieving the sequences relevant to the article entitled "Suppression of Apoptosis in Mammalian Cells by NAIP and a Related Family of IAP Genes" published in Nature and referenced by 8552191 in PubMed. A first plan is to extract the GenBank identifiers explicitly provided in the MEDLINE format of the reference. A second plan consists in using the capability Nucleotide Link, provided at NCBI.
The two plans are not semantically equivalent because the first plan returns four GenBank identifiers whereas the Nucleotide Link returns eight GenBank identifiers (as of August 2001). Verifying whether two plans are semantically equivalent, that is, whether the answers returned from the two plans are identical, is non-trivial and depends on the meta-data of the particular resources used in each plan. This issue is closely related to navigation over linked resources, as will be presented in Section 4.5.2. In addition, two semantically equivalent plans may differ dramatically in terms of efficiency, as is explained in the next section.

Figure 4.1: First plan for evaluating query (Q) [75]. [Operator tree; legible node labels: (AccNo, Def); Dep Join on Sequence; Internal (Ca channel); External BLAST (Sequence, 60%, 50AA); Dep Join on Swiss-ProtId; External Swiss-Prot (Swiss-ProtId); External PubMed (brain, 1995).]
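The equivalence check discussed in this section can be illustrated with a toy comparison of the two ways of reaching GenBank identifiers from a PubMed citation. This is a minimal sketch under stated assumptions: the function names and the record layout are hypothetical, the identifiers `ID1` to `ID8` are placeholders rather than real accession numbers, and the overlap between the two answer sets is invented. Only the counts (four versus eight) come from the text.

```python
def plan_medline_extraction(citation):
    """Plan 1: identifiers listed explicitly in the citation's MEDLINE record."""
    return set(citation["medline_genbank_ids"])


def plan_nucleotide_link(citation):
    """Plan 2: identifiers reached by following the link capability."""
    return set(citation["linked_genbank_ids"])


def same_answers(plan_a, plan_b, citation):
    """True when both plans return identical answers for this one citation."""
    return plan_a(citation) == plan_b(citation)


# Placeholder record: only the sizes of the two lists reflect the text.
citation = {
    "pmid": "8552191",
    "medline_genbank_ids": ["ID1", "ID2", "ID3", "ID4"],
    "linked_genbank_ids": ["ID1", "ID2", "ID3", "ID4", "ID5", "ID6", "ID7", "ID8"],
}

equivalent_here = same_answers(plan_medline_extraction, plan_nucleotide_link, citation)
print(equivalent_here)
```

A single disagreeing instance such as this one is enough to show that two plans are not semantically equivalent. The converse does not hold: agreement on any number of sample citations does not establish equivalence, which is why the text notes that the check depends on meta-data about the resources themselves.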

4.4.3 Query Optimization


Query optimization [79, 80] is the science and the art of applying equivalence rules to rewrite the tree of operators evoked in a query and produce an optimal plan. A plan is optimal if it returns the answer in the least time or using the least space.
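As an illustration of what an equivalence rule does, the sketch below applies one classical rule, pushing a selection below a join, to two tiny in-memory relations. It is a toy under stated assumptions: the helper names `join` and `select`, the attribute names, and the rows are all hypothetical, and the rule shown is a textbook relational rewrite rather than one attributed to any system in this book.

```python
def join(left, right, key):
    """Nested-loop equi-join of two lists of dicts on the attribute `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def select(rows, predicate):
    """Keep only the rows that satisfy `predicate`."""
    return [row for row in rows if predicate(row)]


# Invented rows standing in for citations and protein entries.
citations = [
    {"pmid": 1, "sp_id": "P1"},
    {"pmid": 2, "sp_id": "P2"},
    {"pmid": 3, "sp_id": "P1"},
]
proteins = [
    {"sp_id": "P1", "function": "calcium channel"},
    {"sp_id": "P2", "function": "kinase"},
]


def is_calcium_channel(row):
    return row["function"] == "calcium channel"


# Original tree: select after join. The join produces 3 intermediate rows.
joined_first = join(citations, proteins, "sp_id")
plan_a = select(joined_first, is_calcium_channel)

# Rewritten tree: select before join. The join produces only 2 rows.
filtered_first = select(proteins, is_calcium_channel)
plan_b = join(citations, filtered_first, "sp_id")

print(plan_a == plan_b, len(joined_first), len(plan_b))
```

Both trees return the same rows, but the rewritten one builds a smaller intermediate result. The rule is valid here because the predicate refers only to attributes of `proteins`.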

Figure 4.2: Second plan for evaluating query (Q) [75]. [Operator tree; legible node labels: (AccNo, Def); Dep Join on Sequence; External BLAST (60%, 50AA, Sequence); Hash Join on Swiss-ProtId; External PubMed (brain, 1995); External Swiss-Prot (Ca channel).]

There are well known syntactic, logical, and semantic equivalence rules used during optimization [79]. These rules can be used to select an optimal plan among semantically equivalent plans by associating a cost with each plan and selecting the lowest overall cost. The cost associated with each plan is generated using accurate metrics such as the cardinality or the number of result tuples in the output of each operator, the cost of accessing a source and obtaining results from that source, and so on. One must also have a cost formula that can calculate the processing cost for each implementation of each operator. The overall cost is typically defined as the total time needed to evaluate the query and obtain all of the answers.
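The selection of the lowest-cost plan among semantically equivalent plans can be sketched as follows. This is a hedged illustration only: the `Step` structure, the cost formula (remote calls weighted by a per-source access cost plus a per-tuple processing cost), the constants in `ACCESS_COST` and `PER_TUPLE_COST`, and every number in the two example plans are assumptions made for the example, not measurements or formulas from the systems discussed here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    source: str          # which source the operator accesses
    calls: int           # estimated number of remote requests
    result_tuples: int   # estimated cardinality of the operator's output


# Assumed cost per remote call (arbitrary time units) and per processed tuple.
ACCESS_COST = {"PubMed": 1.0, "Swiss-Prot": 1.0, "BLAST": 30.0}
PER_TUPLE_COST = 0.001


def plan_cost(steps):
    """Total estimated time: access cost of every call plus tuple processing."""
    return sum(
        step.calls * ACCESS_COST[step.source] + step.result_tuples * PER_TUPLE_COST
        for step in steps
    )


def cheapest(plans):
    """Name of the plan with the lowest estimated overall cost."""
    return min(plans, key=lambda name: plan_cost(plans[name]))


# Two hypothetical plans assumed to be semantically equivalent.
plans = {
    "A": [Step("PubMed", 1, 1000), Step("Swiss-Prot", 1000, 20), Step("BLAST", 20, 200)],
    "B": [Step("Swiss-Prot", 1, 50), Step("PubMed", 50, 20), Step("BLAST", 20, 200)],
}
best = cheapest(plans)
print(best, round(plan_cost(plans["A"]), 2), round(plan_cost(plans["B"]), 2))
```

With these invented estimates, plan B is cheaper because it issues far fewer remote calls before the BLAST step. A real optimizer would derive the estimates from the source meta-data described in Section 4.4.1 rather than from constants.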


The characterization of an optimal, low-cost plan is a difficult task. The complexity of producing an optimal, low-cost plan for a relational query is NP-complete [79-81]. However, many efforts have produced reasonable heuristics to solve this problem. Both dynamic programming and randomized optimization based on simulated annealing provide good solutions [82-84].

A BIS could be improved significantly by exploiting the traditional database technology for optimization extended to capture the complex metrics presented in Section 4.4.1. Many of the systems presented in this book address optimization at different levels. K2 (see Chapter 8, Section 8.1) uses rewriting rules and a cost model. P/FDM (see Chapter 9) combines traditional optimization strategies, such as query rewriting and selection of the best execution plan, with a query-shipping approach. DiscoveryLink (see Chapter 11) performs two types of optimization: query rewriting followed by a cost-based optimization plan.
KIND (see Chapter 12) addresses the use of domain knowledge as executable meta-data. The knowledge of biological resources can be used to identify the best plan for query (Q) defined in Section 4.4.2, as illustrated in the following.

The two possible plans illustrated in Figures 4.1 and 4.2 do not have the same cost. Evaluation costs depend on factors including the number of accesses to each data source, the size (cardinality) of each relation or data source involved in the query, the number of results returned or the selectivity of the query, the number of queries that are submitted to the sources, and the order of accessing sources.

Each access to a data source retrieves many documents that need to be parsed. Each object returned may generate further accesses to (other) sources. Web accesses are costly and should be as limited as possible. A plan that limits the number of accesses is likely to have a lower cost. Early selection is likely to limit the number of accesses. For example, the call to PubMed in the plan illustrated in Figure 4.1 retrieves 81,840 citations, whereas the call to GenBank in the plan in Figure 4.2 retrieves 1,616 sequences. (Note that the statistics and results cited in this paper were gathered between April 2001 and April 2002 and may no longer be up to date.) If each of the retrieved documents (from PubMed or GenBank) generated an additional access to the second source, clearly the second plan has the potential to be much less expensive when compared to the first plan.

The size of the data sources involved in the query may also affect the cost of the evaluation plan. As of May 4, 2001, Swiss-Prot contained 95,674 entries whereas PubMed contained more than 11 million citations; these are the values of cardinality for the corresponding relations. A query submitted to PubMed (as used in the first plan) retrieves 727,545 references that mention brain, whereas it retrieves 206,317 references that mention brain and were published since 1995. This is the selectivity of the query. In contrast, the query submitted to Swiss-Prot in the second plan returns 126 proteins annotated with calcium channel.
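The figures quoted above can be turned into selectivities, expressed as the fraction of a source's entries that a query returns. The short sketch below does only that arithmetic. The function name `selectivity` is hypothetical, and because the text states that PubMed held "more than 11 million" citations, using exactly 11,000,000 as the cardinality is an assumption that makes the PubMed value an upper bound.

```python
def selectivity(matching, cardinality):
    """Fraction of a source's entries returned by a query."""
    return matching / cardinality


PUBMED_CARDINALITY = 11_000_000    # assumed; the text says "more than 11 million"
SWISSPROT_CARDINALITY = 95_674     # Swiss-Prot entries as of May 4, 2001

# PubMed: references that mention brain and were published since 1995.
pubmed_selectivity = selectivity(206_317, PUBMED_CARDINALITY)

# Swiss-Prot: proteins annotated with calcium channel.
swissprot_selectivity = selectivity(126, SWISSPROT_CARDINALITY)

# Effect of the date restriction alone on the "brain" result set.
date_filter_fraction = selectivity(206_317, 727_545)

print(f"PubMed: {pubmed_selectivity:.4f}")        # about 0.0188
print(f"Swiss-Prot: {swissprot_selectivity:.4f}")  # about 0.0013
print(f"Date filter: {date_filter_fraction:.3f}")  # about 0.284
```

Under these assumptions the Swiss-Prot sub-query is roughly an order of magnitude more selective than the PubMed sub-query, and it returns three orders of magnitude fewer results in absolute terms (126 versus 206,317), which is why evaluating it early is attractive.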


In addition to the previously mentioned characteristics of the resources, the order of accessing sources and the use of different capabilities of sources also affect the total cost of the plan. The first plan accesses PubMed and extracts values for identifiers of records in Swiss-Prot from the results. It then passes these values to the query on Swiss-Prot via the join operator. To pass each value, the plan may have to send multiple calls to the Swiss-Prot source, one for each value, and this can be expensive. However, by passing these values of identifiers to Swiss-Prot, the Swiss-Prot source has the potential to constrain the query, and this could reduce the number of results returned from Swiss-Prot. On the other hand, the second plan submits queries in parallel to both PubMed and Swiss-Prot. It does not pass values of identifiers of Swiss-Prot records to Swiss-Prot; consequently, more results may be returned from Swiss-Prot.
The results from both PubMed and Swiss-Prot have to be processed (joined) locally, and this could be computationally expensive. Recall that for this plan, 206,317 PubMed references and 126 proteins from Swiss-Prot are processed locally. However, the advantage is that a single query has been submitted to Swiss-Prot in the second plan. Also, both sources are accessed in parallel.

Although it has not been described previously, there is a third plan that should be considered for this query. This plan would first retrieve those proteins annotated with calcium channel from Swiss-Prot and extract MEDLINE identifiers from these records. It would then pass these identifiers to PubMed and restrict the results to those matching the keyword brain. In this particular case, this third plan has the potential to be the least costly. It submits one sub-query to Swiss-Prot, and it will not download 206,317 PubMed references.
Finally, it will not join 206,317 PubMed references and 126 proteins from Swiss-Prot locally.

Optimization has an immediate impact on the overall performance of the system. The consequences of the inefficiency of a system in executing users' queries may affect the satisfaction of users as well as the capability of the system to return any output to the user. These issues are presented in Chapter 13.
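The trade-offs among the three plans for query (Q) discussed in this section can be tabulated with a rough sketch. The two constants come from the text; everything else is an assumption made for illustration: the function names, the simplification that a dependent join issues one remote call per passed identifier, and the two parameter values (5,000 Swiss-Prot identifiers found in PubMed results, 600 MEDLINE identifiers found in Swiss-Prot records), which are invented because the text does not report them.

```python
PUBMED_HITS = 206_317     # references mentioning brain, published since 1995 (from the text)
SWISSPROT_HITS = 126      # proteins annotated with calcium channel (from the text)


def plan1(swissprot_ids_in_citations):
    """PubMed first, then one Swiss-Prot call per extracted identifier (assumed)."""
    return {
        "remote_calls": 1 + swissprot_ids_in_citations,
        "pubmed_refs_downloaded": PUBMED_HITS,
        "local_join_input": 0,
    }


def plan2():
    """Both sources queried in parallel, results joined locally."""
    return {
        "remote_calls": 2,
        "pubmed_refs_downloaded": PUBMED_HITS,
        "local_join_input": PUBMED_HITS + SWISSPROT_HITS,
    }


def plan3(medline_ids_in_proteins):
    """Swiss-Prot first, then one PubMed call per MEDLINE identifier (assumed)."""
    return {
        "remote_calls": 1 + medline_ids_in_proteins,
        "pubmed_refs_downloaded": medline_ids_in_proteins,  # upper bound before the brain filter
        "local_join_input": 0,
    }


# Invented parameter values, for illustration only.
summary = {"plan 1": plan1(5_000), "plan 2": plan2(), "plan 3": plan3(600)}
for name, figures in summary.items():
    print(name, figures)
```

Under these assumed parameters, the third plan avoids both the large PubMed download and the large local join, at the price of more remote calls than the second plan. Whether it is actually cheapest depends on the real number of MEDLINE identifiers and on whether PubMed accepts them in batches, neither of which is stated in the text.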

4.5 VISUALIZATION


An important issue when designing a BIS is visualization. Scientific data are available in a variety of media, and life scientists expect to access all these data sets by browsing through correspondences of interest, regardless of the medium or the resource used. The ability to combine and visualize data is critical to scientific discovery. For example, KIND, presented in Chapter 12, provides several visual interfaces to allow users to access and annotate the data. For example, the spatial annotation tool displays 2D maps of brain slices, while another interface shows the UMLS concept space.

4.5.1 Multimedia Data


Scientific data are multimedia; therefore, a BIS should be designed to manage images, pathways, maps, 3D structures, and so on regardless of their various formats (e.g., raster, bitmap, GIF, TIFF, PCX). An example of the variety of data formats and media generated within a single application is illustrated in Chapter 10. Managing multimedia data is known to be a difficult task. A multimedia management system must provide uniform access transparent to the medium or format.

Designing a multimedia BIS raises new challenges because of the complexity and variety of scientific queries. The querying process is an intrinsic part of scientific discovery. A BIS user's interface should enable scientists to visualize the data in an intuitive way and access and query through this representation. Not only do scientists need to retrieve data in different media (e.g., images), but they also need the ability to browse the data with maps, pathways, and hypertext. This means a BIS needs to express a variety of relationships among scientific objects. The difficulties mentioned previously regarding the identification of scientific objects (see Section 4.2.4) are dramatically increased by the need to capture the hierarchy of relationships. Scientific objects such as genes, proteins, and sequences can be seen as classes in an entity-relationship (ER) model; and a map can be seen as the visualization of a complex ER diagram composed of many classes, isa relationships, relationships, and attributes. Each class can be populated by data collected from different data sources, and the relationships can correspond to different source capabilities. For example, two classes, gene and publication, can be respectively populated with data from GeneCards and PubMed.
The relationship from the class gene to the class publication, expressing the publications in which the gene was published, can be implemented by capturing the capability available at GeneCards that lists all publications associated with a gene and provides their PubMed identifiers. The integration data schema is very complex because data and relationships must be integrated at different levels of the nested hierarchy. Geographical information systems address similar issues by representing maps at different granularities, encompassing a variety of information.

Many systems have been developed to manage geographical and spatial data, medical data, and multimedia data. Refer to Spatial Databases: With Applications to GIS [85], Neural Networks and Artificial Intelligence for Biomedical Engineering [86], and Principles of Multimedia Database Systems [87] for more information. However, very little has been done to develop a system to integrate scientific multimedia systems seamlessly. The following section partially addresses the problem by focusing on relationships between scientific objects.
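The idea of classes populated from different sources, with a relationship implemented by a source capability, can be sketched in a few lines. The example is purely illustrative: the classes `Gene`, `Publication`, and `Relationship`, the function `list_publications_for_gene`, the in-memory dictionary that stands in for a source, and the gene symbol and identifiers are all hypothetical placeholders and do not reflect the actual interface or content of GeneCards or PubMed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    symbol: str


@dataclass(frozen=True)
class Publication:
    pubmed_id: str


class Relationship:
    """A named link between two classes, backed by a source capability."""

    def __init__(self, name, source_class, target_class, capability):
        self.name = name
        self.source_class = source_class
        self.target_class = target_class
        self.capability = capability  # callable: source instance -> target instances

    def follow(self, instance):
        if not isinstance(instance, self.source_class):
            raise TypeError(f"{self.name} starts at {self.source_class.__name__}")
        return self.capability(instance)


# Stand-in for a gene-centric source: gene symbol -> PubMed identifiers (placeholders).
MOCK_GENE_SOURCE = {"GENE_X": ["PMID_1", "PMID_2"]}


def list_publications_for_gene(gene):
    """Hypothetical capability: publications associated with a gene."""
    return [Publication(pid) for pid in MOCK_GENE_SOURCE.get(gene.symbol, [])]


published_in = Relationship("published_in", Gene, Publication, list_publications_for_gene)
publications = published_in.follow(Gene("GENE_X"))
print(publications)
```

Keeping the capability as a pluggable callable is one possible design choice: the same relationship could then be backed by a different source, which is exactly the situation in which two implementations of one link may return different answers.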

4.5.2 Browsing Scientific Objects


Scientific entities are related to each other. A gene comprises one or more sequences. A protein is the result of a transcription of deoxyribonucleic acid (DNA) into RNA followed by a translation. Sequences, genes, and proteins are related to reference publications. These relationships are often represented by links (and hyperlinks). For example, there is a relationship between an instance of a gene and instances of the set of sequences that comprise the gene.

The attributes describing an entity, the relationships associating the entity with other entities, and most importantly, the semantics of the relationships, correspond to the complete functional characterization of an entity. Such a characterization, from multiple sources and representing multiple points of view, typically introduces discrepancies.
Examples of of such such discrepancies discrepancies include include dissimilar dissimilar concepts concepts (GenBank is sequence-centric GeneCards is (GenBank is sequence-centric whereas whereas GeneCards is gene-centric), gene-centric), dissimilar dissimilar at attribute tribute names names (the (the primary primary GeneCards GeneCards site site has has an an attribute attribute protein p r o t e i n whereas whereas a a mirror represents represents the same information information as mirror the same as an an attribute attribute product), p r o d u c t ) , and and dissimilar dissimilar values values or or properties properties (the (the gene gene TP53 is is linked linked to to a a single single citation citation in in the the data data source source HUGO, 35 35 citations GOB, and HUGO, citations in in the the data data source source GDB, and two two citations citations in in the the data data source source GeneCards). GeneCards). A BIS A BIS integrating integrating multiple multiple data data sources sources should should allow allow life life scientists scientists to to browse browse the data data over over the the relationships between scientific scientific objects. objects. the the links links representing representing the relationships between A path is a sequence sequence of of classes, classes, starting starting and and ending ending at at a a class class and and intertwined intertwined A path is a with links. Two paths with identical starting with links. Two paths with identical starting and and ending ending classes classes may may be be equivalent equivalent if if they they have have the the same same semantics. semantics. The The resolution resolution of of the the equivalence equivalence of of paths paths is is also critical discussed in also critical to to developing developing efficient efficient systems systems as as discussed in Section Section 4.4.3 4.4.3.. Semantic Semantic equivalence problem, as example. 
Consider equivalence is is a a difficult difficult problem, as illustrated illustrated in in the the following following example. Consider a physically a link link from from PubMed PubMed citations citations to to sequences sequences in in GenBank. GenBank. This This link link can can be be physically implemented 1 ) by implemented in in two two different different ways: ways: ((1) by extracting extracting GenBank GenBank identifiers identifiers from from the the l eotide MED LINE format MEDLINE format of of the the PubMed PubMed citation, citation, or or (2) (2) by by capturing capturing the the Nuc Nucleotide Link L i n k as as implemented implemented via via the the Entrez Entrez interface. interface. Both Both implementations implementations expect expect to to capture all the PubMed citation. capture all the GenBank GenBank identifiers identifiers relevant relevant to to a a given given PubMed citation. These These two classes; however, two links links have have same same starting starting and and ending ending classes; however, they they do do not not appear appear to 91 to be be equivalent. equivalent. Using Using the the first first implementation, implementation, the the PubMed PubMed citation citation 85521 8552191 refers n contrast, refers to to four four GenBank GenBank identifiers. identifiers. I In contrast, the the Nucleotide Nucleotide Link Link representing representing the identifiers. Based Based on the second second property property returns returns eight eight GenBank GenBank identifiers. on the the dissimilar dissimilar

4.6 Conclusion Conclusion 4.6

101 101

cardinality of of results results (the (the number number of of returned returned sequences), sequences), the the two two properties properties are are cardinality not identical. identical. This This can can also also be be true true for for paths paths (informal (informal sequences sequences of of links) links) between between not entities. To make the the scenario scenario more more complex, complex, there there could could be be multiple multiple alternate alternate entities. To make paths (links) ( links) between between a a start start entity entity and and an an end end entity entity implemented implemented in in completely completely paths different sources. different sources. A BIS BIS able able to to exploit exploit source source capabilities capabilities and and information information on on the the semantics semantics A of the the relationships relationships between scientific objects objects would would provide provide users users the the ability ability to to of between scientific browse scientific scientific data data in in a a transparent transparent and and intuitive intuitive way. way. browse
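The comparison of the two PubMed-to-GenBank links can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Link` class, the helper functions, and the placeholder identifiers `SEQ1` to `SEQ8` are hypothetical and are not part of any system described in this chapter. Only the result counts (four versus eight for citation 8552191) come from the text; that the four identifiers form a subset of the eight is likewise assumed.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence


@dataclass(frozen=True)
class Link:
    """A link between two classes of scientific objects (hypothetical model)."""
    name: str
    start: str  # starting class, e.g. "PubMed citation"
    end: str    # ending class, e.g. "GenBank sequence"
    resolve: Callable[[str], FrozenSet[str]]  # source identifier -> target identifiers


def path_endpoints(path: Sequence[Link]) -> tuple:
    """Return (starting class, ending class), checking that consecutive links chain."""
    for left, right in zip(path, path[1:]):
        if left.end != right.start:
            raise ValueError(f"{left.name} does not connect to {right.name}")
    return path[0].start, path[-1].end


def follow(path: Sequence[Link], identifier: str) -> FrozenSet[str]:
    """Follow each link of the path in turn, starting from a single identifier."""
    current = frozenset({identifier})
    for link in path:
        current = frozenset(t for s in current for t in link.resolve(s))
    return current


def compare_paths(a: Sequence[Link], b: Sequence[Link], identifier: str) -> dict:
    """Compare two paths on one starting object.

    Equal results on a sample are evidence of equivalence, not proof;
    different results show the paths are not equivalent.
    """
    if path_endpoints(a) != path_endpoints(b):
        raise ValueError("paths must share starting and ending classes")
    ra, rb = follow(a, identifier), follow(b, identifier)
    return {
        "cardinalities": (len(ra), len(rb)),
        "only_in_first": ra - rb,
        "only_in_second": rb - ra,
        "same_result": ra == rb,
    }


# Placeholder data: only the counts (four versus eight) are taken from the text.
MEDLINE_IDS = {"8552191": frozenset(f"SEQ{i}" for i in range(1, 5))}
ENTREZ_IDS = {"8552191": frozenset(f"SEQ{i}" for i in range(1, 9))}

medline_link = Link("MEDLINE extraction", "PubMed citation", "GenBank sequence",
                    lambda pmid: MEDLINE_IDS.get(pmid, frozenset()))
entrez_link = Link("Entrez Nucleotide Link", "PubMed citation", "GenBank sequence",
                   lambda pmid: ENTREZ_IDS.get(pmid, frozenset()))

report = compare_paths([medline_link], [entrez_link], "8552191")
print(report["cardinalities"], report["same_result"])  # (4, 8) False
```

Comparing result sets rather than only their sizes also shows which identifiers one implementation misses, which may help a curator decide whether the difference is a matter of coverage or of semantics.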

4.6 CONCLUSION

Traditional technology often does not meet the needs of life scientists. Each research laboratory uses significant manpower to adjust and customize the available technology as much as possible. Because traditional approaches have failed to support scientific discovery, life scientists have proven to be highly creative in developing their own tools and systems to meet their needs. Traditional database systems lack flexibility: life scientists use flat files instead. Some of the tools and systems developed in scientific laboratories may not meet the expectations of computer scientists, but they perform and support thousands of life scientists. The development of BIS is driven by the needs of a community. But practice shows that the community now needs the development of systems that are more engineered than before, and computer scientists should be involved.

There are good reasons why traditional technology fails to meet the requirements of BIS. Databases are data-driven and lack flexibility at the level of data representation. XML and other semi-structured approaches may offer this needed flexibility, but the development of native semi-structured systems is still in its infancy. Knowledge bases offer different semantic layers to leverage queries with the exploration process, but they should be coupled with databases to perform traditional data manipulation. On the other hand, agent architectures and grids provide flexible and transparent management of tools. Each of these approaches may and should contribute to the design of BIS.

The systems presented in this book constitute the first generation of BIS. Each system addresses some of the requirements presented in Chapter 2. Each presented system is still successfully used by life scientists; however, the development of each of these systems taught a lesson. To be successful, the design of the next generation of BIS should take advantage of these lessons and exploit and combine all existing approaches.


ACKNOWLEDGMENTS


The author wishes to thank Louiqa Raschid and Barbara Eckman for fruitful discussions that contributed to some of the material presented in this chapter.

REFERENCES
[[1] 1] [2] [2] [3] A. Baxevanis. "The Molecular Biology " Baxevanis."The Biology Database Collection: 2002 Update. Update."

1 (2002): 1-12. Nucleic Acids Research 30, no. 1


C. Burks. " Nucleic Acids Research 27, no. 1 Burks. "Molecular Biology Biology Database Database List. List." 1 1999): http://nar.oupjournals.org/cgi/content/full/27/1/1.. ((1 999): 1-9, http://nar.oupjournals.orglcgilcontentlfuIU27/1/1 A. Baxevanis. "The Molecular Biology Database Collection: An Online Compilation of Relevant Database Resources." Nucleic Acids Acids Research 28, no. 1(2000): 1 (2000) : 1-7. A. Baxevanis. "The "The Molecular Biology Database Collection: An Updated Compilation of Biological Database Resources. " Nucleic Acids Research 29, no. 1 Resources." 1 (2001): (2001 ): 1-10. U. S. Department U.S. Department of Energy Human Human Genome Program. "Report from the 1999 0, no. 3-4 U. U. S. S. DOE DOE Genome Meeting: Informatics." Human Human Genome News News 1 10, 3-4 (1999): ( 1 999): 8.
1. Karsch-Mizrachi, D. Lipman, et al. "GenBank." " GenBank." Nucleic Acids D. Benson, Benson, I. D. 2000) : 15-18. 1 5- 1 8 . http://www.ncbi.nlm.nih.gov/Genbank. 1 (January 2000): Research 28, no. 1

[4]

[5]

[6]

[7] [7]

Databank and A. Bairoch and R. Apweiler. "The "The SWISS-PROT Protein Sequence Databank 1 (January Its Supplement TrEMBL in 1999." Nucleic Acids Research 27, no. 1 1 999): 49-54. 49-54. http://www.expasy.ch/sprot. http://www.expasy.chlsprot. 1999): Database (SGD). (SGD) . http://genome-www.Stanford.edu/ http://genome-www.Stanford.edul Saccharomyces Genome Database saccaromycesl. saccaromyces/. Department Department of Genetics, Genetics, Stanford Stanford University. M. Rebhan, Rebhan, V. Chalifa-Caspi, ChaIifa-Caspi, J. Prilusky, et al. "GeneCards: "GeneCards: A Novel Functional M. Compendium with Automated Data Mining And Query Reformulation Reformulation Genomics Compendium Automated Data Bioinformatics 14, no. 8 (July 1998): 1 99 8 ) : 656-664, 656-664, Support. " Bioinformatics Support." http://bioinformatics.weizmann.ac.illcardsICABIOS_paper.html. http://bioinformatics.weizmann.ac.il/cards/CABIOS_paper.html.

[8]
[9]

[ 1 0] PubMed. PubMed. http://www.ncbi.nlm.gov/pubmed/. http://www.ncbi.nlm.gov/pubmed!. National National Library of of Medicine. Medicine. [10]

NCBI GenBank GenBank Statistics, Statistics, revised revised March March 12, 1 2, [ 1 1 ] GenBank. "Growth " Growth of of GenBank." GenBank." NCBI [11] 2002, http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html. http://www.ncbi.nlm.nih.gov/Genbanklgenbankstats.html. 2002,
[ 1 2] International International Human Human Genome Genome Sequencing Sequencing Consortium. Consortium. "Initial "Initial Sequencing Sequencing and and [12] Analysis of of the the Human Human Genome." Genome. " Nature Nature 409 409 (February (February 2001): 200 1 ) : 860-921. 860-9 2 1 . Analysis [ 1 3] J. ]. Venter, M. M. Adams, Adams, E. Myers, Myers, et e t al. a l . "The "The Sequence Sequence of o f the the Human Human Genome." Genome. " [13] Science Science 291, 291 , no. n o . 5507 5507 (February (February 2001): 2001 ) : 1304-1351. 1 3 04-1 3 5 1 .

References References

1 03
E Clark, et al. "Isis: The Intron Information System [14] L. Croft, S. Schandorff, F. Reveals the High Frequency of Alternative Splicing in the Human Genome. " Genome." Nature Genetics 24 (2000): 340-34 1. 340-341. [15] B. Splicing: Increasing Diversity iin [ 15] B . Graveley. "Alternative Splicing: n the Proteomic World." Trends in Genetics 1 7, no. 2 (2001 ): 1 00-107. 17, (2001)" 100-107.
[ 1 6 ] S. Misener and S. Krawetz. Bioinformatics: Methods and Protocols. Protocols. Methods in [16] Molecular Biology, no. 1 32. Totowa, N]: 999. 132. NJ: Humana Humana Press, 1 1999. [ 1 7] D [17] D.. Hollingsworth. The Workflow Reference Model. Hampshire, UK: Workflow Management Coalition, 1 995, http://www.wfmc.orglstandards/docs/tc003v1 1 .pdf. 1995, http://www.wfmc.org/standards/docs/tc003v11.pdf.

[18] P. Buneman, S. S. Davidson, G. Hillebrand, et al. "A Query Language and [ 1 8] P. Optimization Techniques 99 6 Techniques for Unstructured Data." In Proceedings of of the 1 1996
A CM SIGMOD n Management of SIGMOD International Conference Conference o on of Data (Montreal, 1 996), 505-5 1 6 . New York: ACM Press, 1 996. June 4-6, 1996), 505-516. 1996.
[ 1 9] S. [19] S. Nestorov, ]. J. Ullman, ]. J. Wiener, et al. "Representative Objects: Concise Representations Representations of Semi-Structured Hierarchical Data. In Proceedings of of the Thirteenth International Conference 1, 1 997 Conference on Data Engineering (April 7-1 7-11, 1997 Birmingham U.K.), 79-90. Washington, D.e.: 997. D.C.: IEEE IEEE Computer Society, Society, 1 1997.

[20] W. Fan. "Path "Path Constraints for Databases with or without Schemas." Ph.D. [20] dissertation, University of Pennsylvania, 1 999. 1999.
C) [21 ] D. Fallsidc. XML Schema Part 0: O: Primer: Primer: World Wide Web Consortium (W3 (W3C) [21] Fallside. XML Recommendation, 1, Recommendation, May 2, 200 2001, http://www.w3c.org/TRl2001IREC-xmlschema-0-2001 05021. http ://www.w3 c. org/TR/2001/REC-xmlschema- 0-20010502/.
[22] 1: Structures: XML Schema Part 1. [22] H. Thompson, Thompson, D. Beech, Beech, M. Maloney, et al. XML 1, World Wide Web Consortium (W3C) Recommendation, Recommendation, May 2, 200 2001, http://www.w3c.org/TRl2001/REC-xmlschema-2001 0502/. http://www.w3c.org/TR/2001/REC-xmlschema- 1 1-20010502/. [ 23] P. Biron and A. Malhotra. [23] Malhotra. XML XML Schema Part 2: Datatypes: World Wide Web Consortium (W3C) Recommendation, May 2, 200 1, 2001, http://www.w3.org/TRl2001IREC-xmlschema-2-2001 0502. http ://www.w3. org~R/2001/REC-xmlschema-2-20010502. [24] S. Abiteboul, P. P. Buneman, and D. Suciu. Data on the Web. San Francisco: Morgan Kaufmann, 2000. [25] [25] H. Wain, E. Bruford, R. Lovering, et al. "Guidelines for Human Human Gene Nomenclature (2002) . " Genomics 79, no. 4 (April 2002): 464-470. (2002)." [26] Human [26] Human Genome Database (GDB). (GDB). http://www.gdb.org. The Hospital for Sick Sick Children, Baltimore, MD: ]ohns Johns Hopkins University. [27] J GenAtlas Database, " Comptes J.. Frezal. " "GenAtlas Database, Genes and Development Defects. Defects." Rendus de I'Academie 0 l'Acaddmie des Sciences-Series HI: III: Sciences Sciences de la Vie 321, no. 1 10 (October 1 99 8 ) : 805-81 7. 1998): 805-817. [28] irection des Systemes [28] GenAtlas. D Direction Syst~mes d'Information, Universite Universit(~ Paris 5, France. http://www.dsi.univ-paris5.frlgenatlas/. http://www.dsi.univ-paris5.fr/genatlas/.

1 04

104

4 4

Issues ile Issues to to Address Address Wh While

"4>T<.rn ical IInformation nformation .... Designing a a Biolog Biological System

[29] [29] A. Hamosh, A. F. E Scott, J. Amberger, et al. al. "Online Mendelian Inheritance in Man 15, 1 (2000): 5 57-61. (OMIM)." (OMIM ) . " Human Mutation 1 5, no. 1 7-61 .

[30] On Online (OMIM).. [30] line Mendelian Inheritance in Man (OMIM) http://www.ncbi.nlm.nih.gov/entrez/query.fcgi ?db=omim. World Wide Web http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=omim. interface developed by the National Center for Biotechnology information (NCBI, National Library of Medicine.
[3 1 ] K. Pruitt, K. Katz, H. Sciotte, [31] Sciotte, et al. "Introducing Refseq and LocusLink: Curated Human Genome Resources at the NCBI." Trends 6, no. 1 Trends in Genetics Genetics 1 16, 1 (2000): 44-47. 44-47. [32] [32] LocusLink. http://www.ncbi.nlm.nih.gov/locuslinkl. http://www.ncbi.nlm.nih.gov/locuslink/. National Center for Biotechnology Information (NCBI), National Library of Medicine. [33] D. Schuler. [33] G. G.D. Schuler. "Pieces of the Puzzle: Expressed Sequence Tags and the Catalog of 1 997): 694-698. Human Human Genes." Journal of of Molecular Medicine 75, no. 10 ((1997): 694-698. [34] M. A. Chen, V. [34] I. I.M.A. V. M. Markowitz. "An Overview of the Object-Protocol Model (OPM) and OPM Data Management Management Tools." Information Information Systems Systems 20, no. 5 (1995): 393-418. ( 1 995): 393-4 18. [[35] 3 5 ] J. Berlin and and A. Motro. Motro. "Autoplex: Automated Automated Discovery of Content for Virtual Databases. " In Proceedings Databases." Proceedings of of the 9th International Conference Conference on Cooperative Information 08-122. New York: Information Systems Systems (Trento, (Trento, Italy, Italy, September 5-7, 5-7, 2001), 1 108-122. Springer, 200 1. 2001. [36] [36] A A.. Doan, Doan, P. Domingos, and A A.. Levy. Levy. "Learning Source Descriptions for Data Proceedings of Integration." In Proceedings of the Third International Workshop on the Web and Integration." Databases (Dallas, (Dallas, May 18-19, 1 8- 1 9, 2000), 81-86. 8 1-86. [37] Pearson and D . Lipman. "Improved [37] W. Pearson and D. "Improved Tools Tools for Biological Biological Sequence Sequence Proceedings of Academy of Comparison." Comparison." Proceedings of the National National Academy of Science 85, no. 8 (April 1988): 1 98 8 ) : 2444-2448. 2444-2448 .

[38] S. Altschul, W. Gish, W. Miller, et al. "Basic Local Alignment Search Tool." (October 1 990): 403-410. 403-410. Journal of of Molecular Biology 215, no. 3 (October 1990): http://www.ncbi.nlm.nih.gov/BLAST. http://www.ncbi.nlm.nih.gov/BLAST.
[39] E. Glemet and J-J. Codani. "LASSAP: A LArge Scale Scale Sequence Sequence CompArison Codani. "LASSAP: CompArison ( 1 997): 137-143. 1 3 7-143. http://www.gene-it.com. http://www.gene-it.com. Package." Bioinformatics 13, no. 2 (1997): R. D. Stevens, Stevens, C. A. Goble, Goble, P. Baker, et al. "A Classification of Tasks in [40] R.D. Bioinformatics. " Bioinformatics 17, 1 7, no. no. 2 (2001): (2001 ): 180-188. 1 8 0-1 8 8 . Bioinformatics." [41 ] S. S . Abiteboul, Abiteboul, R. Hull, Hull, and and V. V. Vianu. Foundations of of Databases. Boston: [41] Addison-Wesley, Addison-Wesley, 1995. 1 995. [42] National National Institute Institute of of Standards Standards and and Technology. Database Language SQL, June http://www.itl.nist.gov/fipspubs/fip127-2.htm. June 2,1993. http://www.itl.nist.gov/fipspubs/fip 127-2.htm. Manber and and W. Sun. "GLIMPSE: A Tool to to Search Through Entire File [43] U. Manber Search Through of USENIX Conference, 23-32. 23-32. Berkeley, CA: USENIX Systems." In Proceedings of Association. Association. 1994. 1 994.

References References
~ ~ ~

~ ........

~==== : ~ . ~ ~ ~ ~ ~ ~ . . ~ . _ _ . ~ o == ========: . .o . . . . . . - ==-

10 5 1 05

File Data [44] T. Etzold and P. Argos. "SRS: "SRS: An Indexing and Retrieval Tool for Flat File 1 993): 49-57. Libraries." Computer Applications of of Biosciences 9, no. 1 1 ((1993): 49-57. See also http://srs.ebi.ac.uk. [45] Entrez: Molecular Biology Database [45] G. Schuler, Schuler, J. Epstein, H. Ohkawa, Ohkawa, et aI. al. " "Entrez: Database and 1 996): 1 4 1-162. Retrieval System." Methods in Enzymology 266, ( (1996): 141-162. [46] [46] W. Wilbur and Y. Y. Yang. Yang. "An Analysis of Statistical Term Strength and and Its Use in the Indexing and and Retrieval of Molecular Biology Texts." Computers in Biology and Medicine 26, no. 3 ( 1 996): 209-222. (1996): 209-222. [47] L. Wong. "Some MEDLINE Queries Powered by Kleisli." A CCESS 25 (June ACCESS 1 99 8 ) : 8-9. 1998):

[48] G. Stoesser, Stoesser, W. Baker, Baker, A. Van Den Broek, et al. "The EMBL Nucleotide Sequence [48] " Nucleic Acids Research 3 1 , no. 1 ): 31, 1 (2003 (2003): Database: Major New Developments. Developments." 1 7-22. 17-22.
ark and S. DeRose. XML XML Path Language (XPath); (XPath). World Wide Web [49] ]. J. Cl Clark 6, 1 999. Consortium (W3C) Recommendation, November 1 16, 1999. http://www.w3.orgITRlxpath. http://www.w3.org/TR/xpath. [50] D. Chamberlin, D. Florescu, ]. [50] J. Robie, et al. XQuery; XQuery" A Query Language for XML; XML. World Wide Web Consortium (W3 C) Recommendation, 2000. (W3C) http://www.w3.orgITRlxmlquery. http://www.w3.org/TR/xmlquery. [5 1 ] S. Russel and P. Norvig. Artificial [51] Artificial Intelligence; Intelligence: A Modern Approach. Upper Saddle River, NJ: Prentice 995. Prentice Hall, 1 1995. [52] S. Kraus, J. Dix, and V. S. Subrahmanian. "Probabilistic Temporal Databases. " Databases." Artificial Intelligence Journal 1 27, no. 1 ) : 87-135. 127, 1 (2001 (2001):

CM [53] T. Eiter, "A ACM Eiter, ]. J. Lu, T. Lucasiewicz, Lucasiewicz, et al. "Probabilistic Object Bases. Bases." Transactions on Database Systems 26, no. 3 (200 1 ): 264-3 12. (2001): 264-312.
[54] [54] R. Stevens Stevens and A. Brass. Brass. "Using CORBA Clients in Bioinformatics Applications." Collaborative Computational Project 1 1 Newsletter 2.1, no. 3 ( 1 99 8 ) . http:// 11 (1998). www.hgmp.mrc.ac.uklCCPI 11CCP1 newslettersICCP 11NewsletterIssue3.pdf. I INewsletterIssue3.pdf. www.hgmp.mrc, ac.uk/CCP 11/CCP 11 lnewsletters/CCP [55] L. Wang, P. P. Rodriguez-Tome, Rodriguez-Tom~, N. Redaschi, et al. "Accessing and Distributing EMBL Data Using CORBA (Common Object Request Broker Architecture) ." Architecture)." Genome Biology 1 , no. 5 (2000). 1, [56] [56] Globus. Open grid services services architecture (OGSA). (OGSA). http://www. Globus.orglogsa/. Globus.org/ogsa/. [57] [57] Cactus: open source problem solving environment. environment, http://www.cactuscode.orgl. http://www.cactuscode.org/. [58] [58] NSF funded. Teragrid. http://www.teragrid.orgl. http://www.teragrid.org/. [59] [59] European Union. Datagrid. http://eu-datagrid.web.cern.chleu-datagrid!. http://eu-datagrid.web.cern.ch/eu-datagrid/. [60] T. Berners-Lee, D. Connoly, and R. Swick. "Web Architecture: Describing and Exchanging Data." C Note, June 1999. http://www.w3.org/1999/04IWebData. W3C http://www.w3.org/1999/04/WebData. Data." W3 " Scientific [ 6 1 ] T. Berners-Lee, ]. [61] J. Hendler, and o. O. Lassila. "The Semantic Web. Web." American (May 2001 ). 2001).

1 06 106

Issues to Address Address While While


= ~

. "eT'"'' nformation .... Designing a Biological IInformation System

[62] . " In [62] D. Fensel, I. Horrocks, R. van Harmelem, et al. "OIL in a Nutshell Nutshell." Proceedings of 2 th International European Knowledge Acquisition of the 1 12th Conference, EKAW 2000, Juan-les-Pins, France, October 2-6, 2-6, 2000. LNAI: Springer Veriag, Verlag, 2000. W A-OO: The XML-Enabled Wide-Area Searches [63] T. Critchlow. Report on XE XEWA-O0. Searches for Bioinformatics Bioinformatics Workshop. Workshop. New York: IEEE Computer Society, Society, 2000. [64] M. Friedman and and D D.. Weld. "Efficiently Executing Information-Gathering Plans." In Proceedings of of the Fifteenth International International Joint Conference on Artificial Intelligence (Nagoya, 997), 785-791. (Nagoya, Japan, Japan, August 23-29, 1 1997), 785-791. San Francisco: 997. Morgan Morgan Kaufmann, 1 1997. [65] E. Lambrecht, S. Kambhampati, and and S. Gnanaprakasam. Gnanaprakasam. "Optimizing Recursive Information-Gathering of the Sixteenth Sixteenth International International Joint Information-Gathering Plans." In Proceedings of Conference on Artificial Intelligence (Stockholm, 1 -August 6, 1 999), (Stockholm, July 3 31-August 1999), 1204-1 2 1 1 . San Francisco: Morgan 999. 1204-1211. Morgan Kaufmann, 1 1999.
Y. Levy. [66] [66] D. Florescu, D. Koller, A. Y. Levy. "Using Probabilistic Information in Data Integration." In Proceedings of of 23rd International International Conference Conference on Very Very Large Large Data 1997, 216-225. Bases (August 25-29, 1 997, Athens), 2 1 6-225. San Francisco: Morgan Kaufmann, 1997. 1 997.

[67] [67] G. Mihaila, L. Raschid, and M-E. Vidal. "Using Quality of Data Metadata Metadata for Source Selection and Ranking." In Proceedings Proceedings of of the Third International International Workshop on the Web and Databases 8-1 9, 2000), 93-98. Databases (Dallas, (Dallas, May 1 18-19, 93-98. In conjunction conjunction with the ACM SIGMOD, 2000. [68] G. A. Mihaila, [68] G.A. Mihaila, L. Raschid, and M.-E. Vidal. Vidal. "Using "Using Quality of Data Metadata Metadata for Source Selection of the Selection and Ranking." In D. Suciu, G. Vossen (eds.). Proceedings of Third International Databases, WebDB 2000. Dallas, International Workshop on the Web and Databases, Texas, May 1 8-19, 2000, 18-19, 2000, in conjunction conjunction with with ACM PODS/SIGMOD 2000. Informal proceedings.

[69] c. C. Baru, A. Gupta, Gupta, B. Ludascher, Lud/ischer, et al. "XML-Based Information Information Mediation with MIX." International Conference on Management MIX." In Proceedings A CM SIGMOD International of -3, 1 999, Philadelphia), 597-599. of Data (June (June 1 1-3, 1999, 597-599. New York: Association for 999. Computing Machinery (ACM) Press, 1 1999. [70] A. Levy, Querying Heterogeneous Levy, A. Rajaraman, Rajaraman, and and J. Ordille. " "Querying Heterogeneous Information Sources Using Source Descriptions." In Proceedings of of 22nd International International Conference on Very 996, Mumbai, India), Very Large Data Bases Bases (September 3-6, 1 1996, 996. 25 1-262. San Francisco: Morgan Kaufmann, 1 251-262. 1996.

[71 ] Y. Y. Papakonstantinou structured Data." [71] Papakonstantinou and and V. V. Vassalos. "Query Rewriting for Semi Semistructured In Proceedings of CM SIGMOD of the A ACM SIGMOD International International Conference on Management of -3, 1 999, Philadelphia), 455-466. 999. of Data (June (June 1 1-3, 1999, 455-466. New York: ACM Press, 1 1999. [72] E. Vidal. " Mediation Techniques for Multiple Autonomous [72] M. M.E. "Mediation Autonomous Distributed
Information " Ph.D. dissertation, Universidad Simon Information Sources. Sources." SimOn Bolivar, Caracas, Venezuela, 2000. 2000.

References References
~ . o

........~ o ~ o = ~ ~ ~ o ~ = ~ o ~ , ~ o ,

..... = = = = ~ o ~ = ~ . o ~ . ~ ~

107 1 07

V. Vassalos and Y. Papakonstantinou. "Describing and Using Query Capabilities [73] V. of Heterogeneous Sources." In Proceedings Proceedings of of 23rd International Conference Conference on Very Large Data Bases 99 7, Athens), 256-265. San Francisco: Bases (August (August 25-29, 25-29, 1 1997, 997. Morgan Kaufmann, Kaufmann, 1 1997. [74] R. Yerneni, e. C. Li, H. Garcia-Molina, et al. "Computing Capabilities of [74] Mediators. " In Proceedings Proceedings A CM SIGMOD International Conference Conference on Mediators." Management of -3, 1 999, Philadelphia), of Data (June (June 1 1-3, 1999, Philadelphia), 443-454. New York: ACM Press, 1 999. 1999.
Optimized Seamless [75] B. Eckman, Z. Lacroix, and and L. Raschid. " "Optimized Seamless Integration of Biomolecular Data." Data." In 2nd IEEE International Symposium on Bioinformatics , 23 32 . 23-32. and Bioengineering Bioengineering (Bethesda, (Bethesda, Maryland, Maryland, November 4-5, 4-5, 2001) 2001), Washington D.e.: 1. D.C.: IEEE IEEE Computer Society, Society, D.e., D.C., 200 2001.
-

" IEEE [76] [76] Z. Lacroix. "Biological Data Integration: Wrapping Data and Tools. Tools." Transactions Transactions on Information Technology Technology in Biomedicine 6, no. 2 (June 2002): 1 23-128. 123-128.

[77] Z. Nie Nie and and S. Kambhampati. "Joint Optimization of Cost and and Coverage of Query 2001 A CM CIKM International Plans in Data Integration. " In Proceedings Proceedings of the 2001 ACM Integration." Conference on Information and Knowledge Management (November 5-1 0, 2001, 5-10, Atlanta), 223-230. 1. 223-230. New York: ACM Press, 200 2001.
Extending Traditional Query-Based [78] B. Eckman, A A.. Kosky, and L. Laroco. " "Extending Integration Integration Approaches for Functional Characterization of Post-Genomic Data." Data." Bioinformatics 1 7, no. 7 (200 1): 5 8 7-601 . 17, (2001): 587-601.

[79] J J.. Ullman. Principles Principles of of Database and Knowledge-Base Knowledge-Base Systems, volume n II.. Palo 989. Alto, CA: Computer Science Press, 1 1989.

[80] J. Ullman. "Information Integration Using Logical Views." In Proceedings Proceedings of of the Sixth International Conference 9-40. Springer, 1 997. Conference on Database Theory, 1 19-40. 1997. [81] K. Morris. "An Algorithm for Ordering Subgoals in NAIL! " In Proceedings Proceedings of of the [81] NAIL!" Seventh A CM SIGACT-SIGMOD-SIGART Symposium on Principles ACM Principles of of Database 98 8 . Systems (March 1 -23, 1 988, Austin, Texas), 8 . New York: ACM Press, 1 (March 2 21-23, 1988, Texas), 82-8 82-88. 1988.
Randomized Algorithms for Optimizing Large Join [82] Y. Y. Ioanidis and Y. Y. Kang. " "Randomized Queries." In Proceedings 990 A CM SIGMOD International Conference Proceedings of of the 1 1990 ACM Conference 12 321 . New on Management of 990) , 3 of Data Data (Atlantic City, City, Nj, NJ, May 23-25, 23-25, 1 1990), 312-321. York: ACM Press, 1 990. 1990.
-

amber Iin, et al. "Access [83] P. P. Selinger, Selinger, M. Astrahan, D D.. Ch Chamberlin, "Access Path Selection in a Relational Database Management System." In Proceedings 979 ACM Proceedings of the 1 1979 A CM SIGMOD International Conference Conference on Management of of Data (Boston, May 30-june1), 979. 30-June1), 23-34. New York: ACM Press, 1 1979.

[84] [84] M. Steinbrunn, G. Moerkotte, and and A. Kemper. "Heuristic and and Randomized Optimization for the Join Ordering Problem." VLDB journal 1 997): Journal 6, no. 3 ((1997): 1 9 1-208. 191-208.

108 1 08

4 4

Issues Issues to to Address While While

\I''' Q lrYl nformation .... Designing a Biological IInformation System

[85] P. P. Rigaux, M. Scholl, Scholl, and and A. Voisard. Spatial Databases-With DatabasesmWith Applications to GIS. San San Francisco: Morgan Kaufmann, Kaufmann, 2001 2001.. [86] [86] D D.. Hudson and M M.. Cohen. Neural Networks and Artificial Intelligences for IEEE Press, 2000. Biomedical Engineering. New York: IEEE [87] S. Subrahmanian. Principles of [87] V. V.S. of Multimedia Database Systems. San Francisco: 1998. Morgan Kaufmann, 1 99 8 .

CHAPTER 5

SRS: An Integration Platform for Databanks and Analysis Tools in Bioinformatics

Thure Etzold, Howard Harris, and Simon Beaulah

The Sequence Sequence Retrieval Retrieval System System (SRS) (SRS) approach approach to to data data integration integration has has evolved evolved The many years over many years to to address address the the needs needs of of researchers researchers in in the the life life sciences sciences to to query, query, over retrieve, retrieve, and and analyze analyze complex, complex, ever ever increasing, increasing, and and changing changing biological biological data. data. SRS SRS follows follows a a federation federation approach approach to to data data integration, integration, leaving leaving the the underlying underlying data data sources in in their 1 ] is is used used in in flat sources their original original formats. formats. For For example, example, Genbank Genbank [ [1] flat file file format; the Genome (GO) [2] format or or as re format; the Genome Ontology Ontology (GO) [2] is is used used in in either either XML XML format as relational tables tables stored stored in MySQL [3]. [3]. Databanks Databanks generated generated and provided by by the lational in MySQL and provided the major technologies technologies available are integrated integrated through is provided provided major available are through meta-data, meta-data, which which is for majority of of the the common public data customers use use this this funcfunc for the the majority common public data sources. sources. SRS SRS customers tionality to integrate integrate their their own in-house data, data, such such as as gene expression databases, data bases, tionality to own in-house gene expression with third-party Foundation data data [4], [4], and public data with third-party data data such such as as Incyte Incyte LifeSeq LifeSeq Foundation and public data such as Swiss-Prot [6]. such as EMBL EMBL [5] [5] and and Swiss-Prot [6] . 
Databanks in SRS can be queried and analyzed via a Web interface or through a variety of application programming interfaces (APIs) as described in Section 5.8. Using one of a variety of query forms, the user can search a single databank or a combination of databanks. Search results can be further analyzed using a suite of tools like BLAST [7] and FASTA [8] for sequence similarity searching. SRS provides support for about 200 tools including a major part of EMBOSS [9]. This is further described in Section 5.7.

Meta-data is at the heart of SRS. Each data source is fully described, including the type and structure of data, relationships to other data sources, how the data should be indexed or presented to users, and how it can be mapped to external object models. SRS uses a meta-data only approach, which is based on its internal programming language, Icarus. Administrators can customize SRS by editing Icarus files or through the use of a graphical user interface. No access programs or wrappers need to be written by programmers as they do with other integration systems like DiscoveryLink (Chapter 11) and Kleisli (Chapter 6). An exception is the set of syntactic and semantic rules that need to be composed for the integration of flat file databanks. The result of the meta-data approach is a flexible and modular system that has adapted to all the changes and developments in bioinformatics over the past 10 years. Many approaches to data integration have been proposed over this time to address the needs described in this book, but SRS has surpassed them all to provide the only proven and widely used flexible data integration environment.

SRS aims to remain independent of the technology used for data storage.
Extensible markup language (XML), flat files, and relational databases bring with them a range of benefits and problems that often create a particular mind-set for the people who use and maintain them. Flat file databanks are the "dinosaurs" in this field, albeit very successful in defying extinction. Flat file data are compact and generally very flexible to work with. They are mostly semi-structured and are presented in a vast variety of formats, which in their multitude make parser writing an almost impossible task. This is further described in Section 5.1. XML is an elegant way of representing data and is ideally suited for transferring information between tools and applications (e.g., communicating genomic data to a genome browser). XML offers great flexibility, which can present formidable challenges for integration. How SRS meets these challenges is described in Section 5.2. Relational databases create a world of tables, columns, and relationships, providing a structured and maintainable data store.
However, the Structured Query Language (SQL) is not a common skill of most researchers, and the scientific concepts researchers wish to analyze are often lost somewhere in the ever-growing database schema. For almost all relational databases in molecular biology, a bespoke interface had to be built. Section 5.3 covers this technology. SRS supports these three technologies (flat file, XML, and relational databases) and can map all data into flexible and extendable object models as described in Section 5.6. Using the object loader, users can define their own views of the data to display, for example, gene expression data with information from GenBank and from InterPro [10]. This type of view combines XML, relational, and flat file data seamlessly and is completely in the control of the user. Section 5.6 also describes how data can be exported as XML to other applications in a standard or customized way.
Providing access to all data sources is only the first step of data integration. The relationships between the different data sources are represented in SRS and are used to form an interconnected set of data sources referred to as the SRS Universe. Mapping of attributes and approaches to semantic integration is addressed in Section 5.5. When a new data source is added to the SRS Universe, relationships to already integrated resources can be defined. This allows the SRS administrator and a domain expert to combine their knowledge to implement an SRS Universe that reflects the intricacies of their own data alongside a tried and tested public data SRS Universe. In this way, SRS provides the flexibility and extensibility that is required in a changing environment.

FIGURE 5.1 The SRS architecture.
[Diagram labels: visual meta-data editor; meta-data; dynamic retrieval and creation service; type information; indexing; SQL generation; automatic object relational mapping; database linking; tool options; DOM/SAX parsing; token server; XML databanks; relational databanks; flat file databanks; analysis tools.]

Figure 5.1 gives an overview of the SRS architecture. It shows the three data source types: XML, relational, and flat file databanks. The output of analysis tools is treated in the same way as flat file databanks. SRS provides specific technology to deal with each data source type. For example, flat file databanks use the token server, whereas relational databanks are integrated through object relational mapping and SQL generation modules. On top of these technologies are services such as the query service and the object loader.
They can be applied to all data sources in a transparent way. The APIs can be used by programmers to make use of these services. The SRS Web server gives an example of such an application.

Meta-data plays an important role in SRS. All data sources and analysis tools are fully described (e.g., file location, ftp source address, format) and all SRS modules are configured through meta-data. A visual editor can be used to access, modify, and create all SRS meta-data. All of the components are described in this book, if only briefly. Unfortunately, the scope of this book does not allow a complete description of them all.

The SRS server at the European Bioinformatics Institute (EBI) [11] has provided genomic and related data to the European bioinformatics community since 1994. It now serves more than 4 million hits per month, returning results in seconds and supporting thousands of researchers. The EBI SRS server has approximately 200 data sources integrated with many analysis tools. It also links to an access page with other major academic SRS servers and gives access to the freely available SRS meta-definition files for (currently) more than 700 public databases (see also Krell and Etzold's article "Data banks" [12]).
SRS is extensively used in large pharmaceutical and biotech companies and is the basis of the Celera Discovery System [13], the Incyte LifeSeq Foundation distribution, the Affymetrix NetAffyx portal [14], and the Thomson Derwent Geneseq portal [15]. The SRS community of academic and commercial users makes SRS the most widely used life science integration product.

Note: Throughout this text the words databank, database, and library are used in a seemingly interchangeable way. To clarify, database is used to refer to the sum of data and the actual system within which it is stored, databank to refer to the data only, and library to refer to the representation of a databank or database within SRS.

5.1 INTEGRATING FLAT FILE DATABANKS


Before the advent of XML, almost all data collections in molecular biology were available as sets of text files, also called flat file databanks. New databanks are now generally available in XML and, increasingly, flat file databanks can be obtained in an alternative XML format. However, flat files continue to be the only available form for many data collections and will stay an important source of information for years to come.

Overall, flat file databanks have a simple structure and usually consist of a single stream of entries represented in a text format with a special syntax. The entries can be very rich, containing comprehensive information about a protein, a DNA sequence, or a tertiary 3D structure. The formats for these flat file databanks vary greatly and are only rarely shared. Once defined, individual formats will change continuously to reflect the growth of complexity in the associated content.
Hundreds of these formats have been created, which makes parsing data in molecular biology a highly daunting task. SRS meets this challenge by providing tools that make writing parsers easy and very maintainable. Parsers to disseminate flat file entries, or token servers, are written in Icarus, the internal programming language for SRS.
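The "single stream of entries" structure can be illustrated with a short Python sketch. It is not SRS code. It assumes an EMBL-like layout in which each line starts with a two-letter line code and each entry ends with a `//` line; the function names `iter_entries` and `group_fields` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def iter_entries(lines: Iterable[str], terminator: str = "//") -> Iterator[List[str]]:
    """Yield each entry as a list of lines; the terminator line is dropped."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.strip() == terminator:
            if entry:
                yield entry
            entry = []
        elif line:
            entry.append(line)
    if entry:  # tolerate a final entry that lacks a terminator
        yield entry


def group_fields(entry: List[str]) -> Dict[str, List[str]]:
    """Group the lines of one entry by their leading line code."""
    fields: Dict[str, List[str]] = {}
    for line in entry:
        code, _, text = line.partition(" ")
        fields.setdefault(code, []).append(text.strip())
    return fields


# Invented sample data in an EMBL-like layout (assumption, for illustration).
SAMPLE = """\
ID   X00001
DE   Example protein,
DE   hypothetical.
//
ID   X00002
DE   Another example.
//
"""

entries = [group_fields(e) for e in iter_entries(SAMPLE.splitlines())]
print(len(entries))
print(entries[0]["DE"])
```

Even this toy splitter hard-codes one format's conventions, which hints at why hundreds of differing formats call for a rule-based approach rather than one hand-written program per databank.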

5.1.1 The SRS Token Server


SRS has a unique approach for parsing data sources that has proved effective for supporting many hundreds of different formats. With traditional approaches, a parser would be written as a program. This program would then be run over the data source and would return a structure, such as a parse tree, that contains the data items to be extracted from the source. In the context of structured data retrieval, the problem with that approach is that, depending on the task (e.g., indexing or displaying), different information must be extracted from the input stream. For instance, for data display the entire description field must be extracted, but for indexing the description field needs to be broken up into separate words.

A new approach, called the token server, was devised. A token server can be associated with a single entity of the input stream, such as a databank entry, and it responds to requests for individual tokens.
Each token type is associated with a name (e.g., descriptionLine) that can be used within this request. The token server parses only upon request (lazy parsing), but it keeps all parsed tokens in a cache so repeated requests can be answered by a quick look-up into the cache.

A token server must be fed with a list of syntactic and optionally semantic rules. The syntactic rules are organized in a hierarchic manner. For a given databank there is usually a rule to parse out the entire entry from the databank, rules to extract the data fields within that entry, and rules to process individual data fields. The parsed information can be extracted on each level as tokens (e.g., the entire entry, the data fields, and individual words within an individual data field). Semantic rules can transform the information in the flat file.
For example, amino acid names in three-letter code can be translated to one-letter code, or a particular deoxyribonucleic acid (DNA) mutation can be classified as a missense mutation. Figure 5.2 shows an example of how two different identifiers for protein mutations can be transformed into four different tokens using a combination of syntactic and semantic rules.

The following example of Icarus code defines rules for the tokens AaChange, ProteinChangePos, and AaMutType for the first variation of the mutation key Leu39Arg.
    Key:              ~ {$In:Entry $Out} ln {$Wrt} ~
    AaChange:         ~ {$In:Key $Out
                         $code={ala:A arg:R asn:N asp:D
                                cys:C glu:E gln:Q gly:G
                                his:H ile:I leu:L lys:K
                                met:M phe:F pro:P ser:S
                                thr:T try:W tyr:Y val:V}}
                        aa {$aaSave = $Ct} num aa {
                          $aa1 = $code.($aaSave.lower)
                          $aa2 = $code.($Ct.lower)
                          $Wrt:[s:"$aa1>$aa2"]
                        }
                      ~
    ProteinChangePos: ~ {$In:Key $Out} aa num {$Wrt} ~
    AaMutType:        ~ {$In:AaChange $Out}
                        /[A-Z]>[A-Z]/ {$Wrt:[s:substitution]} |
                        /[A-Z]>\*/ {$Wrt:[s:termination]}
                      ~
    ln:               ~ /[^\n]*\n/ ~
    aa:               ~ /[a-zA-Z]+/ ~
    num:              ~ /[0-9]+/ ~

The rules are specified in Icarus, the internal programming language of SRS. Icarus is in many respects similar to Perl [16]. It is interpreted and object-oriented with a rich set of functionality. Icarus extends Perl with its ability to define formal rule sets for parsing. Within SRS it is also used extensively to define and manipulate the SRS meta-data.

FIGURE 5.2 The SRS token server applied to mutation identifiers.
[Diagram labels: Leu39Arg; AaChange: L>R; ProteinChangePos: 39; AaMutType: substitution; missense.]


The previous code example contains seven rule definitions. Each starts with a name, followed by a colon and then the actual rule, which is enclosed within ~ characters. These rules are specified in a variant of the Extended Backus Naur Form (EBNF) [17] and contain symbols such as literals, regular expressions (delimited by / characters), and references to other rules. In addition they can have commands (delimited by { and }), which are applied either after the match of the entire rule (command at the beginning of the rule) or after matching a symbol (command directly after the symbol).

In the example three different functions are called within commands. $In specifies the input tokens to which the rule can be applied. For example, AaChange specifies as input the token table Key, which is produced by the rule Key. $Out specifies that the rule will create a token table with the name of the current rule. The command $Wrt writes a string into the token table opened by the current rule. For example, the $Wrt command following the reference to the ln rule in the rule Key will write the line matched by ln into the token table Key.

Using commands $In and $Out, rules can be chained by feeding each other with the output token tables they produce. For example, rule AaChange processes the tokens in token table Key and provides the input for rule AaMutType. Lazy parsing means that only the rules necessary to produce a token table will be activated. To retrieve the Key tokens, the rules Key and Entry need to be processed. To obtain AaMutType, the rules AaMutType, AaChange, Key, and Entry are invoked. Production AaChange uses an associative list to convert three-letter amino acid codes to their one-letter equivalents. AaMutType is a semantic rule that uses standard mutation descriptions provided by AaChange to determine whether a mutation is a simple substitution or leads to termination of the translation frame by introducing a stop codon.

Advantages of the token server approach are:
Advantages Advantages of of the the token token server server approach approach are: are:
• It is easy to write a parser where the overall complexity can be divided into layers: entry, fields, and individual field contents.
• The parser is very robust; a problem parsing a particular data field will not break the overall parsing process.
• A rule set consists of simple rules that can be easily maintained.
• Lazy parsing allows adding rules that will only be used in special circumstances or by only a few individuals.
• Lazy parsing allows alternative ways of parsing to be specified (e.g., retrieval of author names as encoded in the databank or converted to a standard format).
• The parser can perform reformatting tasks on the output (e.g., insertion of hypertext links).
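The interplay of lazy parsing, caching, and rule chaining can be sketched in Python. This is an illustrative analogue, not Icarus and not the SRS implementation: the `TokenServer` class, the `RULES` table, and the helper functions are hypothetical names, and the Entry token table is simply pre-seeded rather than produced by a rule of its own.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Three-letter to one-letter amino acid codes. "trp" is the usual
# abbreviation for tryptophan; "try" is kept as an older spelling.
CODE = {"ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
        "glu": "E", "gln": "Q", "gly": "G", "his": "H", "ile": "I",
        "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
        "ser": "S", "thr": "T", "trp": "W", "try": "W", "tyr": "Y",
        "val": "V"}

KEY_RE = re.compile(r"([A-Za-z]+)([0-9]+)([A-Za-z]+)")


def key_rule(entries: List[str]) -> List[str]:
    """Syntactic rule: every non-empty line of the entry is one Key token."""
    return [ln.strip() for e in entries for ln in e.splitlines() if ln.strip()]


def aa_change(keys: List[str]) -> List[str]:
    """Semantic rule: 'Leu39Arg' becomes 'L>R'."""
    changes = []
    for key in keys:
        match = KEY_RE.fullmatch(key)
        if not match:
            continue
        old, new = match.group(1).lower(), match.group(3).lower()
        if old in CODE and new in CODE:
            changes.append(f"{CODE[old]}>{CODE[new]}")
    return changes


def protein_change_pos(keys: List[str]) -> List[str]:
    """Syntactic rule: extract the residue position from each key."""
    return [m.group(2) for m in map(KEY_RE.fullmatch, keys) if m]


def aa_mut_type(changes: List[str]) -> List[str]:
    """Semantic rule: classify each change by its textual pattern."""
    kinds = []
    for change in changes:
        if re.fullmatch(r"[A-Z]>[A-Z]", change):
            kinds.append("substitution")
        elif re.fullmatch(r"[A-Z]>\*", change):
            kinds.append("termination")
    return kinds


# Each rule names its input token table (like $In) and a function that
# produces its own table (like $Out plus $Wrt).
Rule = Tuple[Optional[str], Callable[[List[str]], List[str]]]
RULES: Dict[str, Rule] = {
    "Key": ("Entry", key_rule),
    "AaChange": ("Key", aa_change),
    "ProteinChangePos": ("Key", protein_change_pos),
    "AaMutType": ("AaChange", aa_mut_type),
}


class TokenServer:
    """Answers token requests for one entry, parsing lazily and caching."""

    def __init__(self, entry: str, rules: Dict[str, Rule]) -> None:
        self._rules = rules
        self._cache: Dict[str, List[str]] = {"Entry": [entry]}
        self.activated: List[str] = []  # rules that actually ran, in order

    def tokens(self, name: str) -> List[str]:
        if name not in self._cache:
            source, func = self._rules[name]
            inputs = self.tokens(source) if source else []
            self.activated.append(name)
            self._cache[name] = func(inputs)
        return self._cache[name]


server = TokenServer("Leu39Arg\n", RULES)
print(server.tokens("AaMutType"))  # ['substitution']
print(server.activated)            # ['Key', 'AaChange', 'AaMutType']
```

Requesting AaMutType runs only the rules on its dependency chain; ProteinChangePos is never evaluated unless someone asks for it, and a repeated request is served from the cache without running any rule again.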


5.1.2 Subentry Libraries


Flat file databanks are often described as being semi-structured. This stems from the lack of a formal description of the contents, which may just be mentioned briefly in a readme file or user manual. While individual entries in a flat file databank describe a real-world object such as a gene or protein, it is often possible to discover entities within these entries that are worth querying and retrieving as independent entities.

Consider a nucleotide sequence entry in EMBL or GenBank that describes an entire genome, or a large part of it, encoding hundreds of genes. It contains for every gene, or coding sequence, a sub-entity, or sub-entry, which can look like the one shown in Figure 5.3.

With SRS these sub-entries can be parsed, indexed, and retrieved as separate entities. There is still a tight association to the parent entry, but a separate databank of sub-entries is effectively created next to the databank of parent entries.
Sequence features have a special property in that they are contained within the sequence of the parent entry. The exact location of that sequence can be specified as a complex join statement following the CDS keyword, as shown in Figure 5.3. SRS uses this information to retrieve the sub-sequence of the sequence feature as part of the sub-entry.

Many other flat file libraries have sub-entries (e.g., literature citations and comments). In addition, sequence feature tables of the sequence databanks are parsed to produce a new sub-entry type, counter, which is a list of counts of each feature type within an entry. Indexing these allows the scientist to make highly specific queries such as "all Swiss-Prot entries with exactly seven trans-membrane segments."
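Both ideas, retrieving the sub-sequence named by a join statement and counting feature types per entry, can be sketched in Python. This is not how SRS implements them. The sketch assumes 1-based inclusive coordinates, ignores operators such as complement, and assumes the feature key starts in column 6 of an FT line; the function names and the toy sequence are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def parse_join(location: str) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive, from a location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_subsequence(parent: str, segments: Iterable[Tuple[int, int]]) -> str:
    """Concatenate the listed segments of the parent sequence in order."""
    return "".join(parent[start - 1:end] for start, end in segments)


def count_feature_keys(feature_lines: Iterable[str]) -> Counter:
    """Count feature keys; qualifier and continuation lines are skipped
    because they leave the key column (column 6) blank."""
    counts: Counter = Counter()
    for line in feature_lines:
        if line.startswith("FT") and len(line) > 5 and line[5] != " ":
            counts[line[5:].split()[0]] += 1
    return counts


# Toy parent sequence and feature table, invented for illustration.
PARENT = "ACGTACGTAC"
FEATURES = [
    "FT   CDS             join(2..4,7..9)",
    'FT                   /gene="CD4"',
    "FT   exon            2..4",
    "FT   exon            7..9",
]

segments = parse_join(FEATURES[0])
print(segments)                                # [(2, 4), (7, 9)]
print(extract_subsequence(PARENT, segments))   # CGTGTA
print(count_feature_keys(FEATURES)["exon"])    # 2
```

Once such counts are stored per entry, a query like the seven trans-membrane segments example reduces to an equality test on one counter value.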

5.2 INTEGRATION OF XML DATABASES


XML is becoming increasingly important within the bioinformatics community. There are several good reasons for using XML as a medium for the storage and transmission of bioinformatics data.

• Because XML has a universally recognized format built on a stable foundation [18], it has become the primary means of exchanging information over the Internet.

• A variety of tools make it relatively easy to manage XML data and transform it into other formats (e.g., an extensible stylesheet language transformation


FT   CDS             join(12151..12199,12319..12483,26154..26312,26771..27004,
FT                   28068..28415,29142..29342,30433..30554,30859..30926,
FT                   31311..31341)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01730"
FT                   /note="major receptor for HIV-1; member of immunoglobulin
FT                   supergene family; T cell surface glycoprotein T4"
FT                   /gene="CD4"
FT                   /function="T-cell coreceptor; involved in antigen
FT                   recognition; participant in signal transduction pathway"
FT                   /product="surface antigen CD4"
FT                   /protein_id="AAB51309.1"
FT                   /translation="MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQ
FT                   KKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDS
FT                   DTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNI
FT                   QGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFS
FT                   FPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLH
FT                   LTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLML
FT                   SLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMAL
FT                   IVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

FIGURE 5.3 A Protein Coding Sequence (CDS) Feature in EMBL.

(XSLT) [19] style sheet may be used to transform XML data to hypertext markup language (HTML) format for display in a Web browser).

• By allowing users to create their own syntax (element and attribute names) and structure (hierarchical parent-child relationships between elements), XML gives database designers great freedom to transform their mental models of an information system into a concrete form.

However, people conceptualize information in very different ways, particularly in a complex field like bioinformatics. This makes it difficult, if not impossible,


to create widely accepted XML standards for bioinformatics data. Furthermore, different organizations are interested in different aspects and constellations of the bioinformatics data universe, which includes DNA sequences, proteins, structures, expressed sequence tags (ESTs), transcripts, metabolic pathways, patents, mutations, publications, and so forth. If all of these data types were incorporated into a single format, it would be extremely complex and unwieldy.

For these reasons, many companies and organizations have given up the quest for a universal XML standard for bioinformatics data. Instead, they have created their own standards, which are often customized versions of existing standards, optimized for use in internal applications. SRS has remained neutral in the standards war by striving to develop flexible tools that support all the existing and emerging bioinformatics XML formats.

5.2.1 What Makes XML Unique?


Data formatting in XML is similar to data formatting in flat files. Figure 5.4 shows how the EMBL flat file data in Figure 5.3 might appear if rendered in XML format. The key features that make XML formats different from flat file formats are as follows.

1. XML uses two distinct kinds of tags for wrapping data: elements and attributes. There is no hard-and-fast rule for what kinds of data should be encapsulated in attributes rather than in elements. In general, attributes tend to be used for short pieces of data that have a one-to-one relationship with the data in the parent element, such as IDs and classifications. Figure 5.4 illustrates both types of data encapsulation (the only piece of data expressed as an attribute value is the feature ID, CDS).

2. There are two types of syntax that can be used for XML elements.

a. Normal syntax encloses the data belonging to an element between a start tag (e.g., <join>) and an end tag (e.g., </join>).

b. Empty syntax may be used for elements that either have no data content or have content that may be stored efficiently in attribute values. The InterPro format created by the EBI uses empty db_xref elements for specifying references to external databases:

<db_xref db="EC" dbkey="2.7.4.9"/>

3. Some XML elements (e.g., feature_list) are used as structural components that define hierarchical relationships between other elements but contain no data of their own.


<feature_list>
  <feature id="CDS">
    <join>(12151..12199,12319..12483,26154..26312,26771..27004,28068..28415,
    29142..29342,30433..30554,30859..30926,31311..31341)</join>
    <codon_start>1</codon_start>
    <db_xref>SWISS-PROT:P01730</db_xref>
    <note>major receptor for HIV-1; member of immunoglobulin supergene family;
    T cell surface glycoprotein T4</note>
    <gene>CD4</gene>
    <function>T-cell coreceptor; involved in antigen recognition;
    participant in signal transduction pathway</function>
    <product>surface antigen CD4</product>
    <protein_id>AAB51309.1</protein_id>
    <translation>MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQ
    KKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDS
    IVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
    </translation>
  </feature>
</feature_list>

FIGURE 5.4 EMBL flat file data rendered as XML.

4. Empty elements may be used as structure-only elements to delimit entries (or sub-entries). To support both normal and empty element entry delimiters, the SRS XML parser must have two different types of behavior.

a. For entries delimited by start and end tags, entry processing terminates when the end tag is found.

b. For entries delimited by empty element tags, there is no end tag, so entry processing terminates when the start tag of the next entry is found or when the end of the file is reached.

5. XML allows users to define shorthand expressions to represent commonly used strings. These expressions are called general entities. For example, the


entity &spdb; could stand for the name of a database (e.g., SwissProt Release). When an XML parser encounters a general entity reference like &spdb; in an attribute value or element content, it must replace the reference with the replacement text.

6. Some commonly used characters have special meaning in XML.

a. Less-than signs [<] and greater-than signs [>] are used in markup tags.

b. Apostrophes ['] and quotation marks ["] are used to delimit attribute values.

c. Ampersands [&] are used to specify general entity references.

If these characters occur within XML attribute values or element content, they can create ambiguities for an XML parser, so they must be handled with care.

7. XML data may also be encapsulated in CDATA sections that may appear wherever character data may appear. Inside CDATA sections, less-than signs and ampersands are treated as literals (i.e., they do not need to be replaced with entity references).
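Points 5 through 7 can be observed with any conforming XML parser. The following minimal sketch uses Python's standard library rather than the SRS parser; the `entry`, `db`, and `note` element names and the sample text are assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A general entity declared in the internal DTD subset, referenced in both
# an attribute value and element content, plus a CDATA section.
doc = """<!DOCTYPE entry [
  <!ENTITY spdb "SwissProt Release">
]>
<entry source="&spdb;">
  <db>&spdb;</db>
  <note><![CDATA[score < 5 & length > 100]]></note>
</entry>"""

root = ET.fromstring(doc)
source = root.get("source")      # entity reference replaced in the attribute
db_text = root.find("db").text   # entity reference replaced in the content
note = root.find("note").text    # CDATA content is returned literally

# Outside CDATA, the special characters must be written as entity references.
escaped = escape("score < 5 & length > 100")
print(source, db_text, note, escaped, sep="\n")
```

Once parsed, the CDATA text and the escaped text are indistinguishable to the application; the difference exists only in the serialized document.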

5.2.2 How Are XML Databanks Integrated into SRS?
XML is fully integrated into the SRS universe of databanks, and it is relatively easy to incorporate XML libraries into an SRS installation. The only prerequisite is a document type definition (DTD) that accurately describes the structure of the XML. If a DTD does not exist, a utility such as Michael Kay's DTDGenerator [20] can be used to create one.

The first step in the configuration process is to run an SRS utility, which analyzes the DTD and creates templates for all the meta-data objects needed to define the new library. The user must then edit the resulting object definitions. Initially, the user must supply all of the extra information needed to perform the basic indexing and loading tasks. The next step is to register the new databank with SRS and index the library. Once the library has been indexed, all the standard library operations become available.

If the new XML library contains sub-entry libraries or takes advantage of any special indexing or loading features, the administrator must perform additional editing to define the sub-entry libraries or to activate these features. Integrating an XML library into SRS is easier than integrating a flat file library because SRS does most of the work of creating the library meta-information. Also, the use of


a built-in generic XML parser eliminates the need for writing a library-specific parser.

5.2.3 Overview of XML Support Features


Support for Complex DTDs

A DTD is a set of declarations that defines the syntax and structure of a particular class of XML documents. DTDs may consist of an internal subset (inside an XML document) and/or any number of external subsets in separate files. External subsets may be invoked recursively from within other external subsets. DTDs may also incorporate INCLUDE and IGNORE blocks (conditional sections) containing different sets of declarations to be used in different applications, and these blocks may be activated or deactivated using variables called parameter entities. Thus, DTDs can be quite complex.

The SRS utility used to parse DTD files employs a sophisticated algorithm to process external DTDs recursively in accordance with the guidelines laid down in the World Wide Web Consortium's XML Version 1.0 Recommendation [18]. This ensures that if a DTD includes multiple declarations of the same general entity or default attribute value, the correct values are used in generating the SRS meta-information. The utility also supports the use of parameter entities and correctly processes conditional sections.
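The rule for repeated declarations can be checked with a general-purpose parser. Under the XML 1.0 Recommendation, when the same entity is declared more than once, the first declaration is binding. The sketch below uses Python's standard library (which relies on the expat parser), not the SRS DTD utility; the entity name `db` and its replacement texts are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same general entity is declared twice in the internal subset.
# XML 1.0 specifies that the first declaration encountered is binding.
doc = """<!DOCTYPE entry [
  <!ENTITY db "first declaration">
  <!ENTITY db "second declaration">
]>
<entry>&db;</entry>"""

value = ET.fromstring(doc).text
print(value)
```

This is why processing order matters when declarations are spread over an internal subset and several external subsets: a tool that reads them in the wrong order would bind the wrong replacement text.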
Support for Indexing and Querying

SRS provides several powerful features to give users control over the way XML data is indexed and queried.

Micro-parsing allows users to pre-process data before it is written to an index field. For example, suppose the data contains an author element that uses initials-first formatting (e.g., <author>J. K. Rowling</author>), but the user would like to index this in initials-last format (e.g., Rowling, J. K.). The indexing metaphor for the author element would refer to an Icarus syntax file containing a production to transform the data. Micro-parsing allows users to apply the same types of syntactic and semantic rules used for flat file parsing to the contents of individual XML tags.

Splitting allows users to subdivide input data strings containing lists into their component index values. For example, suppose the data contains an authors element containing a list of author names separated by commas and white space (e.g., <authors>J. K. Rowling, William Shakespeare, Stephen King</authors>), but the user would like to index this list as three separate author names. The indexing metaphor for the authors element would include a split attribute (e.g., split:[,]) specifying a regular expression containing a list of the characters used to separate individual items in the list.

Conditional indexing allows users to process meta-data specified within the XML stream. For example, the Bioinformatic Sequence Markup Language (BSML) format [21] uses an XML element called Attribute as a container for three different types of data: version, source, and organism. The name attribute is a meta-data field that identifies the type of data contained in the associated content attribute.

<Attribute name="version" content="AB003468.1 GI:2656021"/>
<Attribute name="source" content="Cloning vector pAP3neo DNA."/>
<Attribute name="organism" content="Cloning vector pAP3neo"/>

Conditional indexing may be used to channel the data contained in the three content attributes into three separate index fields designed to hold version, source, and organism data.

SRS provides solid support for indexing and querying subentry libraries. In some XML formats, a single type of element is used in more than one sub-entry library, and the element may have a different meaning in each library. To index the data contained in these elements into the correct set of target index fields, SRS allows users to create separate fields and indexing metaphors for each unique instance of the element. The indexing metaphors use a special path attribute to determine which sub-entry library the element currently being processed belongs to so that the data can be indexed into the correct field. Conversely, SRS also allows users to index data from a single element or attribute into multiple index fields.
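The three indexing features described above can be mimicked in plain Python to make their effect concrete. This is a sketch under stated assumptions, not SRS or Icarus code: the function names, the separator pattern, and the `Attributes` wrapper element are hypothetical, and the real transformations would be expressed as Icarus productions and indexing metaphors.

```python
import re
import xml.etree.ElementTree as ET

def initials_last(name: str) -> str:
    """Micro-parsing stand-in: 'J. K. Rowling' becomes 'Rowling, J. K.'."""
    *initials, surname = name.split()
    return f"{surname}, {' '.join(initials)}" if initials else surname

def split_authors(text: str) -> list[str]:
    """Splitting stand-in: break a comma-separated list into index values."""
    return [item for item in re.split(r"\s*,\s*", text.strip()) if item]

def conditional_index(xml_text: str) -> dict[str, list[str]]:
    """Conditional indexing stand-in: route each content value to the
    index field named by the name attribute of the same element."""
    index: dict[str, list[str]] = {}
    for attr in ET.fromstring(xml_text).iter("Attribute"):
        index.setdefault(attr.get("name"), []).append(attr.get("content"))
    return index

# Hypothetical wrapper element around the BSML-style Attribute elements.
bsml = """<Attributes>
  <Attribute name="version" content="AB003468.1 GI:2656021"/>
  <Attribute name="source" content="Cloning vector pAP3neo DNA."/>
  <Attribute name="organism" content="Cloning vector pAP3neo"/>
</Attributes>"""

print(initials_last("J. K. Rowling"))
print(split_authors("J. K. Rowling, William Shakespeare, Stephen King"))
print(conditional_index(bsml))
```

Note that the separator pattern here splits on commas and trims surrounding white space, so multi-word names such as "William Shakespeare" stay intact.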

5.2.4 How Does SRS Meet the Challenges of XML?
Problems with managing XML data can be divided into two main categories: syntactical/semantic (microscopic) and structural (macroscopic). Data formatting varies widely between standard XML formats, and pre-processing is often required before data can be indexed or loaded. Table 5.1 describes several common syntactical problems and explains how SRS solves them.

XML formats, like flat file formats, can be large, complex, and unwieldy, making data access difficult and inefficient. Table 5.2 describes several common structural problems that occur in XML formats used in bioinformatics. SRS provides solutions to some of these problems, but some can only be solved by restructuring the data.


Problem: Fields may include special characters (e.g., colons, square brackets, dashes, and asterisks) that can interfere with SRS query syntax.
SRS Solution: Use micro-parsing to purify data fields during indexing.

Problem: Fields may contain data whose type depends on the value of another (meta-data) field.
SRS Solution: Use conditional indexing to index data into different fields based on the value in a condition (meta-data) field.

Problem: Fields may contain characters that require special handling in XML (e.g., less-than signs, greater-than signs, apostrophes, quotation marks, and ampersands).
SRS Solution: Use micro-parsing to replace problematic characters with pre-defined character entity references.

Problem: Entity references (both pre-defined and user-defined) must be replaced before data is indexed or loaded. Entity references may include markup.
SRS Solution: SRS provides sophisticated entity replacement functionality.

Problem: A single element may be used in two or more subentry libraries to contain different types of information.
SRS Solution: Users can create separate fields and indexing metaphors for each instance of an element used in a different subentry library. The indexing metaphors use a path attribute to index the correct data into the correct fields.

Problem: Mixed content elements are difficult to parse because content belonging to the parent element is interspersed with content belonging to child elements.
SRS Solution: SRS provides two special loading commands (xsl:copy-of and xsl:value-of) that emulate useful features found in the extensible stylesheet language transformations (XSLT) language [19].

Problem: Fields may contain lists of values that must be separated into individual values.
SRS Solution: Indexing metaphors can include a split attribute to split a string into sub-strings using a set of separator characters contained in a regular expression.

TABLE 5.1 Syntactical problems and SRS solutions.
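The path-based row of Table 5.1 can be illustrated with a short sketch in which the same element name is routed to different index fields depending on where it occurs. Everything here is hypothetical (the `entry`, `feature`, `reference`, and `name` elements, the field names, and the `FIELD_BY_PATH` mapping); it only mimics the idea behind the path attribute of an SRS indexing metaphor.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from element path to target index field.
FIELD_BY_PATH = {
    "entry/feature/name": "feature_name",
    "entry/reference/name": "journal_name",
}

def index_by_path(xml_text: str) -> dict[str, list[str]]:
    """Index element text into the field selected by the element's full path."""
    index: dict[str, list[str]] = {}

    def walk(element: ET.Element, path: str) -> None:
        field = FIELD_BY_PATH.get(path)
        text = (element.text or "").strip()
        if field and text:
            index.setdefault(field, []).append(text)
        for child in element:
            walk(child, f"{path}/{child.tag}")

    root = ET.fromstring(xml_text)
    walk(root, root.tag)
    return index

sample = """<entry>
  <feature><name>CDS</name></feature>
  <reference><name>Nature</name></reference>
</entry>"""

print(index_by_path(sample))
```

Keying on the full path rather than the tag name is what keeps the two `name` elements from landing in the same field.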

Problem: Some libraries use large numbers of structure-only tags. Tags can take up a lot of disk space without providing much useful information.
SRS Solution: No solution; inherent in XML.

Problem: Excessive numbers of tags make a format difficult to understand and manage.
SRS Solution: The SRS utility that generates library definition files uses intelligent parsing to eliminate structure-only elements from the set of meta-data objects that are included in the library definition file.

Problem: Deep nesting and large numbers of sub-entries slow down querying and loading performance.
SRS Solution: The SRS loading algorithm builds a document object model (DOM) object for each entry. This approach provides both optimal performance and highly reliable handling of sub-entries. It also provides some special loading commands that improve performance for particular types of loading tasks.

Problem: Content belonging to a single entity may be spread across multiple files.
SRS Solution: No solution; inherent in certain XML formats. Data should be restructured.

Problem: A single entity may appear repeatedly in multiple files.
SRS Solution: No solution; inherent in certain XML formats. Data should be restructured.

Problem: Excessive data redundancy slows down performance.
SRS Solution: No solution; inherent in certain XML formats. Data should be restructured.

TABLE 5.2 Structural problems and SRS solutions.

5.3 INTEGRATING RELATIONAL DATABASES

5.3.1 Whole Schema Integration

For relational databases, a schema organizes the data, defining the data entities and their relationships to each other. Because individual entities can only be modeled as flat tables, real-world concepts such as genes or metabolic pathways often use many tables to store the information faithfully. Conversely, to make full use of this data, the whole schema needs to be made available to the user. The user must be able to query one or more tables and then collect the necessary data from all related tables. For example, in a relational database storing genes, the user may query the author table for Lee, which will return the set of genes published by the author Lee. Behind the scenes the results from the query in the author table need to be related to other data (e.g., accession code, keyword, references, sequence), which is stored in other tables. The data is represented as a whole and must be assembled from many different tables before it is presented to the user.

The problem of mapping a table structure into a more complex object structure has been addressed before by object-relational mapping techniques. Traditional approaches start with a class description of the objects to be stored and then generate the relational schema from the class information. This is in conflict with the SRS approach of integrating existing schemas where often an object model has not yet been defined. The overwhelming majority of the schemas relevant to life science informatics (LSI) have been obtained by more traditional methods, such as entity-relationship (ER) modeling, and not by object-relational modeling.

The SRS approach is to use a semi-automated process to define object-relational mapping on top of an existing schema. This is achieved by selecting a table manually to be the hub table, or the table containing values equivalent to an object ID (usually an accession number or unique ID), and other tables that can be defined to belong to the object. Using the hub table, the selection of tables, and foreign key relationships, SRS can automatically create an object model, which is introduced to the system as a dynamic type. The resulting object can then be queried and retrieved as a fixed entity, much like an entry in a flat file or XML databank.

When a relational databank is integrated, no indexing on the SRS side needs to be done. SRS will generate SQL statements for querying and retrieval of objects that will emulate the same behavior as users expect when dealing with flat file and XML databanks.
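The author example above can be made concrete with a small Python sketch using the standard `sqlite3` module. It is not SRS code: the three-table schema, the table and column names, and the functions `query_entry_ids` and `assemble_entry` are assumptions made for illustration. The `gene` table plays the role of the hub table, and its `accession` column plays the role of the entry ID.

```python
import sqlite3

# Hypothetical miniature schema: "gene" is the hub table, "author" and
# "keyword" are related tables linked to it by foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene    (accession TEXT PRIMARY KEY, sequence TEXT);
    CREATE TABLE author  (accession TEXT REFERENCES gene(accession), name TEXT);
    CREATE TABLE keyword (accession TEXT REFERENCES gene(accession), word TEXT);
    INSERT INTO gene    VALUES ('X001', 'ATGC'), ('X002', 'GGCA');
    INSERT INTO author  VALUES ('X001', 'Lee'), ('X001', 'Kim'), ('X002', 'Park');
    INSERT INTO keyword VALUES ('X001', 'kinase'), ('X002', 'transport');
""")


def query_entry_ids(author_name):
    """Query a related table and return the matching hub-table IDs."""
    rows = conn.execute(
        "SELECT DISTINCT g.accession FROM gene g "
        "JOIN author a ON a.accession = g.accession "
        "WHERE a.name = ? ORDER BY g.accession",
        (author_name,),
    )
    return [row[0] for row in rows]


def assemble_entry(accession):
    """Collect data from the hub table and its related tables into one object."""
    sequence = conn.execute(
        "SELECT sequence FROM gene WHERE accession = ?", (accession,)
    ).fetchone()[0]
    authors = [row[0] for row in conn.execute(
        "SELECT name FROM author WHERE accession = ? ORDER BY name", (accession,))]
    keywords = [row[0] for row in conn.execute(
        "SELECT word FROM keyword WHERE accession = ? ORDER BY word", (accession,))]
    return {"accession": accession, "sequence": sequence,
            "authors": authors, "keywords": keywords}


ids = query_entry_ids("Lee")
entries = [assemble_entry(entry_id) for entry_id in ids]
print(entries)
```

The query step returns only entry IDs; the assembly step then gathers everything that belongs to each ID, so the caller sees one object per entry rather than rows from several tables.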

5.3.2 Capturing the Relational Schema


SRS Relational includes a Java program, schemaXML, which uses a standard Java Database Connectivity (JDBC) [22] interface to capture the relational database schema, including all the tables, columns, keys, and foreign key relationships. The program schemaXML passes this information to SRS, providing the base information to integrate the relational databank. All further meta-data for customization can be added by editing this schema information using a graphical interface or by direct manipulation of Icarus files. This provides a much simpler solution than would be required by writing individual integration programs for each relational databank to be integrated. If the schema changes, the program schemaXML just needs to be re-run and the edits reapplied. A tool is being developed that reapplies edits to an updated version of the original schema definition.
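The kind of schema capture described here can be approximated in a few lines of Python. This is not schemaXML, which is a Java program working through JDBC; it is a sketch that assumes an SQLite database and uses SQLite's own `PRAGMA table_info` and `PRAGMA foreign_key_list` statements in place of JDBC metadata calls. The XML element and attribute names are invented for illustration and do not reflect the format schemaXML actually produces.

```python
import sqlite3
import xml.etree.ElementTree as ET


def capture_schema(conn):
    """Describe tables, columns, primary keys and foreign keys as an XML tree."""
    root = ET.Element("schema")
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        table_el = ET.SubElement(root, "table", name=table)
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        for _cid, col, col_type, _notnull, _default, pk in conn.execute(
                f'PRAGMA table_info("{table}")'):
            ET.SubElement(table_el, "column", name=col, type=col_type,
                          primaryKey="true" if pk else "false")
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            ET.SubElement(table_el, "foreignKey", column=fk[3],
                          refTable=fk[2], refColumn=fk[4] or "")
    return root


# Hypothetical two-table database used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene   (accession TEXT PRIMARY KEY, sequence TEXT);
    CREATE TABLE author (accession TEXT REFERENCES gene(accession), name TEXT);
""")
schema = capture_schema(conn)
print(ET.tostring(schema, encoding="unicode"))
```

Because the description is generated from the database itself, re-running the function after a schema change yields an updated description, which mirrors the re-run step mentioned in the text.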

5.3.3 Selecting a Hub Table


SRS Relational is based on the concept of hub tables, which are used, conceptually, to relate relational database tables to data objects. Hub tables are central points of interest in a relational schema and must contain a unique name (typically a primary key) that can be used as an entry ID (e.g., an accession code in a sequence database). Using foreign key relationships, all data held in surrounding tables can be linked directly or indirectly back to the hub table and entry ID using table joins. All tables that belong to a hub table must be directly or indirectly linked with it. In cases where these links are not apparent from the schema information retrieved by schemaXML, they can be set manually within the visual administration interface. Figure 5.5 shows a section of the relational schema that is used to maintain the GO databank in the MySQL relational database management system (RDBMS).

FIGURE 5.5 Visual representation of part of the GO term schema within the SRS Visual Administration Tool. The table term is selected as a hub table. Individual lines between the tables represent foreign key relationships.


The term table is clearly the central point of interest in the schema with related tables surrounding it. It would be selected by the SRS administrator as a hub table for use in SRS. In other databases, the hub table selection may be less clear. For example, a Laboratory Information Management System (LIMS) database has many concepts such as sample, project, and experiment, each with its own collection of related tables. In these cases, multiple hub tables can be selected and associated to separate SRS libraries by the SRS administrator.

5.3.4 Generation of SQL


SRS sees the relational schema as a graph with tables as nodes and foreign key relationships as edges. The hub table is at the center of this graph. An idealized form of such a graph is shown in Figure 5.6. To translate an SRS query into SQL it maps the predicates to the appropriate columns and then to tables in the graph, and a shortest path is derived to relate these predicate queries to rows in the hub table using joins. An example is shown for three predicates in Figure 5.6 (A), which are all linked to the hub table. The SQL query will return a number of rows in the hub table, which are processed to create a list of entry IDs. To retrieve particular entries, a search path is again used, this time radiating out from the hub table and including the required tables using joins. See Figure 5.6 (B).

FIGURE 5.6 Hub table data access. (A) Querying: converge to hub. (B) Entry object assembly: diverge from hub.
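The idea of deriving a shortest join path through the schema graph can be sketched as follows. This is an illustration, not the SRS implementation: the foreign key list, the table and column names, and the functions `build_graph`, `shortest_join_path`, and `query_sql` are all assumptions, and the sketch handles a single predicate. It uses a breadth-first search, which finds a path with the fewest joins in an unweighted graph.

```python
from collections import deque

# Hypothetical schema graph. Each edge is one foreign key relationship,
# written as (table_a, column_a, table_b, column_b).
FOREIGN_KEYS = [
    ("term", "id", "term2term", "term1_id"),
    ("term", "id", "association", "term_id"),
    ("association", "gene_product_id", "gene_product", "id"),
    ("gene_product", "species_id", "species", "id"),
]


def build_graph(foreign_keys):
    """Turn the foreign key list into an undirected adjacency map."""
    graph = {}
    for a, col_a, b, col_b in foreign_keys:
        condition = f"{a}.{col_a} = {b}.{col_b}"
        graph.setdefault(a, []).append((b, condition))
        graph.setdefault(b, []).append((a, condition))
    return graph


def shortest_join_path(graph, hub, target):
    """Breadth-first search from the hub; returns (table, join condition) steps."""
    came_from = {hub: None}
    queue = deque([hub])
    while queue:
        table = queue.popleft()
        if table == target:
            break
        for neighbor, condition in graph.get(table, []):
            if neighbor not in came_from:
                came_from[neighbor] = (table, condition)
                queue.append(neighbor)
    if target not in came_from:
        raise ValueError(f"{target} is not linked to hub table {hub}")
    steps, node = [], target
    while came_from[node] is not None:
        previous, condition = came_from[node]
        steps.append((node, condition))
        node = previous
    return list(reversed(steps))


def query_sql(graph, hub, hub_id, target, column):
    """Build SQL that relates one predicate on `target` back to hub rows."""
    parts = [f"SELECT DISTINCT {hub}.{hub_id} FROM {hub}"]
    parts += [f"JOIN {table} ON {condition}"
              for table, condition in shortest_join_path(graph, hub, target)]
    parts.append(f"WHERE {target}.{column} = ?")
    return " ".join(parts)


graph = build_graph(FOREIGN_KEYS)
sql = query_sql(graph, "term", "id", "species", "genus")
print(sql)
```

The same path, walked outward from the hub table, could be reused for the retrieval step in which related tables are joined in to assemble an entry.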


5.3.5 Restricting Access to Parts of the Schema


Once the relational schema held on the SRS side has been configured to define a library, it can be further modified to restrict or allow access to the tables within the schema. Individual tables can be hidden from SRS so general access to the data is not available. In addition, the SRS access permissions can also be used to control access to the whole or sections of the schema (when using multiple hub tables). It is also possible to modify, add, and remove links between tables without altering the original database schema.

5.3.6 Query Performance to Relational Databases

During the development and use of SRS Relational a number of performance optimizations have been added. A few of these are outlined as follows.

• SRS is case insensitive, and it is well known that case-insensitive queries in relational databases can be expensive. Therefore, if all the values in a column are known to be in the same case, this can be indicated within the meta-description of the schema and used to reduce the querying time significantly.
• When required to do many table joins with One-to-Many (1:N) relationships within a single SQL query, the creation of the result table will suffer from combinatorial explosion. The definition of a table link contains a junction option, which, if turned on, will generate multiple small queries that can be run simultaneously and joined externally. This provides significant performance improvements for the object assembly process.
• For text and pattern searches it is possible to make use of text indices produced by the relational database. This results in much faster searches for text-based queries, such as keywords or author name.
• All query results are cached in a user-owned space. This is inexpensive because the query result is represented as a simple list of entry IDs. To repeat a query, it can be looked up in the cache and retrieved. The cache speeds up reinspection of queries, combining them with other queries, or displaying individual chunks of the result list in a Web interface.
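The combinatorial explosion mentioned in the second point is easy to reproduce. The following Python sketch uses `sqlite3` and a made-up schema (one entry with 20 authors and 30 keywords) to compare a single query that joins both 1:N relationships with two small queries whose results are combined outside the database. It only illustrates the effect; it does not show how the SRS option is implemented, and the small queries run one after another here rather than simultaneously.

```python
import sqlite3

# Hypothetical schema: one hub row with two independent 1:N relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entry   (id INTEGER PRIMARY KEY);
    CREATE TABLE author  (entry_id INTEGER, name TEXT);
    CREATE TABLE keyword (entry_id INTEGER, word TEXT);
    INSERT INTO entry VALUES (1);
""")
conn.executemany("INSERT INTO author VALUES (1, ?)",
                 [(f"author{i}",) for i in range(20)])
conn.executemany("INSERT INTO keyword VALUES (1, ?)",
                 [(f"kw{i}",) for i in range(30)])

# One large query: the two 1:N joins multiply, giving 20 * 30 result rows.
joined = conn.execute(
    "SELECT a.name, k.word FROM entry e "
    "JOIN author a ON a.entry_id = e.id "
    "JOIN keyword k ON k.entry_id = e.id "
    "WHERE e.id = 1").fetchall()

# Several small queries: one per relationship, 20 + 30 rows in total,
# combined in application code instead of in the database.
authors = [row[0] for row in conn.execute(
    "SELECT name FROM author WHERE entry_id = 1")]
keywords = [row[0] for row in conn.execute(
    "SELECT word FROM keyword WHERE entry_id = 1")]
entry = {"id": 1, "authors": authors, "keywords": keywords}

print(len(joined), len(authors) + len(keywords))  # 600 50
```

With more 1:N relationships the gap widens further, since the joined result grows with the product of the row counts while the separate queries grow with their sum.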

5.3.7 Viewing Entries from a Relational Databank

As mentioned previously, the selection of a hub table and associated tables is used to build an object model automatically. The resulting object can be displayed as an XML stream, which is, however, inconvenient for the user. One option for presenting the object in a human-readable form is to apply XSLT to the XML output. Another, more convenient one is to use the mechanism SRS provides to present objects to the Web by a layout description. Section 5.6 will describe how data from relational databanks can be combined with data from flat file or XML databanks into a single data structure.

5.3.8 Summary

Consistent with the SRS philosophy, relational databanks can be added through a meta-data only approach. With the exception of defining the HTML representation of the entry data, the entire process of creating and editing the meta-data can be done through mouse clicks in the visual administration interface of SRS. Not all the options in the configuration have been described here, including setting up sub-entry libraries, automatic handling of binary data (such as images and Microsoft Office documents), and table cloning to handle recursive and conditional relationships between tables.

SRS uses a simple interface class to interact with the relational systems. For speed and efficiency the C/C++ interfaces are preferred over JDBC. At present the following Relational Database Management Systems (RDBMS) are supported:

• Oracle [23]
• MySQL [24]
• Microsoft SQLServer [25]
• DB2 [26]

Relational databanks offer a lot of functionality, which needs to be matched by any system that mediates access to them. The meta-data approach of SRS Relational has proven to provide the flexibility to cope with new user requirements to exploit this functionality.

5.4 THE SRS QUERY LANGUAGE


SRS has its own query language. It supports string comparison including wildcards or regular expressions, numeric range queries, Boolean operators, and the unique link operators (see Section 5.5). Queries always return sets of entries or lists of entry IDs. Sets obtained by querying all databanks can be sorted using various criteria. The query language has no provision to specify sorting. Instead it is invoked using a method of the result set object that has been obtained by evaluating a query. To extract information from entries of result sets, further methods are available. These methods can retrieve the entire entry as a text or XML stream, retrieve individual token or field values, or load entries into data structures using pre-defined object loaders (see Section 5.6).

5.4.1 SRS Fields

A query predicate must refer to an SRS field, which has been assigned to the fields in the databank before query time. A query into the Swiss-Prot description field with the word "kinase" looks as follows in the SRS query language:

[swissprot-description:kinase]

This denotes a string search enclosed in [ and ]. The databank name swissprot is followed by the field name description. The search term kinase follows the field delimiter :. Because the description field is shared with the databank EMBL, the query can be extended to search both Swiss-Prot and EMBL, which then have to be enclosed in curly braces:

[{swissprot embl}-description:kinase]

Importantly, SRS fields are entities outside a given library definition, which must be mapped onto each field in a library. Whenever possible, the same SRS field is mapped to equivalent fields in different libraries. Through that mechanism each SRS library has a list of associated SRS fields that can be used for searching. Whenever the user selects multiple libraries for searching at the same time, it is possible to find out all the SRS fields that these have in common and represent them in a query form. SRS fields are an important mechanism to integrate heterogeneous databanks with different, but overlapping, content, and they also provide an important simplification because no knowledge of the internal structure of a given databank is required when retrieving and using the list of SRS fields.

A special SRS field exists with the name AllText. It is shared by all databanks and refers to all the text fields in all databanks. Through the use of this field, full-text queries can be specified.
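The two query forms shown in this section, and the idea of finding the SRS fields that several libraries have in common, can be sketched in Python. The helper names `field_query` and `common_fields` and the contents of `LIBRARY_FIELDS` are assumptions for illustration; only the bracket, brace, hyphen, and colon syntax is taken from the examples in the text.

```python
def field_query(databanks, field, term):
    """Build a field query string in the form shown in the examples above."""
    names = [databanks] if isinstance(databanks, str) else list(databanks)
    if not names:
        raise ValueError("at least one databank is required")
    # A single databank is written as is; several are enclosed in curly braces.
    target = names[0] if len(names) == 1 else "{" + " ".join(names) + "}"
    return f"[{target}-{field}:{term}]"


def common_fields(library_fields, selected):
    """Return the SRS fields shared by all selected libraries."""
    field_sets = [set(library_fields[name]) for name in selected]
    return sorted(set.intersection(*field_sets)) if field_sets else []


# Hypothetical mapping of libraries to their associated SRS fields.
LIBRARY_FIELDS = {
    "swissprot": {"accession", "description", "keywords", "organism"},
    "embl": {"accession", "description", "organism", "molecule"},
}

print(field_query("swissprot", "description", "kinase"))
print(field_query(["swissprot", "embl"], "description", "kinase"))
print(common_fields(LIBRARY_FIELDS, ["swissprot", "embl"]))
```

A query form for a multi-library search would offer only the fields returned by `common_fields`, which is the behavior the paragraph describes.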

5.5 LINKING DATABANKS

A common theme in databanks in molecular biology is that they all have explicit cross-references to other databanks. Especially now, in the postgenomic era in which many known proteins can be linked to a genome location and where results from gene expression and proteomics experiments can be used to understand how these proteins are regulated within the cell, individual data items have very limited value if they are not connected to other databanks. SRS supports and makes use of explicit cross-references in three ways:

• Hypertext links
• Indexed links
• Composite structures (see Section 5.6)

Hypertext links are the simplest mechanism and are ubiquitous on the Web. They are inserted into the appropriate places when displaying information to the user. Linking in this form can be operated on single entries and is very convenient and easy to understand. These links are easy to set up for an SRS Web server. Definitions can be shared among libraries and include options like displaying a link only if it contains a valid reference to an existing entry.

More powerful is the use of indexed links. A simple example of a query using indexed links is "give me all entries in Swiss-Prot that are linked to EMBL." SRS has a general capability to index links based on explicit or even implicit cross-reference information. Link indices are built using information from one side only. All links, once indexed, become bi-directional. An SRS server with many libraries and links can be seen as a graph where nodes are libraries and the edges are the links between them. Figure 5.7 shows such a graph for a comparatively small installation.

In this graph it is possible to link databanks that are not directly connected. For any pair of databanks, the shortest route can be determined and carried out by a multi-step linking process. SRS knows the topology of a given installation and can therefore always determine and execute this shortest path. If the shortest path is not what is desired, this can be specified explicitly within an SRS query language statement.

5.5.1 Constructing Links


Links can be constructed by identifying two SRS fields (see Section 5.4), each from one of the two SRS libraries to be linked, that contain identical field values. For instance, to create a link between Swiss-Prot and EMBL, you would select the accession field from EMBL and the data reference (DR) field from Swiss-Prot with explicit cross-references to other databanks. Another example is to link Swiss-Prot and Enzyme [27], which can be defined by the ID field from Enzyme and the description field of Swiss-Prot. The description field of Swiss-Prot carries one or more Enzyme IDs if the protein in question is known to have an enzymatic function.

For flat file and XML databanks, link indices must be built to make the link queryable. Indices can be built by comparing existing indices or by parsing the databank defined to contain the cross-reference information. Links between and from relational databanks are defined in the same way. However, no indices need to be built. A link query can be executed by querying the information provided in the table structure.

FIGURE 5.7 An SRS library network.
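As a rough illustration of the index-building step for flat file databanks, the following Python sketch parses cross-references from one side only and derives both link directions from them. It is a simplified model, not the SRS indexing code; `build_link_index` and all entry identifiers are hypothetical.

```python
from collections import defaultdict

def build_link_index(source_entries, xref_field, target_ids):
    """Index cross-references from one side and return both directions."""
    forward = defaultdict(set)   # source entry ID -> linked target IDs
    backward = defaultdict(set)  # target entry ID -> linked source IDs
    for entry_id, fields in source_entries.items():
        for ref in fields.get(xref_field, []):
            if ref in target_ids:  # keep only references that resolve
                forward[entry_id].add(ref)
                backward[ref].add(entry_id)
    return dict(forward), dict(backward)

# Made-up entries: Swiss-Prot-like records whose DR field lists EMBL accessions.
swissprot = {
    "P1": {"DR": ["X001", "X002"]},
    "P2": {"DR": ["X002", "X999"]},  # X999 is absent from the EMBL side
    "P3": {"DR": []},
}
embl_accessions = {"X001", "X002", "X003"}

forward, backward = build_link_index(swissprot, "DR", embl_accessions)
print(forward)
print(backward)
```

Only the Swiss-Prot side is parsed, yet the `backward` mapping lets the link be followed from EMBL as well, which matches the statement that indexed links become bi-directional.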

5.5.2 The Link Operators


Link operators are unique to the SRS query language. The two link operators, < and >, allow sets of data from different databanks to be combined. Figure 5.8 shows two databanks, A and B, in which some entries in A have cross-references to entries in B. These cross-references are processed to build link indices, which provide the basis for the link operation. Figure 5.8 also shows the results of two link queries between sets A and B, using the operators < and >.


FIGURE 5.8 The SRS link operators. Databank A holds entries A1 to A6 and databank B holds entries B1 to B5. The query A > B lists all the entries in set B that have links with set A; the query A < B lists all the entries in set A that have links with set B.
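The behavior of the two operators can be modeled with plain set operations. The sketch below is only an illustration of the semantics described above; the function names and the link pairs are assumptions and are not taken from Figure 5.8 or from the SRS query language implementation.

```python
def link_right(set_a, set_b, links):
    """A > B: entries of set_b that have a link with some entry of set_a."""
    return {b for a, b in links if a in set_a and b in set_b}

def link_left(set_a, set_b, links):
    """A < B: entries of set_a that have a link with some entry of set_b."""
    return {a for a, b in links if a in set_a and b in set_b}

# Hypothetical cross-references between entries of A and entries of B.
LINK_PAIRS = {("A1", "B3"), ("A2", "B3"), ("A5", "B4")}
A = {"A1", "A2", "A3", "A4", "A5", "A6"}
B = {"B1", "B2", "B3", "B4", "B5"}

print(sorted(link_right(A, B, LINK_PAIRS)))
print(sorted(link_left(A, B, LINK_PAIRS)))

# A predicate query can narrow one side before the link is applied.
subset_of_a = {"A5", "A6"}
print(sorted(link_right(subset_of_a, B, LINK_PAIRS)))
```

Because either side may itself be the result of an earlier query, restricting a set before linking is how a predicate and a link operator combine into a cross-databank query.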

By combining predicate queries with link operators it is possible to perform complicated cross-databank queries such as "retrieve all proteins in Swiss-Prot with calcium binding sites whose tertiary structure is known with a resolution better than 2 Angstrom." Another important use of the link operator is to convert sub-entries (e.g., sequence features) into entries and vice versa. With this link it is possible to search EMBL for all human CDS features (i.e., all sequence features describing coding sequences) or for all human DNA sequences that have CDS features.

5.6 THE OBJECT LOADER


The SRS object loader is a technology originally designed to transform semi-structured text data into well-defined data structures that can be accessed in a programmatic way. The object loader processes data according to a loader specification, or class definition, which specifies for each of its attributes how the attribute can be obtained from the text file. The following example shows such a loader for the mutation data in Figure 5.2.

    $LoadClass:[Mutation attrs:{
      $LoadAttr:[mutation load:$Tok:AaChange]
      $LoadAttr:[position type:int load:$Tok:AaChangePos]
      $LoadAttr:[aaMutType load:$Tok:AaMutType]
      $LoadAttr:[rnaMutType load:$Tok:RnaMutType]
    }]

This definition needs no information on how the required information is to be parsed out of the flat file. Only the name of the token is needed to make the association.

The object loader has been extended to support, in addition to flat file databanks, XML and relational databanks. A variety of ways have been added to specify the origin of the original data to be loaded. This includes using the SRS field abstraction, an XPath-like syntax for XML files (XPath is the XML Path language used for addressing parts of an XML document) [28] or pairs of table and row names for relational databanks. A single loader can be defined for a broad range of databanks. For example, a single sequence loader can be specified for all databanks with sequence information, and the original format can be flat file, XML, or relational.

In Section 5.8 an example is described for accessing the loaded objects within a client program.
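The separation between the loader and the parser can be mimicked in a few lines of Python. This is only an analogy to the loader specification, not the SRS object loader itself: the `MUTATION_LOADER` table, the `load_object` helper, and the token values are invented for illustration.

```python
from dataclasses import dataclass

# Each attribute maps to (token name, converter), mirroring the loader above.
MUTATION_LOADER = {
    "mutation": ("AaChange", str),
    "position": ("AaChangePos", int),
    "aaMutType": ("AaMutType", str),
    "rnaMutType": ("RnaMutType", str),
}

@dataclass
class Mutation:
    mutation: str
    position: int
    aaMutType: str
    rnaMutType: str

def load_object(cls, loader, tokens):
    """Build an object from named tokens; no parsing logic is needed here."""
    values = {attr: convert(tokens[name])
              for attr, (name, convert) in loader.items()}
    return cls(**values)

# Hypothetical tokens as a parser might deliver them for one entry.
tokens = {"AaChange": "Gly-12-Val", "AaChangePos": "12",
          "AaMutType": "substitution", "RnaMutType": "transversion"}
mutation = load_object(Mutation, MUTATION_LOADER, tokens)
print(mutation)
```

The loader table refers to tokens by name only, so the same table would work regardless of whether the tokens were produced from a flat file, an XML document, or a relational table.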

5.6.1 Creating Complex and Nested Objects


The loader specification supports other useful features, such as class inheritance, and supports various value types like string, integer, and real values and various types of lists. Using token indices (TINs), a feature of the token server that allows iteration over lists of complex structures inside a text entry, object classes can be nested to an arbitrary degree. Object loaders can build a structure to reflect the entry/sub-entry structure used for indexing, but can have deeper levels of nesting. A good example is an EMBL entry, which contains a list of sequence feature objects, each of which contains a list of qualifier value and name pairs (see Figure 5.3 for an example of an EMBL sequence feature).


5.6.2 Support for Loading from XML Databanks

SRS provides several features to give users control over loading from XML databanks. The utility that generates library definition files from an XML document type definition can generate two types of loaders, a flat loader based on the traditional main library/sub-entry library structure or a special structured loader, which is a collection of separate loaders for each element that mimics the tree structure of the original XML document. The structured loader makes it easier to load data from libraries that are heavily nested, and it is particularly useful for tasks like writing SRS page layout modules for HTML display.

SRS provides special support for random access of sub-entities within XML files (e.g., individual entries, sub-entries, fields, or collections of fields) or for loading high-level structures without some of the data nested within them (e.g., main entries without sub-entries). In addition, SRS offers two access mechanisms that mimic functionality available in XSLT.

Figure 5.9 shows a mixed content element called prolog that has child elements child_1 and child_2 interspersed with its content. The xsl:value-of functionality allows users to extract and concatenate all of the character data content of the prolog element and its child elements. A field loaded using the xsl:value-of keyword would contain the following text: "To be, or not to be, that is the question." Note that none of the attribute values are included. The xsl:copy-of functionality allows users to extract XML tree fragments, complete with markup. A field loaded using the xsl:copy-of keyword would contain the entire XML fragment shown in Figure 5.9. If neither of these special keywords were used, the prolog field would only include the content of the prolog element: "To be, to be, the question."

    <prolog>To be,
    <child_1 attribute_1="this text won't">or not</child_1>
    to be,
    <child_2 attribute_2="be included">that is</child_2>
    the question.
    </prolog>

FIGURE 5.9 Mixed content sample data.
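The three extraction behaviors described for the prolog element can be reproduced with Python's standard `xml.etree.ElementTree` module. This is an analogy to the SRS keywords, not their implementation; the helper names `value_of`, `copy_of`, and `own_text` are made up, and whitespace is normalized so the results are easy to compare.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<prolog>To be,
<child_1 attribute_1="this text won't">or not</child_1>
to be,
<child_2 attribute_2="be included">that is</child_2>
the question.
</prolog>"""

def normalize(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())

def value_of(elem):
    """Character data of the element and all descendants, without attributes."""
    return normalize("".join(elem.itertext()))

def copy_of(elem):
    """The complete tree fragment, including markup and attributes."""
    return ET.tostring(elem, encoding="unicode")

def own_text(elem):
    """Only the character data that belongs directly to the element."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return normalize("".join(parts))

root = ET.fromstring(SAMPLE)
print(value_of(root))
print(own_text(root))
print(copy_of(root))
```

In ElementTree, text that follows a child element is stored in that child's `tail`, which is why `own_text` collects the tails to recover the parent's own content.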


5.6.3 Using Links to Create Composite Structures


An important feature of the object loader is that it can perform links to retrieve single attributes or entire data objects from another linked library. Assuming that the Mutation databank is linked to Swiss-Prot, an example is adding an attribute to the Mutation loader with the description line from a linked Swiss-Prot entry. The following line instructs the object loader to link to Swiss-Prot using the shortest path and to extract the description token from the linked entry.

    $LoadAttr:[proteinDescription load:$Tok:[description link:swissprot]]

Another possibility would be, rather than extracting a single token, to attach an entire object as defined by an already existing loader class for Swiss-Prot, in this case the loader SeqSimple.

    $LoadAttr:[protein load:$Tok:[link:swissprot loader:$SeqSimple_Loader]]

As information about a certain real-world object is scattered across many databanks, object loaders can provide an extremely valuable foundation for writing programs to display or disseminate these real-world objects. It is possible to define and design these objects freely and in a second step decide where the individual information pieces can be retrieved for their assembly.
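The idea of assembling an object from linked entries can be sketched in Python as follows. This is a simplified model rather than the SRS mechanism: the dictionary-based loader format, the `load` function, and all entry data are assumptions, and each entry is assumed to link to at most one entry in the target library.

```python
# Loader for the linked library; a stand-in for an existing Swiss-Prot loader.
SEQ_SIMPLE_LOADER = {
    "id": {"token": "id"},
    "description": {"token": "description"},
}

# Attributes with a "link" key are resolved in the linked library.
MUTATION_LOADER = {
    "mutation": {"token": "AaChange"},
    "proteinDescription": {"token": "description", "link": "swissprot"},
    "protein": {"link": "swissprot", "loader": SEQ_SIMPLE_LOADER},
}

def load(entry, loader, libraries, links):
    """Assemble a dict; attributes with a 'link' come from a linked entry."""
    obj = {}
    for attr, spec in loader.items():
        source = entry
        if "link" in spec:
            linked_id = links.get((spec["link"], entry["id"]))
            source = libraries[spec["link"]].get(linked_id)
        if source is None:
            obj[attr] = None  # no linked entry found
        elif "loader" in spec:
            obj[attr] = load(source, spec["loader"], libraries, links)
        else:
            obj[attr] = source[spec["token"]]
    return obj

# Hypothetical data for illustration.
libraries = {"swissprot": {"P1": {"id": "P1", "description": "Example kinase"}}}
links = {("swissprot", "M1"): "P1"}
mutation_entry = {"id": "M1", "AaChange": "Gly-12-Val"}

composite = load(mutation_entry, MUTATION_LOADER, libraries, links)
print(composite)
```

The two linked attributes correspond to the two cases in the text: `proteinDescription` pulls a single token from the linked entry, while `protein` attaches a whole object built by another loader.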

5.6.4 Exporting Objects to XML


SRS allows users to export data assembled by the object loader to a generic XML format or to any of the standard XML formats. When converting data to a generic format, SRS creates a well-formed XML document with an accompanying DTD. This functionality can be invoked from the SRS Web interface or from one of the APIs (see Section 5.8). This functionality is provided for every object loader specification by default. If users wish to convert data to a specific format, a public standard, or a format the user invented, they must use a set of XML print metaphor objects that represent and describe the elements and attributes in the target format.

The process for creating XML print metaphors is similar to the process for creating an XML databank definition file. Before a new set of print metaphors can be generated, the user must obtain an accurate DTD for the target XML format. An SRS utility analyzes the DTD and creates print metaphor object templates for all of the XML elements and attributes in the target format. The user must then edit the resulting file to identify data sources for each element and attribute in the target format. The new SRS Visual Administration Tool includes a graphical user interface (GUI) that greatly simplifies the process of editing print metaphors.

Data objects can also be exported to a target XML format using an XSLT style sheet. This process is slightly less convenient than using print metaphors because it involves an extra conversion step. The user must first export the data to the generic XML format, then invoke an XSLT style sheet that converts the generic format to the target format.

XML print metaphors can also be used to transform data from any source into an XML format that is compatible with Microsoft's Office Web Components (OWC) [29]. This technology allows data to be displayed and manipulated using either an Excel spreadsheet or a pivot table embedded in the SRS browser. The pivot table component allows the user to do sophisticated sorting and grouping operations on the data. Both components have an "Export to Excel" button that allows the data to be easily saved to an Excel workbook file.
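A generic export of a loaded object to well-formed XML can be approximated in Python with the standard library. The sketch below is not the SRS export format and does not generate a DTD; the function `to_generic_xml`, the element names, and the sample object are illustrative assumptions, and dictionary keys are assumed to be valid XML element names.

```python
import xml.etree.ElementTree as ET

def to_generic_xml(tag, obj):
    """Convert nested dicts, lists, and scalars into an XML element tree."""
    elem = ET.Element(tag)
    if isinstance(obj, dict):
        for key, value in obj.items():
            elem.append(to_generic_xml(key, value))
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            elem.append(to_generic_xml("item", item))
    elif obj is not None:
        elem.text = str(obj)  # special characters are escaped on output
    return elem

# Hypothetical loaded object, for illustration only.
mutation = {
    "mutation": "Gly-12-Val",
    "position": 12,
    "protein": {"id": "P1", "description": "Example <kinase>"},
}
xml_text = ET.tostring(to_generic_xml("Mutation", mutation), encoding="unicode")
print(xml_text)
```

A document produced this way could then be passed to an XSLT processor to reach a specific target format, which corresponds to the two-step route described above.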

5.7 SCIENTIFIC ANALYSIS TOOLS


A key feature of SRS is its ability to integrate and use scientific analysis tools that can be applied to user data or to data resulting from database queries. The results generated by these tools can be stored, in turn, in tool-specific databanks, which can then be treated like any other SRS databank. The difference in these databanks is that they are user owned and constitute part of the user session with SRS. Every tool that can be integrated fulfills the following requirements:

• It can be launched with a UNIX command line.
• It receives input through command line arguments or input files.
• It writes output to files or to the standard output device.

In bioinformatics, hundreds of tools can be found with these properties. They include BLAST and FASTA for sequence similarity searching and Clustal [30] for multiple sequence alignment. A selection of these can be combined within an automated annotation pipeline to predict all genes for a genome or, for all proteins derived from these genes, the protein function annotation. Pipelines like this, together with their output, can be integrated as a single tool into SRS. Currently SRS supports about 200 tools, including BLAST, FASTA, stackPACK [31], and the majority of the tools in EMBOSS.

A tool can be added to SRS through meta-data by defining the SRS library with the syntax and data fields of the tool output, information about all tool input options, validation rules to test a parameter set specified by the user, pre-defined parameter sets, association to a data type, and so forth.

SRS has a growing set of pre-defined data types, such as protein sequence, which can be extended by the administrator. These data types can be associated with databanks that contain data of this type, tools that take it as input, and tools that produce it as output. This information can be used to build user interfaces that know which tools apply to which databases or workflows that feed tools with outputs of other tools.
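The three requirements for an integrable tool amount to a simple calling convention, which can be illustrated in Python with `subprocess`. This sketch is unrelated to how SRS launches tools internally; `run_tool` is a hypothetical helper, and the "tool" is a stand-in one-liner so that the example runs without any bioinformatics software installed.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_tool(command, input_text):
    """Run a command-line tool on an input file and return its standard output."""
    with tempfile.TemporaryDirectory() as workdir:
        input_path = Path(workdir) / "input.txt"
        input_path.write_text(input_text)  # input is passed as a file
        result = subprocess.run(
            command + [str(input_path)],   # launched with a command line
            capture_output=True, text=True, check=True,
        )
    return result.stdout                   # output arrives on standard output

# Stand-in "tool": a Python one-liner that reports the length of a sequence file.
# A real installation would name an actual program such as a BLAST executable.
TOOL = [sys.executable, "-c",
        "import sys; print(len(open(sys.argv[1]).read().strip()))"]
output = run_tool(TOOL, "MKTAYIAKQR\n")
print(output)
```

Because the wrapper only depends on the command line, an input file, and standard output, any program that follows the same convention could be substituted for the stand-in.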

5.7.1 Processing of Input and Output


Many tools require some pre-processing steps like setting up the run-time environment or conversion of the input sequence to a format they recognize, and post-processing such as cleanup of additional output or preserving input data values. All of these can be specified as part of the tool definition or by using pre-defined hooks for shell scripting.

Output can be processed at many levels, depending on the detail required. A simple text view of the output is enough for some applications, but where the results can be parsed for object loaders, this is much preferred. A key decision is the level at which an entry in the output should be returned by a later query. The entire output is usually a single entry for simple analysis of a sequence, but for search tools like BLAST it is preferable to represent each hit in the sequence databases as a separate entry so these can be linked to the source data. This seriously complicates the task of developing a parser as the entry information is split in several sections of a file, which can be several megabytes in size, but the increased flexibility more than justifies the extra effort.

An important implication of parsing and indexing tool outputs is that the respective tool libraries can become part of the SRS Universe if link information exists. For instance, all outputs from sequence similarity search tools can be linked to the sequence databank searched. Assuming that the search databank is connected to the SRS Universe, questions like "How many proteins from a certain protein family or metabolic pathway were found?" can be asked. Links also can be used to compare results obtained by different search tools; for instance, through a single SRS query a list of hits that were found by both FASTA and BLAST can be obtained.
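Treating each hit as a separate entry is what makes comparisons across tools straightforward. The Python sketch below illustrates this with a deliberately simple, invented report layout of one "accession score" pair per line; real BLAST and FASTA reports are far more complex, and `parse_hits` is a hypothetical helper.

```python
def parse_hits(report, tool):
    """Turn each hit line of a simplified report into a separate entry."""
    hits = []
    for line in report.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        accession, score = line.split()[:2]
        hits.append({"tool": tool, "accession": accession, "score": float(score)})
    return hits

# Invented reports in a deliberately simple "accession score" layout.
BLAST_REPORT = """# query: seq1
P1 250.0
P2 180.5
P4 90.0
"""
FASTA_REPORT = """# query: seq1
P2 310.0
P3 120.0
P4 75.5
"""

blast_hits = parse_hits(BLAST_REPORT, "blast")
fasta_hits = parse_hits(FASTA_REPORT, "fasta")

# Hits found by both tools: intersect on the accession of the matched entry.
common = {h["accession"] for h in blast_hits} & {h["accession"] for h in fasta_hits}
print(sorted(common))
```

The accession plays the role of the link to the searched databank, so the intersection corresponds to the "found by both FASTA and BLAST" query mentioned above.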


5.7.2 Batch Queues


Batch queues allow the administrator to specify where and when analyses can be run. Once batch queuing is enabled, it is possible to associate a tool with one or more queues with different characteristics. SRS provides support for several popular batch queuing systems such as LSF [32], the Network Queuing System (NQS) [33], the Distributed Queuing System (DQS) [34], or the SUN Grid Engine [35]. If a tool associated with a batch queue is launched, the job is submitted to this batch queue and the Web interface (see Section 5.8, Interfaces to SRS) reports the command line and provides a link to the job status page. This page displays the full list of batch runs. Selecting a completed run will bring up the results. When an application has not been assigned to a batch queue it will be run interactively.

5.8 INTERFACES TO SRS

Several interfaces to SRS exist, which provide full access to all its functions. These functions include:

• Creating and managing a user session
• Performing queries over the databanks
• Sorting query result sets
• Launching analysis tools
• Accessing meta-information

The Web interface is implemented as a Common Gateway Interface (CGI), a stateless server that is invoked for every request. However, using the APIs of SRS Objects it is possible to write stateful and multi-threaded servers.

5.8.1 The Web Interface


The most popular access to SRS is through a Web interface. With it the user creates a session that can be temporary or permanent. Within the session the results of many user actions are stored. These include queries, tool launches, and creations of views. The Web interface provides several query forms, one of which is the highly customizable canned query form that allows administrators to set up intuitive forms that enable an inexperienced user to launch even complex queries.


5.8.2 SRS Objects


SRS Objects is a package of object-oriented interfaces to SRS. It is designed for software developers who want to access the functionality of SRS from within their own object-oriented application. SRS Objects includes four language-specific APIs:
• C++
• Java
• Perl
• Python

SRS Objects also includes the SRS Common Object Request Broker Architecture (CORBA) Server, compliant with the CORBA 2.4 specification, which is generally referred to as SRSCS. The C++ API is the foundation both for the other three APIs (generated automatically from the C++ declarations using the public domain SWIG package) and for SRSCS, whose interfaces and operations wrap the C++ API classes and methods. As a consequence, in terms of SRS interaction, the four APIs and the SRS CORBA Server are almost identical and provide the same types and method signatures.

The SRS Objects package provides the following major functionalities:
• Creation of temporary or permanent SRS sessions and interaction with them
• Access to meta-information about the installed databank groups, databanks, tools, links, etc.
• Querying of databanks using the SRS query language
• Accessing databank entries in a variety of ways
• Launching of analysis tools and managing their results
• Use and dynamic creation of the SRS object loaders
• Working with the SRS Objects manager system to create and use dynamic types

In addition, SRS Objects hides from the developer tasks such as the initialization of the SRS system, SRS memory management, and SRS error handling.

Central to SRS Objects, as in the Web server, is the session object. It must always be created at the beginning of the program. As in the Web server, the session is associated with a directory where the results of the user actions are stored.


This means the Web client and a program written with SRS Objects can share a session and its contents.

The following program example in Perl illustrates the use of SRS Objects. It starts by creating a session object, then queries all Swiss-Prot entries with kinase in the description field, and finally prints a few attributes for each entry in the result.
    # Create the session object; it must exist before any other SRS call.
    $session = new Session;

    # Query Swiss-Prot for entries with "kinase" in the description field.
    $set = $session->query("[swissprot-description:kinase]", "");

    # Load each entry in the result set and print a few of its attributes.
    for ($i = 0; $i < $set->size(); ++$i) {
        $entry = $set->getEntry($i);
        $obj = $entry->load("SwissEntry");
        print "Accession: ", $obj->attrStr("Accession"), "\n";
        print "Description: ", $obj->attrStr("Description"), "\n";
        print "SeqLength: ", $obj->attrInt("SeqLength"), "\n";
    }
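Because the Python API is generated from the same C++ declarations, the same program would presumably follow the same call pattern in Python. The sketch below is not the SRS Objects API. It uses small in-memory stand-in classes whose method names are copied from the Perl example, so that the pattern (session, query, result set, entry, loader, typed attributes) can be run without an SRS installation. The sample entries, the accession values, the tiny query parser, and the meaning of the second query argument are all assumptions made for illustration.

```python
import re

# Hypothetical stand-ins that mimic the call pattern of the Perl example.
# They are NOT the SRS Objects classes; all entries below are made up.
SAMPLE_DATABANKS = {
    "swissprot": [
        {"Accession": "X00001", "Description": "Tyrosine-protein kinase (made-up)", "SeqLength": 512},
        {"Accession": "X00002", "Description": "Hemoglobin alpha (made-up)", "SeqLength": 142},
        {"Accession": "X00003", "Description": "Casein kinase II (made-up)", "SeqLength": 391},
    ]
}


class EntryObject:
    """Object produced by a loader; exposes typed attribute accessors."""

    def __init__(self, fields):
        self._fields = fields

    def attrStr(self, name):
        return str(self._fields[name])

    def attrInt(self, name):
        return int(self._fields[name])


class Entry:
    def __init__(self, fields):
        self._fields = fields

    def load(self, loader_name):
        # A real object loader would select and convert fields;
        # this stand-in ignores loader_name and exposes every field.
        return EntryObject(self._fields)


class ResultSet:
    def __init__(self, entries):
        self._entries = entries

    def size(self):
        return len(self._entries)

    def getEntry(self, index):
        return self._entries[index]


class Session:
    # Accepts only the simple "[databank-field:term]" form used in the text.
    _QUERY = re.compile(r"^\[(\w+)-(\w+):(\w+)\]$")

    def __init__(self, databanks):
        self._databanks = databanks

    def query(self, expression, second_arg=""):
        # second_arg is copied from the Perl call; the text does not say what it means.
        match = self._QUERY.match(expression)
        if match is None:
            raise ValueError(f"unsupported query: {expression!r}")
        bank, field, term = match.groups()
        hits = []
        for fields in self._databanks[bank]:
            by_lower = {key.lower(): value for key, value in fields.items()}
            if term.lower() in str(by_lower[field.lower()]).lower():
                hits.append(Entry(fields))
        return ResultSet(hits)


def kinase_report(session):
    """Mirror the Perl loop: query, load each entry, collect three attributes."""
    result = session.query("[swissprot-description:kinase]", "")
    lines = []
    for i in range(result.size()):
        obj = result.getEntry(i).load("SwissEntry")
        lines.append(
            f"{obj.attrStr('Accession')}\t{obj.attrInt('SeqLength')}\t{obj.attrStr('Description')}"
        )
    return lines
```

Calling `kinase_report(Session(SAMPLE_DATABANKS))` returns one tab-separated line per matching entry. Only the shape of the calls is meant to carry over to a real installation.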

5.8.3 SOAP and Web Services


Currently SRSCS is the only client-server interface to SRS. The others are in-process APIs and require the client application to be run on the same computer as the SRS server. CORBA is well suited for client-server applications on the same local area network (LAN), but it is of much more limited use across the Internet or an intranet. The simple object access protocol (SOAP) and the Web Services standard are much better suited for this type of application and are also very compatible with SRS functionality. A Web Services interface, which will provide the same functionality as the existing SRS Objects APIs, is currently being built.

5.9 AUTOMATED SERVER MAINTENANCE WITH SRS PRISMA

SRS Prisma is an extension package for SRS that can assist a site administrator with the sometimes onerous task of keeping the flat files, XML files, and indices for installed libraries as up to date as possible. This is done by comparing the status of the local files and indices with the corresponding data files at an appropriate remote FTP site. Any files or indices found to be out of date are replaced by downloading new data and/or by rebuilding the appropriate indices. In addition, SRS Prisma can be used as a more general data management tool, carrying out tasks such as reformatting newly downloaded data files, or creating new data files from existing SRS data files and indices. SRS Prisma can be used on an ad hoc basis by the administrator, but it is also ideal for daily scheduling to ensure that all databases are kept as up to date as possible. To assist the administrator in monitoring the completion status of any update processes, Prisma creates a complete archive of Web reports from up to 31 days prior, including easy-to-use graphical views.

In a situation where many databases need to be updated and where a large range of tasks is involved (from downloading, to indexing, to data reformatting), Prisma will determine the minimum number of tasks to be carried out and the dependencies between these tasks. For example, the building of a link index requires the "to" and "from" indices to be up to date. In such a case, the link task would be delayed until any required rebuilding of the "to" and "from" indices was complete. In the event that any of the required tasks fails, the architecture employed by Prisma ensures that any other tasks that do not depend on failed tasks are completed. The Prisma job will finish when all the tasks that can be completed have been done. For instance, if the download phase fails for SWISSNEW, the downloading and indexing of other databases should be unaffected.

Other important features of SRS Prisma are as follows:
• Prisma allows a failed job to be re-run from the point at which it failed, thereby minimizing the repetition of tasks, which can be time-consuming and processor-intensive. For example, if the download phase for a particular databank has failed due to a transient external problem (e.g., a problem accessing the relevant FTP site), the Prisma job can be re-run once this problem has been resolved. In such a case only the failed tasks and those dependent on them will be run.
• Tasks can be carried out in parallel to optimize performance on multiple-processor machines. This type of parallelization includes indexing/merging and downloading. If a databank consists of several files that can be indexed in parallel, Prisma will interleave downloading and indexing of these files.
• Offline processing of downloads and indexing ensures that during the updating job the SRS server continues to function in an uninterrupted way. The new databanks and indices are only moved online after completion of the entire job.
• Staged Prisma runs allow controlled and automated decision making to ensure robustness and minimized maintenance. This allows Prisma to bring the update job to an end even if individual tasks fail.
• An integral part of Prisma is a facility to check the quality of all integrated databanks. Every day every databank is checked for configuration errors, compliance of flat file data to the rules of the token server, the validity of the schema information that SRS holds for relational databanks, and so forth.
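The dependency handling described in this section (delay a task until its prerequisites are complete, keep running tasks that do not depend on a failure, and re-run only what failed plus its dependents) can be sketched in a few lines. This is a minimal illustration, not Prisma's implementation: the function `run_job`, the status strings, and the task names are hypothetical, and the sketch runs tasks sequentially rather than in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_job(dependencies, actions, already_done=()):
    """Run tasks in dependency order and report a status for each task.

    dependencies: maps each task name to the set of tasks it depends on.
    actions:      maps each task name to a callable returning True on success.
    already_done: tasks completed by an earlier run; they are not repeated.
    """
    status = {task: "done" for task in already_done}
    for task in TopologicalSorter(dependencies).static_order():
        if status.get(task) == "done":
            continue
        # A task whose prerequisite failed or was skipped cannot run.
        if any(status.get(dep) != "done" for dep in dependencies.get(task, ())):
            status[task] = "skipped"
            continue
        try:
            succeeded = bool(actions[task]())
        except Exception:
            succeeded = False
        status[task] = "done" if succeeded else "failed"
    return status


# Hypothetical task graph: a link index needs both databank indices.
DEPENDENCIES = {
    "download_swissnew": set(),
    "index_swissnew": {"download_swissnew"},
    "download_embl": set(),
    "index_embl": {"download_embl"},
    "link_embl_swissnew": {"index_embl", "index_swissnew"},
}
```

If `download_swissnew` fails on the first run, `index_swissnew` and `link_embl_swissnew` are marked as skipped while the EMBL tasks still complete. Passing the names of the completed tasks as `already_done` on the second run repeats only the failed task and the tasks that depend on it.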


Prisma is normally set up to run every night. It provides extensive reporting for all the jobs it ran during the night and the quality checks. Prisma archives the reports within a sliding window of 31 days.

Apart from relational databanks that can be accessed through a network connection, flat files and XML sources must be on the same LAN as the SRS server. This is expensive because the storage must be provided, but it guarantees stability and speed. SRS Prisma can keep all local copies of the databanks current in a completely automated fashion, checking every day the integrity of the system and the consistency of each databank and tool.
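Two of the bookkeeping steps in this section lend themselves to a short sketch: deciding which local files are out of date relative to the remote site, and keeping reports within a 31-day sliding window. The text does not say how Prisma compares file status, so the sketch assumes a simple comparison of modification dates. The function names, file names, and the inclusive cutoff are likewise assumptions.

```python
from datetime import date, timedelta


def out_of_date(local, remote):
    """Return the names of remote files that are newer than, or missing from, the local copy.

    local, remote: dicts mapping a file name to its last-modified date.
    """
    return sorted(
        name
        for name, remote_stamp in remote.items()
        if name not in local or local[name] < remote_stamp
    )


def reports_in_window(report_dates, today, window_days=31):
    """Keep only the reports that are at most window_days old (cutoff inclusive)."""
    cutoff = today - timedelta(days=window_days)
    return sorted(d for d in report_dates if d >= cutoff)


# Made-up example data.
local_files = {"a.dat": date(2003, 1, 1), "b.dat": date(2003, 1, 5)}
remote_files = {"a.dat": date(2003, 1, 3), "b.dat": date(2003, 1, 5), "c.dat": date(2003, 1, 2)}
stale = out_of_date(local_files, remote_files)  # a.dat is older locally, c.dat is missing
```

Files reported as stale would then feed download tasks, and the indices built from them would feed rebuild tasks.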

5.10 CONCLUSION

SRS can integrate the main sources of structured or semi-structured data: flat file databanks, XML files, relational databanks, and analysis tools. It provides technology to access these data, but also to transform them to a common mind-set. Data from the different sources will look and behave in exactly the same way, effectively shielding users from the complexities of the underlying data sources. This is also true for developers who use SRS APIs to write custom programs. SRS forms a truly scalable data and analysis tool integration platform onto which developers can build new databases, analysis tools, user views, and canned queries to tailor the environment to the needs of their company or institution.

Using bi-directional and high-speed links, SRS transforms the multitude of integrated databanks into a network, which paves the way for the full exploration of the relationships between the data sources (e.g., through cross-databank queries). The different sources can be combined using object loaders, which are able to build data objects by extracting data fields from the entire network. The federated approach to integration, in combination with the use of metadata, means that data can be maintained in its original format. This is important so there is no data loss due to normalization or reformatting. Object loaders can be designed either to provide standardized access to diverse data sources or to extract information transparently from across the entire databank network. SRS, therefore, is capable of supporting both the native structure of databanks and abstractions or unified versions. It supports data in their native format, but it also supports standards derived from them or imposed onto them.

SRS does not improve the data it integrates, nor does it create a super schema over the data, but with its linking capability and object loaders, it provides the perfect framework for the semantic integration of the data sources in bioinformatics. The inherent flexibility and extensibility of SRS means that bioinformaticians can use SRS as a solid foundation for development where they can incorporate their own data and knowledge of the scientific domain to provide a truly comprehensive view of genomic data.

REFERENCES

[1] D. Benson, I. Karsch-Mizrachi, D. Lipman, et al. "GenBank." Nucleic Acids Research 28, no. 1 (2000): 15-18. http://www.ncbi.nlm.nih.gov/Genbank.
[2] The Gene Ontology consortium. http://www.geneontology.com.
[3] MySQL open source database. http://www.mysql.com.
[4] Incyte LifeSeq Foundation database. http://www.incite.com/sequence.foundation.shtml.
[5] G. Stoesser, W. Baker, A. van den Broek, et al. "The EMBL Nucleotide Sequence Database." Nucleic Acids Research 30, no. 1 (2002): 21-26.
[6] C. O'Donovan, M. J. Martin, A. Gattiker, et al. "High-Quality Protein Knowledge Resources: SWISS-PROT and TrEMBL." Briefings in Bioinformatics 3, no. 3 (2002): 275-284.
[7] S. Altschul, W. Gish, W. Miller, et al. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215, no. 3 (October 1990): 403-410. http://www.ncbi.nlm.nih.gov/BLAST.
[8] W. R. Pearson. "Flexible Sequence Similarity Searching with the FASTA3 Program Package." In Methods in Molecular Biology 132 (2000): 185-219.
[9] P. Rice, I. Longden, and A. Bleasby. "EMBOSS: The European Molecular Biology Open Software Suite." Trends in Genetics 16, no. 6 (2000): 276-277. http://www.emboss.org.
[10] N. J. Mulder, R. Apweiler, T. K. Attwood, et al. "InterPro: An Integrated Documentation Resource for Protein Families, Domains and Functional Sites." Briefings in Bioinformatics 3, no. 3 (2002): 225-235. http://www.ebi.ac.uk/interpro/.
[11] E. M. Zdobnov, R. Lopez, R. Apweiler, et al. "The EBI SRS Server: New Features." Bioinformatics 18, no. 8 (2002): 1149-1150. http://srs.ebi.ac.uk.
[12] D. P. Kreil and T. Etzold. "DATABANKS: A Catalogue Database of Molecular Biology Databases." Trends in Biochemical Sciences 24, no. 4 (1999): 155-157.
[13] Celera Discovery System. http://www.celeradiscoverysystem.com.
[14] NetAffx Analysis Center from Affymetrix. http://www.netaffx.com.
[15] Thomson Derwent GENESEQ portal. http://www.derwent.com/geneseqweb/.
[16] Perl. http://www.perl.org, http://www.perl.com.
[17] K. Jensen and N. Wirth. Pascal User Manual and Report, second edition. Heidelberg, Germany: Springer-Verlag, 1974.
[18] T. Bray, J. Paoli, C. M. Sperberg-McQueen, et al. Extensible Markup Language (XML) 1.0: World Wide Web Consortium (W3C) Recommendation, 2nd edition, October 6, 2000. http://www.w3.org/TR/REC-xml/html.
[19] J. Clark. XSL Transformations (XSLT) Version 1.0: World Wide Web Consortium (W3C) Recommendation, November 16, 1999. http://www.w3.org/TR/xslt.
[20] Michael Kay's DTDGenerator Utility. http://users.iclway.co.uk/mhkay/saxon/saxon5-5-1/dtdgen.html.
[21] LabBook, Inc. BSML XML format. http://www.labbook.com.
[22] SUN Microsystems. Java Database Connectivity (JDBC). http://java.sun.com/products/jdbc.
[23] Oracle database. http://www.oracle.com.
[24] MySQL. http://www.mysql.com.
[25] Microsoft SQL Server. http://www.Microsoft.com/sql/.
[26] IBM. DB2 Database Software. http://www-3.ibm.com/softward/data/db2/.
[27] A. Bairoch. "The ENZYME Database in 2000." Nucleic Acids Research 28, no. 7 (2000): 304-305. http://www.expasy.ch/enzyme/.
[28] J. Clark and S. DeRose. XML Path Language (XPath) Version 1.0: World Wide Web Consortium (W3C) Recommendation, November 16, 1999. http://www.w3.org/TR/xpath.
[29] "The Office Web Components Add Analysis Tools to Your Web Page." Microsoft, 2003. http://office.microsoft.com/Assistance/2000/owebcoml.aspx.
[30] J. D. Thompson, T. J. Gibson, F. Plewniak, et al. "The ClustalX Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by Quality Analysis Tools." Nucleic Acids Research 24 (1997): 4876-4882.
[31] R. T. Miller, A. G. Christoffels, C. Gopalakrishnan, et al. "A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base." Genome Research 9, no. 11 (1999): 1143-1155.
[32] Platform LSF. http://www.platform.com/products/wm/LSF/.
[33] Network Queuing System (NQS). http://umbc7.umbc.edu/nqs/nqsmain.html.
[34] Distributed Queuing System (DQS). http://www.scri.fsu.edu/~pasko/dqs.html.
[35] Sun ONE Grid Engine. http://www.sun.com/software/gridware/.

CHAPTER 6

The Kleisli Query System as a Backbone for Bioinformatics Data Integration and Analysis

Jing Chen, Su Yun Chung, and Limsoon Wong

Biological data is characterized by a a wide wide range types from from the text Biological data is characterized by range of of data data types the plain plain text of laboratory laboratory records records and literature, to nucleic acid acid and and amino amino acid acid sequences, sequences, 3D of and literature, to nucleic 3D structures molecules, high-resolution high-resolution images tissues, diagrams diagrams of structures of of molecules, images of of cells cells and and tissues, of biochemical pathways regulatory networks, networks, to biochemical pathways and and regulatory to various various experimental experimental outputs outputs from technologies as as diverse diverse as as microarrays, microarrays, gels, gels, and spectrometry. These These from technologies and mass mass spectrometry. data are stored a large large number number of of databases databases across across the the Internet. Internet. In In addition data are stored in in a addition to to online online interfaces interfaces for for querying querying and and searching searching the the underlying underlying repository repository data, data, many many Web analysis of Web sites sites also also provide provide specific specific computational computational tools tools or or programs programs for for analysis of data. data. In In this this chapter chapter the the term term data data sources sources is is used used loosely loosely to to refer refer to to both both databases databases and and computational computational analysis analysis tools. tools. Until Until recently, recently, data data sources sources were were set set up up as as autonomous autonomous Web Web sites sites by by individual individual institutions institutions or or research research laboratories. laboratories. Data Data sources sources vary vary considerably considerably in in contents, contents, access methods, capacity, major difficulty access methods, capacity, query query processing, processing, and and services. 
services. The The major difficulty is is that the data elements in various public and private data sources are stored in ex that the data elements in various public and private data sources are stored in extremely heterogeneous tremely heterogeneous formats formats and and database database management management systems systems that that are are often often ad ad hoc, hoc, application-specific, application-specific, or or vendor-specific. vendor-specific. For For example, example, scientific scientific literature, literature, patents, patents, images, images, and and other other free-text free-text documents documents are are commonly commonly stored stored in in unstruc unstructured plain text tured formats formats like like plain text files, files, hypertext hypertext markup markup language language (HTML) (HTML) files, files, and and binary array gene binary files. files. Genomic, Genomic, micro microarray gene expression, expression, or or proteomic proteomic data data are are routinely routinely stored stored in in Excel Excel spreadsheets, spreadsheets, semi-structured semi-structured extensible extensible markup markup language language (XML), (XML), or or structured structured relational relational databases databases like like Oracle, Oracle, Sybase, Sybase, DB2, DB2, and and Informix. Informix. The The National Biotechnology Information Bethesda, Maryland, National Center Center for for Biotechnology Information (NCBI) (NCBI) in in Bethesda, Maryland, which which is is the the largest largest repository repository for for genetic genetic information, information, supplies supplies GenBank GenBank reports reports and GenPept reports in HTML format with an underlying highly nested and GenPept reports in HTML format with an underlying highly nested data data

1 48 148

The Kleisli Query System as a Backbone for Bioinformatics Data Integration


~ ~ , ~ ~ ~ ~ ~ i : ~ ~ ~ ~ ~ . ~ ~ ~ = ~ ~

model based on ASN.1 [1]. The computational analysis tools or applications suffer from a similar scenario: They require specific input and output data formats. The output of one program is not immediately compatible with the input requirement of other programs. For example, the most popular Basic Local Alignment Search Tool (BLAST) database search tool requires a specific format called FASTA for sequence input. In addition to data format variations, both the data content and data schemas of these databases are constantly changing in response to rapid advances in research and technology. As the amount of data and databases continues to grow on the Internet, it generates another bottleneck in information integration at the semantic level. There is a general lack of standards in controlled vocabulary for consistent naming of biomedical terms, functions, and processes within and between databases. In naming genes and proteins alone, there is much confusion. For example, a simple transcription factor, the CCAAT/enhancer-binding protein beta, is referred to by more than a dozen names in the public databases, including CEBPB, CRP2, and IL6DPB.

For research and discovery, the biologist needs access to up-to-date data and best-of-breed computational tools for data analyses. To achieve this goal, the ability to query across multiple data sources is not enough. It also demands the means to transform and transport data through various computational steps seamlessly. For example, to investigate the structure and function of a new protein, users must integrate information derived from sequence, structure, protein domain prediction, and literature data sources. Should the steps to prepare the data sets between the output of one step to the input of the next step have to be carried out manually, which requires some level of programming work (such as writing Perl scripts), the process would be very inefficient and slow.
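The kind of hand-written glue code mentioned above can be illustrated with a minimal Python sketch that renders a sequence as a FASTA record (a header line starting with ">" followed by the wrapped residues). The function name `to_fasta`, the 60-column width, and the sample sequence are assumptions made for illustration only; they are not part of Kleisli or BLAST.

```python
import textwrap


def to_fasta(identifier, description, sequence, width=60):
    """Render one sequence as a FASTA record.

    The record is a '>' header line followed by the residues
    wrapped to `width` columns.
    """
    header = f">{identifier} {description}".rstrip()
    body = "\n".join(textwrap.wrap(sequence, width))
    return header + "\n" + body + "\n"


# Toy input: an invented identifier and an invented 75-residue sequence.
print(to_fasta("seq1", "toy protein", "MKT" * 25), end="")
```

Every pair of tools with incompatible formats needs a small converter of this sort, which is the manual effort the chapter argues a query system should remove.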
In short, many bioinformatics problems require access to data sources that are large, highly heterogeneous and complex, constantly evolving, and geographically dispersed. Solutions to these problems usually involve many steps and require information to be passed smoothly and usually to be transformed between the steps. The Kleisli system is designed to handle these requirements directly by providing a high-level query language, simplified SQL (sSQL), that can be used to express complicated transformations across multiple data sources in a clear and simple way.¹ The design and implementation of the Kleisli system are heavily influenced by functional programming research, as well as database query language research. Kleisli's high-level query language, sSQL, can be considered a functional

1. Earlier versions of the Kleisli system supported only a query language based on comprehension syntax called Collection Programming Language (CPL) [2]. Now, both CPL and sSQL are available.


programming language² that has a built-in notion of bulk data types³ suitable for database programming and has many built-in operations required for modern bioinformatics. Kleisli is implemented on top of the functional programming language Standard ML of New Jersey (SML). Even the data format Kleisli uses to exchange information with the external world is derived from ideas in type inference in functional programming languages.

This chapter provides a description of the Kleisli system and a discussion of various aspects of the system, such as data representation, query capability, optimizations, and user interfaces. The materials are organized as follows: Section 6.1 introduces Kleisli with a well-known example. Section 6.2 presents an overview of the Kleisli system. Section 6.3 discusses the data model, data representation, and exchange format of Kleisli. Section 6.4 gives more example queries in Kleisli and comments on the expressive power of its core query language. Section 6.5 illustrates Kleisli's ability to use flat relational databases to store complex objects transparently. Section 6.6 lists the kind of data sources supported by the Kleisli system and shows the ease of implementing wrappers for Kleisli. Section 6.7 gives an overview of the various types of optimizations performed by the Kleisli query optimizer. Section 6.8 describes both the Open Database Connectivity (ODBC)- or Java Database Connectivity (JDBC)-like programming interfaces to Kleisli in Perl and Java, as well as its Discovery Builder graphical user interface. Section 6.9 contains a brief survey of other well-known proposals for bioinformatics data integration.

6.1 MOTIVATING EXAMPLE


Before discussing the guts of the Kleisli system, the very first bioinformatics data integration problem solved using Kleisli is presented in Example 6.1.1. The query was implemented in Kleisli in 1994 [5] and solved one of the so-called "impossible" queries of a U.S. Department of Energy Bioinformatics Summit Report published in 1993 [6].

2. Functional programming languages are programming languages that emphasize a particular paradigm of programming technique known as functional programming [3, 4]. In this paradigm, all programs are expressed as mathematical functions and are generally free from side effects. Examples of functional programming languages are LISP, Haskell, and SML. Some fundamental ideas in functional programming languages, such as garbage collection, have also been borrowed by other modern programming languages such as Java.
3. Bulk data types refer to data types that are collections of objects. Examples of bulk data types are sets, bags, lists, and arrays.


Example 6.1.1 The query was to "find for each gene located on a particular cytogenetic band of a particular human chromosome as many of its non-human homologs as possible." Basically, the query means "for each gene in a particular position in the human genome, find deoxyribonucleic acid (DNA) sequences from non-human organisms that are similar to it."

In 1994, the main database containing cytogenetic band information was the Genome Database (GDB) [7], which was a Sybase relational database. To find homologs, the actual DNA sequences were needed, and the ability to compare them was also needed. Unfortunately, that database did not keep actual DNA sequences. The actual DNA sequences were kept in another database called GenBank [8]. At the time, access to GenBank was provided through the ASN.1 version of Entrez [9], which was at the time an extremely complicated retrieval system. Entrez also kept precomputed homologs of GenBank sequences.
So, the evaluation of this query needed the integration of GDB (a relational database located in Baltimore, Maryland) and Entrez (a non-relational database located in Bethesda, Maryland). The query first extracted the names of genes on the desired cytogenetic band from GDB, then accessed Entrez for homologs of these genes. Finally, these homologs were filtered to retain the non-human ones. This query was considered "impossible" as there was at that time no system that could work across the bioinformatics sources involved due to their heterogeneity, complexity, and geographical locations. Given the complexity of this query, the sSQL solution below is remarkably short.

    sybase-add (name: "gdb", ...);
    create view locus from locus_cyto_location using gdb;
    create view eref from object_genbank_eref using gdb;
    select accn: g.genbank_ref, nonhuman-homologs: H
    from locus c, eref g,
         {select u
          from na-get-homolog-summary(g.genbank_ref) u
          where not(u.title like "%Human%") and
                not(u.title like "%H.sapien%")} H
    where c.chrom_num = "22" and
          g.object_id = c.locus_id and
          not (H = { });

The first three lines connect to GDB and map two tables in GDB to Kleisli. After that, these two tables could be referenced within Kleisli as if they were two locally defined sets, locus and eref. The next few lines extract from these tables the


accession numbers of genes on Chromosome 22, use the Entrez function na-get-homolog-summary to obtain their homologs, and filter these homologs for non-human ones. Notice that the from-part of the outer select-construct is of the form {select u ...} H. This means that H is the entire set returned by select u ..., which makes it possible to manipulate and return all the non-human homologs as a single set H.

Besides the obvious smoothness of integration of the two data sources, this query is also remarkably efficient. On the surface, it seems to fetch the locus table in its entirety once and the eref table in its entirety n times from GDB (a naive evaluation of the comprehension would be two nested loops iterating over these two tables). Fortunately, in reality, the Kleisli optimizer is able to migrate the join, selection, and projections on these two tables into a single efficient access to GDB using the optimizing rules from a later section of this chapter. Furthermore, the accesses to Entrez are also automatically made concurrent.

Since this query, Kleisli and its components have been used in a number of bioinformatics projects such as GAIA at the University of Pennsylvania,⁴ Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) at the University of Manchester [11, 12], and FIMM at Kent Ridge Digital Labs [13]. It has also been used in constructing databases by pharmaceutical/biotechnology companies such as SmithKline Beecham, Schering-Plough, GlaxoWellcome, Genomics Collaborative, and Signature Biosciences. Kleisli is also the backbone of the Discovery Hub product of geneticXchange Inc.⁵
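The naive nested-loop reading of the Example 6.1.1 query discussed earlier in this section can be sketched in Python. This is only an illustration under stated assumptions: the rows, accession strings, and the `get_homolog_summary` stand-in are invented toy data rather than GDB or Entrez content, and the code does not claim to show how Kleisli itself evaluates the query.

```python
# Toy stand-ins for the two GDB views; every value here is invented.
locus = [
    {"locus_id": 1, "chrom_num": "22"},
    {"locus_id": 2, "chrom_num": "7"},
]
eref = [
    {"object_id": 1, "genbank_ref": "ACC001"},
    {"object_id": 2, "genbank_ref": "ACC002"},
]

# Hypothetical replacement for the Entrez function na-get-homolog-summary.
HOMOLOGS = {
    "ACC001": [{"title": "Human gene X"}, {"title": "Mouse gene X"}],
    "ACC002": [{"title": "Rat gene Y"}],
}


def get_homolog_summary(accession):
    return HOMOLOGS.get(accession, [])


def is_human(title):
    # Mirrors the two `like` patterns of the sSQL query.
    return "Human" in title or "H.sapien" in title


def nonhuman_homologs_naive(chrom):
    results = []
    for c in locus:          # outer loop: scan locus once
        for g in eref:       # inner loop: rescan eref for every locus row
            # The join test runs before the lookup to keep the toy calls few.
            if c["chrom_num"] == chrom and g["object_id"] == c["locus_id"]:
                h = [u for u in get_homolog_summary(g["genbank_ref"])
                     if not is_human(u["title"])]
                if h:        # corresponds to: not (H = { })
                    results.append({"accn": g["genbank_ref"],
                                    "nonhuman_homologs": h})
    return results


print(nonhuman_homologs_naive("22"))
```

The sketch fixes only the meaning of the result. The optimizer described in the text would instead ship the join and selection to GDB as one request and issue the Entrez lookups concurrently.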

6.2 APPROACH

The approach taken by the Kleisli system is illustrated by Figure 6.1. Kleisli is positioned as a mediator system encompassing a complex object data model, a high-level query language, and a powerful query optimizer. It runs on top of a large number of lightweight wrappers for accessing various data sources. There are also a number of application programming interfaces that allow Kleisli to be accessed in an ODBC- or JDBC-like fashion in various programming languages for various applications.

The Kleisli system is extensible in several ways. It can be used to support several different high-level query languages by replacing its high-level query language module. Currently, Kleisli supports a comprehension syntax-based language called CPL [2, 14, 15] and a nested relationalized version of SQL called sSQL.

4. Information about the GAIA project is available at http://www.cbil.upenn.edu/gaia2/gaia and [10].
5. Information about Discovery Hub is available at http://www.geneticxchange.com.


[Figure 6.1. Labels recoverable from the diagram: APIs; Ontologies; Vocabularies; The Kleisli Query Engine (Query Language, Complex Object Data Model, Query Optimizer); Meta-data; Wrappers; Sybase.]

FIGURE 6.1 Kleisli positioned as a mediator.

Only sSQL is used throughout this chapter. The Kleisli system can also be used to support many different types of external data sources by adding new wrappers, which forward Kleisli's requests to these sources and translate their replies into Kleisli's exchange format. These wrappers are lightweight and new wrappers are generally easy to develop and insert into the Kleisli system. The optimizer of the Kleisli system can also be customized by different rules and strategies.

When a query is submitted to Kleisli, it is first processed by the high-level query language module, which translates it into an equivalent expression in the abstract calculus Nested Relational Calculus (NRC). NRC is based on the calculus described in Buneman's "Principles of Programming with Complex Objects and Collection Types" [16] and was chosen as the internal query representation because it is easy to manipulate and amenable to machine analysis. The NRC expression is then analyzed to infer the most general valid type for the expression and is passed to the query optimizer. Once optimized, the NRC expression is then compiled into calls to a library of routines for complex objects underlying the complex object data model. The resulting compiled code is then executed, accessing drivers and external primitives as needed through pipes or


shared memory. Each of these components is considered in further detail in the next several sections.
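The wrapper idea from this section, in which each wrapper forwards a request to its source and translates the reply into one common representation, can be sketched as follows. This is a hypothetical design written for illustration: the `Mediator` class, the wrapper functions, and the toy replies are all assumptions and do not reproduce Kleisli's actual wrapper interface or exchange format.

```python
from typing import Callable, Dict, List

# In this sketch a wrapper is simply a function from a request string to a
# list of records (dicts). The names and the registry design are hypothetical.
Wrapper = Callable[[str], List[dict]]


class Mediator:
    def __init__(self) -> None:
        self._wrappers: Dict[str, Wrapper] = {}

    def add_wrapper(self, source: str, wrapper: Wrapper) -> None:
        """Register a new data source without changing the engine."""
        self._wrappers[source] = wrapper

    def fetch(self, source: str, request: str) -> List[dict]:
        """Forward the request and return the already translated reply."""
        return self._wrappers[source](request)


def toy_relational_wrapper(request: str) -> List[dict]:
    # Pretend reply from a relational source: rows as (id, chromosome) tuples.
    rows = [(1, "22"), (2, "7")]
    return [{"locus_id": r[0], "chrom_num": r[1]} for r in rows]


def toy_flat_file_wrapper(request: str) -> List[dict]:
    # Pretend reply from a flat file: one "key=value;key=value" line per record.
    lines = ["acc=ACC001;title=Mouse gene X"]
    return [dict(field.split("=", 1) for field in line.split(";"))
            for line in lines]


mediator = Mediator()
mediator.add_wrapper("gdb", toy_relational_wrapper)
mediator.add_wrapper("flat", toy_flat_file_wrapper)
print(mediator.fetch("gdb", "locus"))
print(mediator.fetch("flat", "all"))
```

Because both wrappers return the same shape of data, code above the mediator does not need to know which kind of source answered.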

6.3 DATA MODEL AND REPRESENTATION


The data model, data representation, and data exchange format of the Kleisli system are presented in this section. The data model underlying the Kleisli system is a complex object type system that goes beyond the sets of atomic records or flat relations type systems of relational databases [17]. It allows arbitrarily nested records, sets, lists, bags, and variants. A variant is also called a tagged union type, and it represents a type that is either this or that. The collection or bulk types (sets, bags, and lists) are homogeneous. To mix objects of different types in a set, bag, or list, it is necessary to inject these objects into a variant type.

In a relational database, the sole bulk data type is the set. Furthermore, this set is only allowed to contain records where each field is allowed to contain an atomic object such as a number or a string. Having such a restricted bulk data type presents at least two problems in real-life applications. First, the particular bulk data type may not be a natural model of real data. Second, the particular bulk data type may not be an efficient model of real data. For example, when restricted to the flat relational data model, the GenPept report in Example 6.3.1 must be split into many separate tables to be stored in a relational database without loss. The resulting multi-table representation of the GenPept report is conceptually unnatural and operationally inefficient. A person querying the resulting data must pay the mental overhead of understanding both the original GenPept report and its badly fragmented multi-table representation. The user may also have to pay the performance overhead of having to reassemble the original GenPept report from its fragmented multi-table representation to answer queries.

As another example, when limited to the set type only, even if nesting of sets is allowed, one may not be able to model MEDLINE reports naturally. A MEDLINE report records information on a published paper, such as its title and its authors. The order in which the authors are listed is important. With only sets, one must pair each author explicitly with a number representing his or her position in order of appearance. With the list data type, this cumbersome explicit pairing with position becomes unnecessary.
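The MEDLINE author-order point can be made concrete with a short Python sketch. The author names are placeholders invented for the example; only the contrast between a set with explicit positions and a list is of interest.

```python
# Placeholder author names in the order they would appear in print.
authors_in_print_order = ["Lee", "Adams", "Chen"]

# A plain set forgets the order: both orderings give the same set.
print(set(authors_in_print_order) == set(reversed(authors_in_print_order)))

# With sets only, the order must be encoded by hand as (position, author) pairs.
authors_as_set = {(i, name)
                  for i, name in enumerate(authors_in_print_order, start=1)}

# Recovering the printed order then needs an explicit sort on the position.
recovered = [name for _, name in sorted(authors_as_set)]

# With a list type, the order is part of the value itself.
authors_as_list = list(authors_in_print_order)

print(recovered == authors_as_list)
```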
Example 6.3.1 The GenPept report is the format chosen by NCBI to represent information on amino acid sequence. While an amino acid sequence is a string of characters, certain regions and positions of the string, such as binding sites and domains, are of special biological interest. The feature table of a GenPept report is the part of the GenPept report that documents the positions of these regions of


special biological interest, as well as annotations or comments on these regions. The following type represents the feature table of a GenPept report from Entrez [9].

    (#uid: num, #title: string,
     #accession: string, #feature: {
       (#name: string, #start: num, #end: num,
        #anno: [(#anno_name: string, #descr: string)])})

It is an interesting type because one of its fields (#feature) is a set of records, one of whose fields (#anno) is in turn a list of records. More precisely, it is a record with four fields: #uid, #title, #accession, and #feature. The first three of these store values of types num, string, and string respectively. The #uid field uniquely identifies the GenPept report. The #feature field is a set of records, which together form the feature table of the corresponding GenPept report. Each of these records has four fields: #name, #start, #end, and #anno. The first three of these have types string, num, and num respectively. They represent the name, start position, and end position of a particular feature in the feature table. The #anno field is a list of records. Each of these records has two fields, #anno_name and #descr, both of type string. These records together represent all annotations on the corresponding feature.
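To make the earlier fragmentation argument concrete, a value shaped like the feature-table type can be split into flat tables in Python. This is a sketch under stated assumptions: the report contents are toy data, and the three-table layout with the synthetic keys `f_id` and `pos` is one possible flattening chosen for illustration, not a schema taken from any real database.

```python
# Nested value shaped like the feature-table type; field values are toy data.
report = {
    "uid": 1,
    "title": "toy protein",
    "accession": "X00001",
    "feature": [  # conceptually a set of records
        {"name": "source", "start": 0, "end": 99,
         "anno": [  # a list of records, so order matters
             {"anno_name": "organism", "descr": "toy organism"},
             {"anno_name": "note", "descr": "toy note"},
         ]},
    ],
}


def flatten(rep):
    """Split one nested report into three flat tables linked by keys."""
    report_rows = [(rep["uid"], rep["title"], rep["accession"])]
    feature_rows, anno_rows = [], []
    for f_id, f in enumerate(rep["feature"]):
        feature_rows.append((rep["uid"], f_id, f["name"], f["start"], f["end"]))
        for pos, a in enumerate(f["anno"]):
            # `pos` preserves the list order that a flat set would lose.
            anno_rows.append((rep["uid"], f_id, pos, a["anno_name"], a["descr"]))
    return report_rows, feature_rows, anno_rows


reports, features, annos = flatten(report)
print(len(reports), len(features), len(annos))
```

Answering a question about one report now means joining three tables on `uid` and `f_id`, which is the reassembly overhead described in the text.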
In general, the types are freely formed by the syntax:

    t ::= num | string | bool | {t} | {|t|} | [t] | (l1: t1, ..., ln: tn) | <l1: t1, ..., ln: tn>

Here num, string, and bool are the base types. The other types are constructors and build new types from existing types. The types {t}, {|t|}, and [t] respectively construct set, bag, and list types from type t. The type (l1: t1, ..., ln: tn) constructs record types from types t1, ..., tn. The type <l1: t1, ..., ln: tn> constructs variant types from types t1, ..., tn. The flat relations of relational databases are basically sets of records, where each field of the record is a base type; in other words, relational databases have no bags, no lists, no variants, no nested sets, and no nested records. Values of these types can be represented explicitly and exchanged as follows, assuming that the instances of e are values of appropriate types: (l1: e1, ..., ln: en) for records; <l: e> for variants; {e1, ..., en} for sets; {|e1, ..., en|} for bags; and [e1, ..., en] for lists.
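The type grammar above can be mirrored by a small Python model, which also gives a direct test for the "flat relation" condition just stated. The class names, the `show` printer, and the `is_flat_relation` helper are assumptions introduced for this sketch. The `#` prefix on labels follows the notation of Example 6.3.1.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Base:       # num | string | bool
    name: str


@dataclass(frozen=True)
class SetOf:      # {t}
    elem: "Type"


@dataclass(frozen=True)
class BagOf:      # {|t|}
    elem: "Type"


@dataclass(frozen=True)
class ListOf:     # [t]
    elem: "Type"


@dataclass(frozen=True)
class Record:     # (l1: t1, ..., ln: tn)
    fields: tuple  # tuple of (label, Type) pairs


@dataclass(frozen=True)
class Variant:    # <l1: t1, ..., ln: tn>
    fields: tuple  # tuple of (label, Type) pairs


Type = Union[Base, SetOf, BagOf, ListOf, Record, Variant]


def show(t):
    """Print a type in the notation of the grammar."""
    if isinstance(t, Base):
        return t.name
    if isinstance(t, SetOf):
        return "{" + show(t.elem) + "}"
    if isinstance(t, BagOf):
        return "{|" + show(t.elem) + "|}"
    if isinstance(t, ListOf):
        return "[" + show(t.elem) + "]"
    body = ", ".join(f"#{label}: {show(ft)}" for label, ft in t.fields)
    return "(" + body + ")" if isinstance(t, Record) else "<" + body + ">"


def is_flat_relation(t):
    """True for a set of records whose fields are all base types."""
    return (isinstance(t, SetOf) and isinstance(t.elem, Record)
            and all(isinstance(ft, Base) for _, ft in t.elem.fields))


NUM, STRING = Base("num"), Base("string")
anno = ListOf(Record((("anno_name", STRING), ("descr", STRING))))
feature = SetOf(Record((("name", STRING), ("start", NUM),
                        ("end", NUM), ("anno", anno))))
flat = SetOf(Record((("locus_id", NUM), ("chrom_num", STRING))))
print(show(feature))
print(is_flat_relation(feature), is_flat_relation(flat))
```

The #feature type fails the flatness test because of its nested #anno list, while a table-like set of records with only base-type fields passes it.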
Example 6.3.2 Part of the feature table of GenPept report 131470, a tyrosine phosphatase 1C sequence, is shown in the following.


    (#uid: 131470, #accession: "131470", #title: "... (PTP-1C)...",
     #feature: {(#name: "source", #start: 0, #end: 594,
                 #anno: [(#anno_name: "organism", #descr: "Mus musculus"),
                         (#anno_name: "db_xref", #descr: "taxon:10090")]),
                ...})
The particular feature goes from amino acid 0 to amino acid 594, which is actually the entire sequence, and has two annotations: The first annotation indicates that this amino acid sequence is derived from a mouse DNA sequence. The second is a cross reference to the NCBI taxonomy database.

The schemas and structures of all popular bioinformatics databases, flat files, and software are easily mapped into this data model. At the high end of data structure complexity are Entrez [9] and AceDB [18], which contain deeply nested mixtures of sets, bags, lists, records, and variants. At the low end of data structure complexity are the relational database systems [17] such as Sybase and Oracle, which contain flat sets of records. Currently, Kleisli gives access to more than 60 of these and other bioinformatics sources.
The reason for this ease of mapping bioinformatics sources to Kleisli's data model is that they are all inherently composed of combinations of sets, bags, lists, records, and variants. Kleisli's data model directly and naturally maps sets to sets, bags to bags, lists to lists, records to records, and variants to variants without having to make any (type) declaration beforehand.

The last point deserves further consideration. In a dynamic, heterogeneous environment such as that of bioinformatics, many different database and software systems are used. They often do not have anything that can be thought of as an explicit database schema. Further compounding the problem is that research biologists demand flexible access and queries in ad hoc combinations. Thus, a query system that aims to be a general integration mechanism in such an environment must satisfy four conditions. First, it must not count on the availability of schemas.
It must be able to compile any query submitted based solely on the structure of that query. Second, it must have a data model that the external database and software systems can easily translate to without doing a lot of type declarations. Third, it must shield existing queries from evolution of the external sources as much as possible. For example, an extra field appearing in an external database table must not necessitate the recompilation or rewriting of existing queries over that data source. Fourth, it must have a data exchange format that is straightforward to use so it does not demand too much programming effort or contortion to capture the variety of structures of output from external databases and software.

Three of these requirements are addressed by features of sSQL's type system. sSQL has polymorphic record types that allow queries such as the following to be expressed:


    create function get-rich-guys (R) as
      select x.name from R x where x.salary > 1000;

which defines a function that returns names of people in R earning more than $1000. This function is applicable to any R that has at least the name and the salary fields, thus allowing the input source some freedom to evolve.

In addition, sSQL does not require any type to be declared at all. The type and meaning of any sSQL program can always be completely inferred from its structure without the use of any schema or type declaration. This makes it possible to plug in any data source logically without doing any form of schema declaration, at a small acceptable risk of run-time errors if the inferred type and the actual structure are not compatible. This is an important feature because most biological data sources do not have explicit schemas, while a few have extremely large schemas that take many pages to write down (for example, the ASN.1 schema of Entrez [1]), making it impractical to have any form of declaration.
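The behavior of the get-rich-guys function can be imitated in Python, which is a loose analogy only: Python checks field access at run time, whereas sSQL infers a polymorphic record type from the query text. The function name get_rich_guys and the sample sources below are made up for illustration.

```python
def get_rich_guys(R):
    """Return the names of people in R earning more than 1000.

    Only the 'name' and 'salary' fields are touched, so records may carry
    any number of additional fields.
    """
    return {x["name"] for x in R if x["salary"] > 1000}


# The source as first seen, and the same source after it gained a field.
old_source = [{"name": "Ann", "salary": 1500}, {"name": "Bob", "salary": 900}]
new_source = [{"name": "Ann", "salary": 1500, "dept": "lab"},
              {"name": "Bob", "salary": 900, "dept": "office"}]

# The query needs no change when the source evolves.
print(get_rich_guys(old_source) == get_rich_guys(new_source))

# A source lacking a required field fails only when the query is run,
# which mirrors the run-time risk mentioned in the text.
try:
    get_rich_guys([{"name": "Cy"}])
except KeyError as missing:
    print("missing field:", missing)
```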
As for the fourth requirement, a data exchange format is an agreement on how to lay out data in a data stream or message when the data is exchanged between two systems. In this context, it is the format for exchanging data between Kleisli and all the bioinformatics sources. The data exchange format of Kleisli corresponds one-to-one with Kleisli's data model. It provides for records, variants, sets, bags, and lists; and it allows these data types to be composed freely. In fact, the data exchange format completely adopts the syntax of the data representation described earlier and illustrated in Example 6.3.2. This representation has the interesting property of not generating ambiguity. For instance, a set symbol { represents a set, whereas a parenthesis ( denotes a record. In short, this data exchange format is self describing. The basic specification of the data exchange format of Kleisli is summarized in Figure 6.2.
For a more detailed account, please see Wong's paper on Kleisli from the 2000 IEEE Symposium on Bioinformatics and Bio-engineering [19].

A self-describing exchange format is one in which there is no need to define in advance the structure of the objects being exchanged. That is, there is no fixed schema and no type declaration. In a sense, each object being exchanged carries its own description. A self-describing format has the important property that, no matter how complex the object being exchanged is, it can be easily parsed and reconstructed without any schema information. The ISO ASN.1 standard [20] on open systems interconnection illustrates this advantage by contrast: the schema that describes the structure of ASN.1 objects must be parsed before the objects themselves can be parsed, making it necessary to write two complicated parsers instead of one simple parser.

Data Type            Data Layout                  Remarks

Unit                 ()
Booleans             true
                     false
Numbers              123                          Positive numbers
                     123.123
                     ~123                         Negative numbers
                     ~123.123
Strings              "a string"                   A string is put inside double quotes
Records              (#l1: O1, ..., #ln: On)      A record is put inside round brackets. The label-:-value triplets enumerate the fields of the record
Variants             <#l: O>                      A variant is put inside angle brackets
Sets                 {O1, ..., On}                A set is put inside curly brackets
Bags                 {| O1, ..., On |}            A bag is put inside curly-bar brackets
Lists                [O1, ..., On]                A list is put inside square brackets
User-defined types   longitude "50E"              A user-defined type is preceded by its name
Errors               error "it goofed"            An error message is preceded by error

Figure 6.2 The basic form of the Kleisli Exchange Format. Punctuations and indentations are not significant. The semicolon indicates the end of a complex object. Multiple complex objects can be laid out in the same stream.
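To make the "one simple parser" claim concrete, the following Python sketch parses a subset of the layout in Figure 6.2 without consulting any schema. It is an illustration under assumptions, not Kleisli's parser: the function names, the tagged-tuple results for sets, bags, lists, and variants, and the decision to skip user-defined types and error values are all choices made here, and malformed input is only minimally checked.

```python
import re

# One token per match: brackets, punctuation, strings, #labels, numbers, booleans.
TOKEN = re.compile(
    r'\s*(\{\||\|\}|[(){}\[\]<>,:;]|"[^"]*"|#[\w-]+|~?\d+(?:\.\d+)?|true|false)'
)
CLOSERS = {"{": ("}", "set"), "{|": ("|}", "bag"), "[": ("]", "list")}


def tokenize(text):
    """Split the input into tokens, ignoring whitespace and indentation."""
    text, pos, tokens = text.rstrip(), 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_value(toks, i=0):
    """Return (value, next_index); the opening token alone decides the kind."""
    t = toks[i]
    if t in ("true", "false"):
        return t == "true", i + 1
    if t[0] == '"':
        return t[1:-1], i + 1
    if t[0] == "~" or t[0].isdigit():          # ~ marks a negative number
        s = t.replace("~", "-")
        return (float(s) if "." in s else int(s)), i + 1
    if t == "(":
        if toks[i + 1] == ")":                 # () is the unit value
            return None, i + 2
        fields, i = {}, i + 1
        while toks[i] != ")":                  # record: #label : value, ...
            label = toks[i][1:]
            if toks[i + 1] != ":":
                raise ValueError("expected ':' after label")
            fields[label], i = parse_value(toks, i + 2)
            if toks[i] == ",":
                i += 1
        return fields, i + 1
    if t == "<":                               # variant: <#label : value>
        value, j = parse_value(toks, i + 3)
        return ("variant", toks[i + 1][1:], value), j + 1
    if t in CLOSERS:                           # set, bag, or list
        closer, kind = CLOSERS[t]
        items, i = [], i + 1
        while toks[i] != closer:
            item, i = parse_value(toks, i)
            items.append(item)
            if toks[i] == ",":
                i += 1
        return (kind, items), i + 1
    raise ValueError(f"unexpected token {t!r}")


def parse(text):
    """Parse one complex object, optionally terminated by a semicolon."""
    toks = tokenize(text)
    value, i = parse_value(toks)
    if i < len(toks) and toks[i] == ";":
        i += 1
    if i != len(toks):
        raise ValueError("trailing input after object")
    return value


print(parse('(#name: "source", #start: 0, #anno: [(#descr: "Mus musculus")]);'))
```

The parser never looks ahead more than one token to decide what it is reading, which is the practical meaning of the format being unambiguous.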


6.4 QUERY CAPABILITY


sSQL is the primary query language of Kleisli used in this chapter. It is based on the de facto commercial database query language SQL, except for extensions made to cater to the nested relational model and the federated heterogeneous data sources. Rather than giving the complete syntax, sSQL is illustrated with a few examples on a set of feature tables DB.
Example 6.4.1 The query below "extracts the titles and features of those elements of a data source DB whose titles contain 'tyrosine' as a substring."
    create function get-title-from-featureTable (DB) as
      select title: x.title, feature: x.feature
      from DB x
      where x.title like "%tyrosine%";

This query is a simple project-select query. A project-select query is a query that operates on one (flat) relation or set. Thus, the transformation such a query can perform is limited to selecting some elements of the relation and extracting or projecting some fields from these elements. Except for the fact that the source data and the result may not be in first normal form, these queries can be expressed in a relational query language. However, sSQL can perform more complex restructurings such as nesting and unnesting not found in SQL, as shown in the following examples.
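For readers more familiar with general-purpose languages, the project-select pattern can be mimicked in Python as a single comprehension. This is a rough analogy only: the function name and sample data are invented here, dicts stand in for records, and a list stands in for the result set.

```python
def get_title_from_feature_table(DB):
    """Select entries whose title contains 'tyrosine'; project title and feature."""
    return [
        {"title": x["title"], "feature": x["feature"]}   # projection
        for x in DB
        if "tyrosine" in x["title"]                      # selection
    ]


DB = [
    {"title": "tyrosine phosphatase 1C", "uid": 131470, "feature": []},
    {"title": "serine protease", "uid": 1, "feature": []},
]
print(get_title_from_feature_table(DB))
```

Note that the feature field is carried through untouched, nested structure and all, which is why the result need not be in first normal form.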

Example 6.4.2 The following query flattens the source DB completely. l2s is a function that converts a list into a set.
    create function flatten-featureTable (DB) as
      select title: x.title, feature: f.name, start: f.start, end: f.end,
             anno-name: a.anno_name, anno-descr: a.descr
      from DB x, x.feature f, f.anno.l2s a;
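The unnesting performed by this query corresponds to three nested loops, one per generator in the from clause. The Python below is a sketch under assumptions: dicts stand in for records, lists stand in for sets, and the helper l2s is a stand-in written here that simply drops duplicate list elements.

```python
def l2s(items):
    """Drop duplicates from a list, standing in for sSQL's l2s (list to set)."""
    unique = []
    for item in items:
        if item not in unique:
            unique.append(item)
    return unique


def flatten_feature_table(DB):
    """Produce one flat row per (entry, feature, annotation) combination."""
    rows = []
    for x in DB:                         # from DB x
        for f in x["feature"]:           #      x.feature f
            for a in l2s(f["anno"]):     #      f.anno.l2s a
                rows.append({
                    "title": x["title"],
                    "feature": f["name"],
                    "start": f["start"],
                    "end": f["end"],
                    "anno-name": a["anno_name"],
                    "anno-descr": a["descr"],
                })
    return rows


DB = [{
    "title": "... (PTP-1C)...",
    "feature": [{
        "name": "source", "start": 0, "end": 594,
        "anno": [
            {"anno_name": "organism", "descr": "Mus musculus"},
            {"anno_name": "db_xref", "descr": "taxon:10090"},
        ],
    }],
}]
for row in flatten_feature_table(DB):
    print(row)
```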

The next query demonstrates how to express nesting in sSQL. Notice that the entries field is a complex object having the same type as DB.
    create function nest-featureTable-by-organism (DB) as
      select organism: z,
             entries: (select distinct x
                       from DB x, x.feature f, f.anno a
                       where a.anno_name = "organism" and a.descr = z)
      from (select distinct y.anno-descr
            from DB.flatten-featureTable y
            where y.anno-name = "organism") z;
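The nesting step can be pictured as a group-by whose groups are whole entries rather than flat rows. The Python sketch below follows the shape of the sSQL query, first collecting the distinct organism names and then gathering the entries for each; the function name and sample data are illustrative assumptions, and lists stand in for sets.

```python
def nest_by_organism(DB):
    """Group feature-table entries under each organism they mention."""
    def organisms_of(x):
        return {
            a["descr"]
            for f in x["feature"]
            for a in f["anno"]
            if a["anno_name"] == "organism"
        }

    # Inner subquery: the distinct organism names found anywhere in DB.
    organisms = sorted(set().union(*(organisms_of(x) for x in DB)))

    # Outer query: for each organism z, the entries that mention z.
    return [
        {"organism": z, "entries": [x for x in DB if z in organisms_of(x)]}
        for z in organisms
    ]


def entry(title, organism):
    """Build a minimal sample entry (hypothetical test data)."""
    anno = [{"anno_name": "organism", "descr": organism}]
    return {"title": title, "feature": [{"name": "source", "anno": anno}]}


DB = [entry("a", "Mus musculus"), entry("b", "Homo sapiens"), entry("c", "Mus musculus")]
for group in nest_by_organism(DB):
    print(group["organism"], [x["title"] for x in group["entries"]])
```

Each entries value is itself a collection of complete feature-table entries, so it has the same shape as DB.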

The next couple of more substantial queries are inspired by one of the most advanced functionalities of the EnsMart interface of the EnsEMBL system [21].
Example 6.4.3 The feature table of a GenBank report has the type below. The field #position of a feature entry is a list indicating the start and stop positions of that feature. If the feature entry is a CDS, this list corresponds to the list of exons of the CDS. The field #anno is a list of annotations associated with the feature entry.

    (#uid: num, #title: string, #accession: string, #seq: string,
     #feature: {(#name: string,
                 #position: [(#start: num, #end: num, #negative: bool, ...)],
                 #anno: [(#anno_name: string, #descr: string)],
                 ...)},
     ...)

Given a set DB of feature tables of GenBank chromosome sequences, one can extract the 500 bases upstream of the translation initiation sites of all disease genes on the positive strand in DB, as below. Disease genes are taken here to be genes that have a cross reference to the Online Mendelian Inheritance in Man database (OMIM).
    select uid: x.uid, protein: r.descr,
           flank: string-span(x.seq, p.start - 500, p.start)
    from DB x, x.feature f, {f.position.list-head} p,
         f.anno.l2s a, f.anno.l2s r
    where not (p.#negative)
      and a.descr like "MIM:%" and a.anno_name = "db_xref"
      and r.anno_name = "protein_id"
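The logic of the upstream-flank query can be traced in the Python sketch below. Several points are assumptions rather than facts about Kleisli: string_span is taken to return a 0-based, half-open substring, a start before the beginning of the sequence is clamped to 0, and the function names and the width parameter are introduced here for illustration.

```python
def string_span(seq, start, end):
    """Assumed semantics: 0-based, half-open substring; starts below 0 are clamped."""
    return seq[max(start, 0):end]


def upstream_flanks(DB, width=500):
    """Return the bases upstream of the first exon of positive-strand OMIM genes."""
    out = []
    for x in DB:
        for f in x["feature"]:
            if not f["position"]:
                continue
            p = f["position"][0]                     # list-head: the first exon
            if p["negative"]:                        # keep the positive strand only
                continue
            has_omim_xref = any(
                a["anno_name"] == "db_xref" and a["descr"].startswith("MIM:")
                for a in f["anno"]
            )
            if not has_omim_xref:
                continue
            for r in f["anno"]:
                if r["anno_name"] == "protein_id":
                    out.append({
                        "uid": x["uid"],
                        "protein": r["descr"],
                        "flank": string_span(x["seq"], p["start"] - width, p["start"]),
                    })
    return out


DB = [{
    "uid": 1,
    "seq": "AAACCCGGGTTT",
    "feature": [{
        "name": "CDS",
        "position": [{"start": 6, "end": 9, "negative": False}],
        "anno": [
            {"anno_name": "db_xref", "descr": "MIM:123"},
            {"anno_name": "protein_id", "descr": "P1"},
        ],
    }],
}]
print(upstream_flanks(DB, width=3))
```

Replacing the span arguments with p["start"] and p["end"] would give the first-exon variant of the query.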

Similarly, one can extract the first exons of these same genes as follows:
    select uid: x.uid, protein: r.descr,
           exon1: string-span(x.seq, p.start, p.end)
    from DB x, x.feature f, {f.position.list-head} p,
         f.anno.l2s a, f.anno.l2s r
    where not (p.#negative)
      and a.descr like "MIM:%" and a.anno_name = "db_xref"
      and r.anno_name = "protein_id"

These two example queries illustrate how a high-level query language makes it possible to extract very specific output in a relatively straightforward manner.

The next query illustrates a more ambitious example of an in silico discovery kit (ISDK). Such a kit prescribes experimental steps carried out in computers very much like the experimental protocol carried out in wet-laboratories for a specific scientific investigation. From the perspective of Kleisli, an in silico discovery kit is just a script written in sSQL, and it performs a defined information integration task very similar to an integrated electronic circuit. It takes an input data set and parameters from the user, executes and integrates the necessary computational steps of database queries and applications of analysis programs or algorithms, and outputs a set of results for specific scientific inquiry.
Example 6.4.4 The simple in silico discovery kit illustrated in Figure 6.3 demonstrates how to use an available ontology data source to get around the problem of inconsistent naming in genes and proteins and to integrate information across multiple data sources. It is implemented in the following sSQL script.6 With the user input of gene name G, the ISDK performs the following tasks: "First, it retrieves a list of aliases for G from the gene nomenclature database provided by the Human Genome Organization (HUGO). Then it retrieves information for diseases associated with this particular protein in OMIM, and finally it retrieves all relevant references from MEDLINE."

    create function get-info-by-genename (G) as
      select hugo: w, omim: y, pmid1-abstract: z,
             num-medline-entries: list-sum(lselect ml-get-count-general(n)
                                           from x.Aliases.s2l n)
      from hugo-get-by-symbol(G) w,
           webomim-get-id(searchtime: 0, maxhits: 0,
                          searchfields: {}, searchterms: G) x,
           webomim-get-detail(x.uid) y,
           ml-get-abstract-by-uid(w.PMID1) z
      where x.title like ("%" ^ G ^ "%");

6. s2l denotes a function that converts a set into a list. list-sum is a function to sum a list of numbers. The function ml-get-count-general accesses MEDLINE and computes the number of MEDLINE reports matching a given keyword, whereas ml-get-abstract-by-uid is a function that accesses MEDLINE to retrieve reports given a unique identifier, and webomim-get-id accesses the OMIM database to obtain unique identifiers of OMIM reports matching a keyword. webomim-get-detail accesses OMIM to retrieve reports given a unique identifier. hugo-get-by-symbol is a function that accesses the HUGO database and returns HUGO reports matching a given gene name.
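The control flow of this discovery kit can be traced in ordinary Python. The sketch below is not a client for HUGO, OMIM, or MEDLINE: the five lookup functions are passed in as parameters, their signatures are a hypothetical simplification, and each is assumed to return an iterable of dict records. One further assumption is that the alias list is read from the HUGO record, since that is the record that carries the alias field in the description of the kit.

```python
def get_info_by_genename(G, hugo_get_by_symbol, webomim_get_id,
                         webomim_get_detail, ml_get_abstract_by_uid,
                         ml_get_count_general):
    """Combine HUGO, OMIM, and MEDLINE lookups for the gene name G."""
    results = []
    for w in hugo_get_by_symbol(G):                      # HUGO report and aliases
        # Number of MEDLINE reports summed over every alias of the gene.
        total = sum(ml_get_count_general(n) for n in w["Aliases"])
        for x in webomim_get_id(G):                      # OMIM identifiers
            if G not in x["title"]:                      # x.title like "%G%"
                continue
            for y in webomim_get_detail(x["uid"]):       # OMIM report
                for z in ml_get_abstract_by_uid(w["PMID1"]):   # MEDLINE abstract
                    results.append({
                        "hugo": w,
                        "omim": y,
                        "pmid1-abstract": z,
                        "num-medline-entries": total,
                    })
    return results


# Stub sources with made-up contents, standing in for the remote databases.
def stub_hugo(g):
    return [{"Symbol": g, "PMID1": "1535333", "Aliases": ["LAP", "CRP2"]}]

def stub_omim_id(g):
    return [{"uid": 189965, "title": "CEBPB protein"}, {"uid": 1, "title": "other"}]

def stub_omim_detail(uid):
    return [{"uid": uid}]

def stub_abstract(pmid):
    return [{"muid": int(pmid)}]

def stub_count(keyword):
    return {"LAP": 10, "CRP2": 5}[keyword]


print(get_info_by_genename("CEBPB", stub_hugo, stub_omim_id,
                           stub_omim_detail, stub_abstract, stub_count))
```

Passing the sources in as functions keeps the integration logic separate from the access code, which loosely mirrors how the sSQL script treats each source as just another function.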

For instance, this query get-info-by-genename can be invoked with the transcription factor CEBPB as input to obtain the following result.
    {(#hugo: (#HGNC: "1834", #Symbol: "CEBPB", #PMID1: "1535333", ...
              #Name: "CCAAT/enhancer binding protein (C/EBP), beta",
              #Aliases: {"LAP", "CRP2", "NFIL6", "IL6DBP", "TCF5"}),
      #omim: (#uid: 189965, #gene_map_locus: "20q13.1",
              #allelic_variants: {}, ...),
      #pmid1-abstract: (#muid: 1535333,
              #authors: "Szpirer C...", #address: "Departement de Biologie ...",
              #title: "Chromosomal localization in man and rat of the genes encoding ...",
              #abstract: "By means of somatic cell hybrids segregating either human...",
              #journal: "Genomics 1992 Jun;13(2):292-300"),
      #num-medline-entries: 1936)}

Figure 6.3 An "in silico discovery kit" that uses an available ontology data source to get around the problem of inconsistent naming in genes and proteins, and integrates information across multiple data sources. (Diagram labels: User Query (Keywords, terms); Kleisli Query Engine; Integration; Output.)

Such queries fulfill many of the requirements for efficient in silico discovery processes: (1) Their modular nature gives scientists the flexibility to select and combine specific queries for specific research projects; (2) they can be executed automatically by Kleisli in batch mode and can handle large data volumes; (3) their scripts are re-usable to perform repetitive tasks and can be shared among scientific collaborators; (4) they form a base set of templates that can be readily modified and refined to meet different specifications and to make new queries; and (5) new databases and new computational tools can be readily incorporated into existing scripts.

The flexibility and power shown in these sSQL examples can also be experienced in Object-Protocol Model (OPM) [22] and to a lesser extent in DiscoveryLink [23]. With good planning, a specialized data integration system can also achieve great flexibility and power within a more narrow context.
For example, the EnsMart tool of EnsEMBL [21] is a well-designed interface that helps a nonprogrammer build complex queries in a simple way. In fact, an equivalent query to the first sSQL query in Example 6.4.3 can also be specified using EnsMart with a few clicks of the mouse. Nevertheless, there are some unanticipated cases that cannot be expressed in EnsMart, such as the second sSQL query in Example 6.4.3.

While the syntactic basis for sSQL is SQL, its theoretical inspiration came from a paper by Tannen, Buneman, and Naqvi [24] where structural recursion was presented as a query language. However, structural recursion presents two difficulties. The first is that not every syntactically correct structural recursion program is logically well defined [25]. The second is that structural recursion has too much expressive power because it can express queries that require exponential time and space.
In the context of databases, which are typically very large, programs (queries) are usually restricted to those that are practical in the sense that they are in a low complexity class such as LOGSPACE, PTIME, or $\mathrm{TC}^0$. In fact, one may even want to prevent any query that has greater than $O(n \log n)$ complexity, unless one is
confident that the query optimizer has a high probability of optimizing the query to no more than $O(n \log n)$ complexity. Database query languages such as SQL, therefore, are designed in such a way that joins are easily recognized because joins are the only operations in a typical database query language that require $O(n^2)$ complexity if evaluated naively.

Thus, Tannen and Buneman suggested a natural restriction on structural recursion to reduce its expressive power and to guarantee it is well defined. Their restriction cuts structural recursion down to homomorphisms on the commutative idempotent monoid of sets, revealing a telling correspondence to monads [15]. A nested relational calculus, which is denoted here by NRC, was then designed around this restriction [16]. NRC is essentially the simply typed lambda calculus extended by a construct for building records, a construct for decomposing records by field selection, a construct for building sets, and a construct for decomposing sets by means of the restriction on structural recursion. Specifically, the construct for decomposing sets is $\bigcup\{ e_1 \mid x \in e_2 \}$, which forms a set by taking the big union of $e_1[o/x]$ over each $o$ in the set $e_2$.

The expressive power of NRC and its extensions has been studied extensively [16, 26-29]. Specifically, the NRC core has exactly the same power as all the standard nested relational calculi and, when restricted to flat tables as input-output, it has exactly the same power as the relational calculus. In the presence of arithmetics and a summation operator, when restricted to flat tables as input-output, it has exactly the power of entry-level SQL. Furthermore, it captures standard nested relational queries in a high-level manner that is easy for automated optimizer analysis. It is also easy to translate a more user-friendly surface syntax, such as the comprehension syntax or the SQL select-from-where syntax, into this core while allowing for full-fledged recursion and other operators to be imported easily as needed into the system.
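
A minimal Python sketch may make the set-decomposition construct concrete. The function name `ext` and the small tables `genes` and `bands` are illustrative assumptions and are not part of Kleisli or NRC; only the big-union behavior is taken from the text. The sketch shows how selection, projection, flattening, and a join can each be written with this single construct.

```python
from typing import Callable, FrozenSet, Iterable, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def ext(f: Callable[[A], Iterable[B]], s: Iterable[A]) -> FrozenSet[B]:
    """Big union of f(o) over each o in s, i.e. U{ f(x) | x in s }."""
    out = set()
    for o in s:
        out |= set(f(o))
    return frozenset(out)


# Hypothetical flat table of (gene, chromosome) records.
genes = frozenset({("CEBPB", "20"), ("TP53", "17"), ("BRCA1", "17")})

# Selection plus projection: names of the genes on chromosome 17.
on_17 = ext(lambda r: {r[0]} if r[1] == "17" else set(), genes)

# Flattening a set of sets is ext applied with the identity function.
nested = frozenset({frozenset({1, 2}), frozenset({2, 3})})
flat = ext(lambda s: s, nested)

# A join is two nested applications of ext (a naive nested loop).
bands = frozenset({("17", "q21"), ("20", "q13")})
joined = ext(
    lambda g: ext(lambda b: {(g[0], b[1])} if g[1] == b[0] else set(), bands),
    genes,
)

print(sorted(on_17))
print(sorted(flat))
print(sorted(joined))
```

Because the join appears as one `ext` nested directly inside another, it remains easy to recognize syntactically, which is the property the text asks of a practical query language.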

6.5 WAREHOUSING CAPABILITY


Besides the ability to query, assemble, and transform data from remote heterogeneous sources, it is also important to be able to conveniently warehouse the data locally. The reasons to create local warehouses are several: (1) it increases efficiency; (2) it increases availability; (3) it reduces the risk of unintended denial of service attacks on the original sources; and (4) it allows more careful data cleansing that cannot be done on the fly. The warehouse should be efficient to query and easy to update. Equally important in the biology arena, the warehouse should model the data in a conceptually natural form. Although a relational database system is efficient for querying and easy to update, its native data model of flat
tables forces an unnatural and unnecessary fragmentation of data to fit third normal form.

Kleisli does not have its own native database management system. Instead, Kleisli has the ability to turn many kinds of database systems into an updatable store conforming to its complex object data model. In particular, Kleisli can use flat relational database management systems such as Sybase, Oracle, and MySQL as its updatable complex object store. It can even use these systems simultaneously. This power of Kleisli is illustrated using the example of GenPept reports.

Example 6.5.1 Create a warehouse of GenPept reports and initialize it to reports on protein tyrosine phosphatases. Kleisli provides several functions to access GenPept reports remotely from Entrez [9]. One of them is aa-get-seqfeat-general, which retrieves GenPept reports matching a search string.

    ! connect to our Oracle database system
    oracle-cplobj-add (name: "db", ...);
    ! create a table to store GenPept reports
    create table genpept (uid: "NUMBER", detail: "LONG") using db;
    ! initialize it with PTP data
    select (uid: x.uid, detail: x) into genpept
    from aa-get-seqfeat-general("PTP") x using db;
    ! index the uid field for fast access
    db-mkindex (table: "genpept", index: "genpeptindex", schema: "uid");
    ! let's use it now to see the title of report 131470
    create view GenPept from genpept using db;
    select x.detail.title from GenPept x where x.uid = 131470;

In this example, a table genpept is created in the local Oracle database system. This table has two columns, uid for recording the unique identifier and detail for recording the GenPept report. A LONG data type is used for the detail column of this table. However, recall from Example 6.3.2 that each GenPept report is a highly nested complex object. There is therefore a mismatch between LONG (which is essentially a big, uninterpreted string) and the complex structure of a GenPept report. This mismatch is resolved by the Kleisli system, which automatically performs the appropriate encoding and decoding. Thus, as far as the Kleisli user is concerned, x.detail has the type of GenPept report as given in Example 6.3.1. So the user can ask for the title of a report as straightforwardly as x.detail.title. Note that encoding and decoding are performed to map the
complex object transparently into the space provided in the detail column; that is, the Kleisli system does not fragment the complex object to force it into third normal form.

There are two possible techniques to use a flat relational database system as a nested relational store. The first is to add a layer on top of the underlying flat relational database system to perform automatic normalization of nested relations into the third normal form. This is the approach taken by systems such as OPM [22]. Such an approach may lead to performance problems as the database system may be forced to perform many extra joins under certain situations. The second technique is to add a layer on top of the underlying flat relational database system to perform automatic encoding and decoding of nested components into long strings. This is the technique adopted in Kleisli because it avoids unnecessary joins and because it is a simple extension, without significant additional overhead, to the handling of Kleisli's data exchange format.
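
The second technique can be sketched in a few lines of Python. This is only an analogy under stated assumptions: SQLite stands in for Oracle, JSON stands in for Kleisli's data exchange format, and the helper names `store` and `fetch` as well as the sample report are invented for illustration.

```python
import json
import sqlite3

# SQLite (in memory) stands in for the Oracle database of Example 6.5.1.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genpept (uid INTEGER PRIMARY KEY, detail TEXT)")


def store(report: dict) -> None:
    """Encode the whole nested report into one string and insert it."""
    conn.execute(
        "INSERT INTO genpept (uid, detail) VALUES (?, ?)",
        (report["uid"], json.dumps(report)),
    )


def fetch(uid: int):
    """Decode the stored string back into a nested object, or return None."""
    row = conn.execute(
        "SELECT detail FROM genpept WHERE uid = ?", (uid,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# An invented report; the field names and values are placeholders.
store({
    "uid": 1,
    "title": "placeholder title",
    "feature": [{"name": "source", "anno": [{"anno_name": "organism"}]}],
})

report = fetch(1)
print(report["title"])
print(report["feature"][0]["anno"][0]["anno_name"])
```

The nested report occupies a single row, so reading it back needs one indexed lookup and no joins. The trade-off is that the database itself cannot look inside the encoded column.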

6.6 DATA SOURCES


The standard version of the Kleisli system marketed by geneticXchange, Inc. supports more than 60 types of data sources. These include the categories below.

• Relational database management systems: All popular relational database management systems are supported, such as Oracle, Sybase, DB2, Informix, and MySQL. The support for these systems is quite sophisticated. For example, the previous section illustrates how the Kleisli system can turn these flat database systems transparently into efficient complex object stores that support both read and write access. A later section also shows that the Kleisli system is able to perform significant query optimization involving these systems.

• Bioinformatics analysis packages: Most popular packages for analysis of protein sequences and other biological data are supported. These packages include Web-based and/or locally installed versions of WU-BLAST [30], Gapped BLAST [31], FASTA, CLUSTAL W [32], HMMER, BLOCKS, Profile Scan (PFSCAN) [33], NNPREDICT, PSORT, and many others.

• Biological databases: Many popular data sources of biological information are also supported by the Kleisli system, including AceDB [18], Entrez [9], LocusLink, UniGene, dbSNP, OMIM, PDB, SCOP [34], TIGR, KEGG, and MEDLINE. For each of these sources, Kleisli typically provides many access functions corresponding to different capabilities of the sources. For example,
Kleisli provides about 70 different but systematically organized functions to access and extract information from Entrez.

• Patent databases: Currently only access to the United States Patent and Trademark Office (USPTO) is supported.

• Interfaces: The Kleisli system also provides means for parsing input and writing output in HTML and XML formats. In addition, programming libraries are provided for Java and Perl to interface directly to Kleisli in a fashion similar to JDBC and ODBC. A graphical user interface called Discovery Builder is also available.

It is generally easy to develop a wrapper for a new data source, or modify an existing one, and insert it into Kleisli. There is no impedance mismatch between the data model supported by Kleisli and the data model necessary to capture the data source. The wrapper is therefore often a very lightweight parser that simply parses records in the data source and prints them out in Kleisli's simple data exchange format.
Example 6.6.1 Let us consider the webomim-get-detail function used in Example 6.4.4. It uses an OMIM identifier to access the OMIM database and returns a set of objects matching the identifier. The output is of type:

    {(#uid: num,
      #title: string,
      #gene_map_locus: {string},
      #alternative_titles: {string},
      #allelic_variants: {string})}

Note that this is a nested relation: It consists of a set of records, and each record has three fields that are also of set types, namely #gene_map_locus, #alternative_titles, and #allelic_variants. This type of output would definitely present a problem if it had to be sent to a system based on the flat relational model, as the information in these three fields would have to be re-arranged to be sent into separate tables.

Fortunately, such a nested structure can be mapped directly into Kleisli's exchange format. The wrapper implementor would only need to parse each matching OMIM record and write it out in a format as illustrated in the following:

    {(#uid: 189965,
      #title: "CCAAT/ENHANCER-BINDING PROTEIN, BETA; CEBPB",
      #gene_map_locus: "20q13.1",
      #alternative_titles: {"C/EBP-BETA",
        "INTERLEUKIN 6-DEPENDENT DNA-BINDING PROTEIN; IL6DBP",
        "LIVER ACTIVATOR PROTEIN; LAP",
        "LIVER-ENRICHED TRANSCRIPTIONAL ACTIVATOR PROTEIN",
        "TRANSCRIPTION FACTOR 5; TCF5"},
      #allelic_variants: {})}

Instead of needing to create separate tables to keep the sets nested inside each record, the wrapper simply prints the appropriate set brackets { and } to enclose these sets. Kleisli will automatically deal with them as they are handed over by the wrapper. This kind of parsing and printing is extremely easy to implement. Figure 6.4 shows the relevant chunk of Perl code in the OMIM wrapper implementing webomim-get-detail.
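
To suggest how little work the printing side of such a wrapper involves, here is a small Python sketch that renders nested values in the bracketed style of the OMIM example. The function name `to_exchange` is hypothetical, and the syntax rules it follows are inferred only from that single example. The full grammar of Kleisli's exchange format, including string escaping, is not covered.

```python
def to_exchange(value) -> str:
    """Render a nested Python value in the bracketed style shown above."""
    if isinstance(value, bool):
        raise TypeError("booleans are not covered by this sketch")
    if isinstance(value, (int, float)):
        return str(value)
    if isinstance(value, str):
        if '"' in value:
            raise ValueError("string escaping is not covered by this sketch")
        return f'"{value}"'
    if isinstance(value, dict):
        # A record is printed as (#field: value, ...).
        fields = ", ".join(f"#{k}: {to_exchange(v)}" for k, v in value.items())
        return f"({fields})"
    if isinstance(value, (list, tuple, set, frozenset)):
        # A set is printed as {value, ...}; lists keep the output order stable.
        return "{" + ", ".join(to_exchange(v) for v in value) + "}"
    raise TypeError(f"unsupported value: {value!r}")


# The OMIM record from the text, with sets written as Python lists.
omim = [{
    "uid": 189965,
    "title": "CCAAT/ENHANCER-BINDING PROTEIN, BETA; CEBPB",
    "gene_map_locus": "20q13.1",
    "alternative_titles": ["C/EBP-BETA", "TRANSCRIPTION FACTOR 5; TCF5"],
    "allelic_variants": [],
}]

text = to_exchange(omim)
print(text)
```

Nested sets need no special handling: the recursion simply emits another pair of braces, which mirrors the point made above about the wrapper.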

6.7 OPTIMIZATIONS

A feature that makes Oracle and Sybase much more productive to use than a raw file system is the availability of a high-level query language. Such a query language allows users to express their needs in a declarative, logical way. All low-level details such as opening files, handling disk blocks, using indices, decoding record and field boundaries, and so forth are hidden away and are automatically taken care of. However, there are two prices to pay, one direct and one indirect. The direct one is that if a high-level query is executed naively, the performance may be poor. The same high-level command often can be executed in several logically equivalent ways. However, which of these ways is more efficient often depends on the state of the data. A good optimizer can take the state of the data into consideration and pick the more efficient way to execute the high-level query. The indirect drawback is that because the query language is at a higher level, certain low-level details of programming are no longer expressible, even if these details are important to achieving better efficiency. However, a user who is less skilled in programming is now able to use the system. Such a user is not expected always to produce efficient programs. A good optimizer can transform inefficient programs into more efficient equivalent ones. Thus, a good optimizer is a key ingredient of a decent database system and of a general data integration system that supports ad hoc queries.

The Kleisli system has a fairly advanced query optimizer. The optimizations provided by this optimizer include (1) monadic optimizations that are derived from the equational theory of monads, such as vertical loop fusion; (2) context-sensitive optimizations, which are those equations that are true only in special contexts and that generally rely on certain long-range relationships between sub-expressions, such as the absorption of sub-expressions in the then-branch of an if-then-else construct that are equivalent to the condition of the construct;

#!/usr/bin/perl
#
# ... stuff for connecting to OMIM omitted ...
# ... <CMD> is the input stream to be parsed ...
#
# default values
$section = "none";
$state = 0; $id = "";

# the main program
print "{\n";

while (<CMD>) {
  chomp;
  if (/dispomim.cgi.cmd=entry.*id=([0-9]+)/) {
    $state = 1; $id = $1; $section = "title"; $line = "";
  }
  # look for keywords to begin parsing sections
  elsif (($state==1) && (/a href=\"\" name=\"$id\_(.*?)\"/)) {
    $section = "$1"; $line = $_;
  }
  # parse title
  elsif (($section eq "title") && (/<SPAN CLASS="H3"><font/)) {
    $title = $_; $title =~ s/<.*?>//g; $line = "";
  }
  # parse alternative titles
  elsif (($section eq "MIM") && (m-</p></em>(.*)</h4>-)) {
    $tmp = $1; @alts = split /<br>/, $tmp;
    foreach $x (@alts) { $alternativeTitles .= "\"$x\", "; }
    $alternativeTitles =~ s/, $//;
  }
  # parse gene map location
  elsif (($section eq "TEXT") && ($line =~ /^Gene map locus *(.*)/)) {
    $geneMapLocus = $1;
    $geneMapLocus =~ s/<.*?>//g; $line = "";
  }
  # parsing for allelic variants
  # each allelic variant will have its own section
  # need to group the allelic variants across sections
  elsif ($section eq "ALLELIC_VARIANTS") { $variantTitle = ""; }
  elsif ($section =~ /AllelicVariant/) {
    $_ = $line; s/.\d+ *//; s/<.*?>//g;
    $variantTitle .= "\"$_\", ";
  }
  elsif (($state==1) && ($section eq "CREATION_DATE")) {
    $state = 0;
    $variantTitle =~ s/, $//; $variantTitle =~ s/\",/\"\n/g;
    # ... remainder of this branch not shown in this excerpt ...
  }
}

FIGURE 6.4 The Perl code of the wrapper implementing the webomim-get-detail function of Kleisli. It demonstrates the ease of developing wrappers for handling data sources that contain nested objects.

(3) relational optimizations, which are optimizations relating to relational database sources such as the migration of projections, selections, and joins to the external relational database management system; and (4) many other optimizations such as parallelism, code motion, and selective introduction of laziness.
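
The context-sensitive rule mentioned in item (2) can be illustrated with a toy rewriter. The tuple-based expression encoding and the function names below are assumptions made for this sketch and say nothing about Kleisli's internal representation. The sketch also assumes that expressions are free of side effects and that no variable of the condition is rebound inside the then-branch.

```python
TRUE = ("const", True)


def substitute(expr, target, replacement):
    """Replace every sub-expression structurally equal to target."""
    if expr == target:
        return replacement
    if isinstance(expr, tuple):
        return tuple(substitute(part, target, replacement) for part in expr)
    return expr


def absorb(expr):
    """Inside each then-branch, the enclosing condition is known to hold,
    so any copy of that condition there is rewritten to the constant true."""
    if not isinstance(expr, tuple):
        return expr
    expr = tuple(absorb(part) for part in expr)
    if expr and expr[0] == "if":
        _, cond, then_branch, else_branch = expr
        then_branch = substitute(then_branch, cond, TRUE)
        return ("if", cond, then_branch, else_branch)
    return expr


# if x = 0 then (x = 0 and p) else q
cond = ("eq", ("var", "x"), ("const", 0))
query = ("if", cond, ("and", cond, ("var", "p")), ("var", "q"))

optimized = absorb(query)
print(optimized)
```

The rewrite is valid only because of where the sub-expression sits: the same replacement applied in the else-branch or outside the construct would be wrong, which is what makes the rule context-sensitive.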

6.7.1 Monadic Optimizations


The restricted form of structural recursion corresponds to the presentation of monads by Kleisli [15, 16] and is expressed by the combinator $\bigcup\{ f(x) \mid x \in R \}$ obeying the following three equations:

$$\bigcup\{ f(x) \mid x \in \{\} \} = \{\}$$
$$\bigcup\{ f(x) \mid x \in A \cup B \} = \left(\bigcup\{ f(x) \mid x \in A \}\right) \cup \left(\bigcup\{ f(x) \mid x \in B \}\right)$$
$$\bigcup\{ f(x) \mid x \in \{o\} \} = f(o)$$

This combinator is at the heart of the NRC, the abstract representation of queries in the implementation of sSQL. It earns its central position in the Kleisli system because it offers tremendous practical and theoretical convenience. The direct correspondence in sSQL is: select y from R x, f(x) y. This combinator is a key operator in the library of complex object routines in Kleisli. All sSQL queries can be and are first translated into NRC via Wadler's identities [15, 16].

The practical convenience of the $\bigcup\{f(x) \mid x \in R\}$ combinator is best seen in query optimizations. A well-known optimization rule is vertical loop fusion [35], which corresponds to the physical notion of getting rid of intermediate data and the logical notion of quantifier elimination. Such an optimization on queries in the comprehension syntax can be expressed informally as

$$\{e \mid G_1, \ldots, G_n, x \in \{e' \mid H_1, \ldots, H_m\}, J_1, \ldots, J_k\} \rightsquigarrow \{e[e'/x] \mid G_1, \ldots, G_n, H_1, \ldots, H_m, J_1[e'/x], \ldots, J_k[e'/x]\}.$$

Such a rule in comprehension form is simple to grasp: The intermediate set built by the comprehension $\{e' \mid H_1, \ldots, H_m\}$ is eliminated in favor of generating the x on the fly. In practice, the rule is quite messy to implement because the informal "..." denotes any number of generator-filters in a comprehension. An immediate implementation would involve a nasty traversal routine to skip over the non-applicable $G_i$ to locate the applicable $x \in \{e' \mid H_1, \ldots, H_m\}$ and $J_i$.

The effect of the $\bigcup\{f(x) \mid x \in R\}$ combinator on the optimization rule for vertical loop fusion is dramatic. This optimization is now expressed as

$$\bigcup\{f(x) \mid x \in \bigcup\{g(y) \mid y \in R\}\} \rightsquigarrow \bigcup\{\bigcup\{f(x) \mid x \in g(y)\} \mid y \in R\}.$$

The informal and troublesome "..." no longer appears. Such a rule can be coded straightforwardly in almost any implementation language.
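As a rough illustration, the combinator and both sides of the fused rule can be sketched on finite sets. The sketch is in Python rather than SML, and the names ext, f, g, and R, as well as the sample data, are assumptions made here for illustration; none of them come from Kleisli.

```python
def ext(f, r):
    """Compute U{ f(x) | x in r }: apply f to every element, union the results."""
    out = set()
    for x in r:
        out |= f(x)
    return frozenset(out)

# Hypothetical sample data (not from the chapter).
R = frozenset({1, 2, 3})

def g(y):
    return frozenset({y, y * 10})

def f(x):
    return frozenset({x + 1})

# Left-hand side of the rule: materializes the intermediate set U{ g(y) | y in R }.
unfused = ext(f, ext(g, R))

# Right-hand side: each g(y) is consumed as soon as it is produced.
fused = ext(lambda y: ext(f, g(y)), R)

print(sorted(unfused), sorted(fused))
```

Both forms yield the same set; the fused form simply never builds the intermediate collection.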


To illustrate this point more concretely, it is necessary to introduce some detail from the implementation of the Kleisli system. Recall from the introductory section that Kleisli is implemented on top of SML. The type SYN of SML objects that represent queries in Kleisli is declared as:

    type VAR = int    (* Variables, represented by int *)
    type SVR = int    (* Server connections, represented by int *)
    type CO = ...     (* Representation of complex objects *)
    datatype SYN =
        EmptySet                               (* { } *)
      | SngSet of SYN                          (* { E } *)
      | UnionSet of SYN * SYN                  (* E1 U E2 *)
      | ExtSet of SYN * VAR * SYN              (* U{ E1 | \x <- E2 } *)
      | IfThenElse of SYN * SYN * SYN          (* if E1 then E2 else E3 *)
      | Read of SVR * real * SYN               (* process E using S, the real is the request
                                                  priority assigned by optimizer *)
      | Variable of VAR                        (* x *)
      | Binary of (CO * CO -> CO) * SYN * SYN  (* Construct for caching static objects. This
                                                  allows the optimizer to insert some codes
                                                  for doing dynamic optimization *)
      | ...

All SML objects that represent optimization rules in Kleisli are functions and have type RULE:

    type RULE = SYN -> SYN option

If an optimization rule r can be successfully applied to rewrite an expression e to an expression e', then r(e) = SOME(e'). If it cannot be successfully applied, then r(e) = NONE. Now the vertical loop fusion has a very simple implementation.

Example 6.7.1 Vertical loop fusion.

    fun Vertfusion (ExtSet (E1, x, ExtSet (E2, y, E3))) =
          SOME (ExtSet (ExtSet (E1, x, E2), y, E3))
      | Vertfusion _ = NONE
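For readers less familiar with SML pattern matching, the same rule can be sketched in Python. This is only an illustrative translation: the class names mirror three of the SYN constructors, the field names body, var, and source are assumptions introduced here, and None plays the role of NONE.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass(frozen=True)
class Variable:
    name: int

@dataclass(frozen=True)
class SngSet:
    elem: "Syn"

@dataclass(frozen=True)
class ExtSet:
    """U{ body | \\var <- source }"""
    body: "Syn"
    var: int
    source: "Syn"

Syn = Union[Variable, SngSet, ExtSet]

def vert_fusion(e: Syn) -> Optional[Syn]:
    """Return the fused query, or None when the rule does not apply."""
    if isinstance(e, ExtSet) and isinstance(e.source, ExtSet):
        inner = e.source
        # ExtSet(E1, x, ExtSet(E2, y, E3)) ~> ExtSet(ExtSet(E1, x, E2), y, E3)
        return ExtSet(ExtSet(e.body, e.var, inner.body), inner.var, inner.source)
    return None

# Hypothetical query: U{ {x1} | x1 <- U{ {x2} | x2 <- x0 } }
query = ExtSet(SngSet(Variable(1)), 1,
               ExtSet(SngSet(Variable(2)), 2, Variable(0)))
print(vert_fusion(query))
```

The whole rule is a single structural test followed by a rebuild, which is the point the text makes about the absence of "...".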


6.7.2 Context-Sensitive Optimizations


The Kleisli optimizer has an extensible number of phases. Each phase is associated with a rule base and a rule-application strategy. A large number of rule-application strategies are supported, such as BottomUpOnce, which applies rules to rewrite an expression tree from leaves to root in a single pass. By exploiting higher-order functions, these rule-application strategies can be decomposed into a traversal component common to all strategies and a simple control component special for each strategy. In short, higher-order functions can generate these strategies extremely simply, resulting in a small optimizer core. To give some ideas on how this is done, some SML code fragments from the optimizer module mentioned are presented below.

The traversal component is a higher-order function shared by all strategies:

    val Decompose: (SYN -> SYN) -> SYN -> SYN

Recall that SYN is the type of SML object that represents query expressions. The Decompose function accepts a rewrite rule r and a query expression Q. Then it applies r to all immediate subtrees of Q to rewrite these immediate subtrees. Note that it does not touch the root of Q and it does not traverse Q; it just non-recursively rewrites immediate subtrees using r. It is, therefore, very straightforward and can be expressed as follows:

    fun Decompose r (SngSet N) = SngSet (r N)
      | Decompose r (UnionSet (N, M)) = UnionSet (r N, r M)
      | Decompose r (ExtSet (N, x, M)) = ExtSet (r N, x, r M)
      | ...

A rule-application strategy S is a function having the following type:

    val S: RULEDB -> SYN -> SYN

The precise definition of the type RULEDB is not important at this point and is deferred until later. Such a function takes in a rule base R and a query expression Q and optimizes it to a new query expression Q' by applying rules in R according to the strategy S.

Assume that Pick: RULEDB -> RULE is an SML function that takes a rule base R and a query expression Q and returns NONE if no rule is applicable, and SOME(Q') if some rule in R can be applied to rewrite Q to Q'. Then the control components of all the strategies mentioned earlier can be generated easily.


Example 6.7.2 The BottomUpOnce strategy applies rules in a leaves-to-root pass. It tries to rewrite each node at most once as it moves toward the root of the query expression. Here is its control component:

    fun BottomUpOnce RDB Qry =
      let fun Pass SubQry =
            let val BetterSubQry = Decompose Pass SubQry
            in case Pick RDB BetterSubQry
                 of SOME EvenBetterSubQry => EvenBetterSubQry
                  | NONE => BetterSubQry
            end
      in Pass Qry end
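The split between a shared traversal component and a per-strategy control component can be tried out with a small Python sketch. Everything here is an assumption made for illustration: terms are encoded as tagged tuples, the rule base is simplified to a plain list of (name, rule) pairs, and drop_empty is a hypothetical rule that is not claimed to be part of Kleisli.

```python
# Terms are tagged tuples: ("Var", n), ("Empty",), ("Sng", e), ("Union", e1, e2).

def decompose(r, q):
    """Apply r to the immediate subtrees of q only; the root is left alone."""
    tag = q[0]
    if tag == "Sng":
        return ("Sng", r(q[1]))
    if tag == "Union":
        return ("Union", r(q[1]), r(q[2]))
    return q  # leaves have no subtrees

def pick(rdb, q):
    """Return the result of the first applicable rule, or None."""
    for _name, rule in rdb:
        result = rule(q)
        if result is not None:
            return result
    return None

def bottom_up_once(rdb, qry):
    """Rewrite subtrees first, then try the rule base once on the rebuilt node."""
    def pass_(sub):
        better = decompose(pass_, sub)
        even_better = pick(rdb, better)
        return better if even_better is None else even_better
    return pass_(qry)

def drop_empty(q):
    """Hypothetical rule: E U {} ~> E and {} U E ~> E."""
    if q[0] == "Union" and q[2] == ("Empty",):
        return q[1]
    if q[0] == "Union" and q[1] == ("Empty",):
        return q[2]
    return None

query = ("Sng", ("Union", ("Union", ("Var", 1), ("Empty",)), ("Empty",)))
print(bottom_up_once([("drop_empty", drop_empty)], query))
```

Other strategies would reuse decompose unchanged and differ only in when pick is called relative to the recursive descent.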

The following class of rules requires the use of multiple rule-application strategies. The scope of rules like the vertical loop fusion in the previous section is over the entire query. In contrast, this class of rules has two parts. The inner part is context sensitive, and its scope is limited to certain components of the query. The outer part scopes over the entire query to identify contexts where the inner part can be applied. The two parts of the rule can be applied using completely different strategies.

A rule base RDB is represented in the system as an SML record of type:

    type RULEDB = {
      DoTrace: bool ref,
      Trace: (rulename -> SYN -> SYN -> unit) ref,
      Rules: (rulename * RULE) list ref }

The Rules field of RDB stores the list of rules in RDB together with their names. The Trace field of RDB stores a function f that is to be used for tracing the usage of the rules in RDB. The DoTrace field of RDB stores a flag to indicate whether tracing is to be done. If tracing is indicated, then whenever a rule of name N in RDB is applied successfully to transform a query Q to Q', the trace function is invoked as f N Q Q' to record a trace. Normally, this simply means a message like "Q is rewritten to Q' using the rule N" is printed. However, the trace function f is allowed to carry out considerably more complicated activities.

It is possible to exploit trace functions to achieve sophisticated transformations in a simple way. An example is the rule that rewrites if e1 then ... e1 ... else e3 to if e1 then ... true ... else e3. The inner part of this rule rewrites e1 to true. The outer part of this rule identifies the context and scope of the inner part of this rule: limited to the then-branch. This example is very intuitive to a human being. In the then-branch of a conditional, all sub-expressions identical to the test predicate of the conditional must eventually evaluate to true.


However, such a rule is not so straightforward to express to a machine. The informal "..." are again in the way. Fortunately, rules of this kind are straightforward to implement in Kleisli.

Example 6.7.3 The if-then-else absorption rule is expressed by the AbsorbThen rule below. The rule has three clauses. The first clause says the rule should not be applied to an IfThenElse whose test predicate is already a Boolean constant because it would lead to non-termination otherwise. The second clause says the rule should be applied to all other forms of IfThenElse. The third clause says the rule is not applicable in any other situation.

    fun AbsorbThen (IfThenElse (Bool _, _, _)) = NONE
      | AbsorbThen (IfThenElse (E1, E2, E3)) =
          let fun Then E = if SyntaxTools.Equiv E1 E
                           then SOME (Bool true) else NONE
          in case ContextSensitive Then TopDownOnce E2
               of SOME E2' => SOME (IfThenElse (E1, E2', E3))
                | NONE => NONE
          end
      | AbsorbThen _ = NONE

The second clause is the meat of the implementation. The inner part of the rewrite if e1 then ... e1 ... else e3 to if e1 then ... true ... else e3 is captured by the function Then, which rewrites any e identical to e1 to true. This function is then supplied as the rule to be applied using the TopDownOnce strategy within the scope of the then-branch ... e1 ... using the ContextSensitive rule generator given as follows.

    fun ContextSensitive Rule Strategy Qry =
      let val Changed = ref false   (* This flag is set if Rule is applied *)
          val RDB = {               (* Set up a context-sensitive rule base *)
            DoTrace = ref true,
            Trace = ref (fn _ => fn _ => fn _ => Changed := true),
                                    (* Changed is true if Rule is used *)
            Rules = ref [("", Rule)] }
          val OptimizedQry = Strategy RDB Qry   (* Apply Rule using Strategy. *)
      in if !Changed then SOME OptimizedQry else NONE
      end


This ContextSensitive rule generator is re-used in many other context-sensitive optimization rules, such as the rule for migrating projections to external relational database systems to be presented shortly.
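The interplay between the trace function, the ContextSensitive generator, and the AbsorbThen rule of Example 6.7.3 can be mimicked in a short Python sketch. It rests on several assumptions made only for illustration: terms are tagged tuples, the rule base is a dictionary, plain structural equality stands in for SyntaxTools.Equiv, and top_down_once is a guess at what a top-down, at-most-once strategy might look like rather than Kleisli's actual TopDownOnce.

```python
# Terms: ("Var", n), ("Bool", b), ("And", e1, e2), ("If", test, then, else).

def decompose(r, q):
    """Rewrite the immediate subtrees of q with r; the root is untouched."""
    if q[0] in ("If", "And"):
        return (q[0],) + tuple(r(sub) for sub in q[1:])
    return q  # ("Var", n) and ("Bool", b) are leaves

def apply_rules(rdb, q):
    """Try each named rule once; report a successful rewrite to the tracer."""
    for name, rule in rdb["rules"]:
        new = rule(q)
        if new is not None:
            if rdb["do_trace"]:
                rdb["trace"](name, q, new)
            return new
    return q

def top_down_once(rdb, q):
    """Assumed strategy: rewrite the root at most once, then descend."""
    return decompose(lambda sub: top_down_once(rdb, sub), apply_rules(rdb, q))

def context_sensitive(rule, strategy, qry):
    """Run rule under strategy; return the result only if rule fired."""
    changed = [False]
    def trace(_name, _old, _new):
        changed[0] = True          # the trace hook doubles as a change flag
    rdb = {"do_trace": True, "trace": trace, "rules": [("", rule)]}
    optimized = strategy(rdb, qry)
    return optimized if changed[0] else None

def absorb_then(q):
    if q[0] != "If" or q[1][0] == "Bool":
        return None                # not a conditional, or test already constant
    test, then_branch, else_branch = q[1], q[2], q[3]
    def then_rule(e):
        return ("Bool", True) if e == test else None
    new_then = context_sensitive(then_rule, top_down_once, then_branch)
    return None if new_then is None else ("If", test, new_then, else_branch)

p = ("Var", 1)
print(absorb_then(("If", p, ("And", p, ("Var", 2)), ("Var", 3))))
```

The design point carried over from the SML version is that the generator needs no knowledge of the strategy it is given; it learns whether anything happened purely through the trace callback.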

6.7.3 Relational Optimizations


Relational database systems are the most powerful data sources to which Kleisli interfaces. These database systems are equipped with the ability to perform sophisticated transformations expressed in SQL. A good optimizer should aim to migrate as many operations in Kleisli to these systems as possible. There are four main optimizations that are useful in this context: the migration of projections, selections, and joins on a single database; and the migration of joins across two databases. The Kleisli optimizer has four different rules to exploit these four opportunities.

A special case of the rule for migrating projections is to rewrite select x.name from (process "select * from T" using A) x to select x.name from (process "select name from T" using A) x, where process Q using A denotes sending an SQL query Q to a relational database A. In the original query, the entire table T has to be retrieved. In the rewritten query, only one column of that table has to be retrieved. More generally, if x is from a relational database system and every use of x is in the context of a field projection x.l, these projections can be pushed to the relational database so that unused fields are not retrieved and transferred.
Example 6.7.4 The rule for migrating projections to a relational database is implemented by MigrateProj in this example. The rule requires a function FullyProjected x N that traverses an expression N to determine whether x is always used within N in the context of a field projection and to determine what fields are being projected; it returns NONE if x is not always used in such a context; otherwise, it returns SOME L, where the list L contains all the fields being projected. This function is implemented in a simple way using the ContextSensitive rule generator from Example 6.7.3.

    fun FullyProjected x N =
      let val (Count, Projs) = (ref 0, ref [])
          fun FindProjs (Variable y) =
                (if x = y then inc Count else (); NONE)
            | FindProjs (Proj (L, Variable y)) =
                (if x = y then Projs := L :: (!Projs) else (); NONE)
            | FindProjs _ = NONE
      in ContextSensitive FindProjs BottomUpOnce N;
         if length (!Projs) = !Count then SOME (!Projs) else NONE
      end

The MigrateProj rule is defined below. The function SQL.PushProj is one of the many support routines available in the current release of Kleisli that handle manipulation of SQL queries and other SYN abstract syntax objects.

    fun MigrateProj (ExtSet (N, x, Read (S, p, String M))) =
          if Annotations.IsSQL S            (* test if S connects to a SQL server *)
          then case FullyProjected x N      (* test if x is always in a projection *)
                 of SOME Projs =>
                      SOME (ExtSet (N, x, Read (S, p, String (SQL.PushProj Projs M))))
                  | NONE => NONE
          else NONE
      | MigrateProj _ = NONE

Besides the four migration rules mentioned previously, Kleisli has various other rules, including reordering joins on two relational databases, parallelizing queries, and large-scale code motion, the description of which is omitted in the chapter due to space constraints.
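The projection-migration rule of Example 6.7.4 can also be illustrated with a simplified Python sketch. It departs from the SML code in ways that should be read as assumptions: fully_projected uses a direct recursive walk instead of the ContextSensitive counting technique, variable shadowing is ignored, SQL_SERVERS is a hypothetical stand-in for Annotations.IsSQL, and push_proj is a hypothetical string rewrite that handles only queries starting with "select * "; it is not Kleisli's SQL.PushProj.

```python
# Terms: ("Var", x), ("Proj", label, e), ("Sng", e), ("Pair", e1, e2),
#        ("Ext", body, x, source), ("Read", server, priority, ("String", sql)).

SQL_SERVERS = {1}  # hypothetical: ids of servers assumed to accept SQL

def fully_projected(x, n):
    """Return the fields projected from x in n, or None if x is used bare."""
    fields = []
    def walk(e):
        tag = e[0]
        if tag == "Var":
            return e[1] != x               # a bare use of x disqualifies
        if tag == "Proj":
            if e[2] == ("Var", x):
                fields.append(e[1])
                return True
            return walk(e[2])
        if tag == "Sng":
            return walk(e[1])
        if tag == "Pair":
            return walk(e[1]) and walk(e[2])
        return True                        # any other leaf
    return fields if walk(n) else None

def push_proj(fields, sql):
    """Hypothetical helper: narrow 'select * ...' to the listed columns."""
    prefix = "select * "
    if not fields or not sql.startswith(prefix):
        return sql
    return "select " + ", ".join(sorted(set(fields))) + " " + sql[len(prefix):]

def migrate_proj(q):
    if q[0] == "Ext" and q[3][0] == "Read" and q[3][3][0] == "String":
        body, x, (_, server, prio, (_, sql)) = q[1], q[2], q[3]
        if server in SQL_SERVERS:
            fields = fully_projected(x, body)
            if fields is not None:
                new_sql = push_proj(fields, sql)
                if new_sql != sql:         # report a rewrite only if one happened
                    return ("Ext", body, x,
                            ("Read", server, prio, ("String", new_sql)))
    return None

query = ("Ext", ("Sng", ("Proj", "name", ("Var", 7))), 7,
         ("Read", 1, 0.5, ("String", "select * from T")))
print(migrate_proj(query))
```

Returning None when the SQL text is unchanged is a choice made in this sketch so that a fixed-point driver would not loop; the chapter does not say how Kleisli handles that case.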

6.8 USER INTERFACES


Kleisli is equipped with application programming interfaces for use with Java and Perl. It also has a graphical interface for non-programmers. These interfaces are described in this section.

6.8.1 Programming Language Interface


The high-level query language, sSQL, of the Kleisli system was designed to express traditional (nested relational) database-style queries. Not every query in bioinformatics falls into this class. For these non-database-style queries, some other programming languages can be a more convenient or more efficient means of implementation. The Pizzkell suite [19] of interfaces to the Kleisli exchange format was developed for various popular programming languages. Each of these interfaces in the Pizzkell suite is a library package for parsing data in Kleisli's exchange format into an internal object of the corresponding programming language. It also serves as a means for embedding the Kleisli system into that programming language so that the full power of Kleisli is made available within that programming language. The Pizzkell suite currently includes CPL2Perl and CPL2Java, for Perl and Java.

In contrast to sSQL in Kleisli, which is a high-level interface that comes with a sophisticated optimizer and other database-style features, CPL2Perl has a different purpose and is at a lower level. Whereas sSQL is aimed at extraction, integration, and preparation of data for analysis, CPL2Perl is intended to be used for implementing analysis and textual formatting of the prepared data in Perl. Thus, CPL2Perl is a Perl module for parsing data conforming to the data exchange format of Kleisli into native Perl objects.

The main functions in CPL2Perl are divided into three packages:
1. The RECORD package simulates the record data type of the Kleisli exchange format by using a reference of Perl's hash. Some functions are defined in this package:

   • new is the constructor of a record. For example, to create a record such as (#anno_name: "db_xref", #descr: "taxon:10090") in a Perl program, one writes:

         $rec = RECORD->new("anno_name", "db_xref", "descr", "taxon:10090");

     where $rec becomes the reference of this record in the Perl program.

   • Project gets the value of a specified field in a record. For example,

         $rec->Project("descr");

     will return the value of the field #descr in the record referenced by $rec in the Perl program.

2. The LIST package simulates the list, set, and bag data type in the Kleisli data exchange format. These three bulk data types are to be converted as a reference of Perl's list. Its main function is:

   • new is the constructor of bulk data such as a list, a bag, or a set. It works the same way as a list initialization in Perl:

         $l = LIST->new("tom", "jerry");

     where $l will be the reference of this list in the Perl program.

3. The CPLIO package provides the interface to read data directly from a Kleisli-formatted data file or pipe. It supports both eager and lazy access methods. Some functions in this package are:


• Open1 opens the specified Kleisli-formatted data file and returns the handle of this file in Perl. It supports all the input-related features of the usual open operation of Perl, including the use of pipes. For example, the following expression opens a Kleisli-formatted file "sequences.val":

    $hd = CPLIO->Open1("sequences.val");

• Open1a is another version of the Open1 function, which can take a string as input stream. The first parameter specifies the child process to execute and the second parameter is the input string. An example that calls Kleisli from CPL2Perl to extract accession numbers from a sequence file is expressed as follows:

    $cmd = qq{
      create view X from sequences using stdin;
      select x.accession from X x; };
    $a = CPLIO->Open1a("./ssql", $cmd);

• Open2 differs from Open1a in that it allows a program to communicate in both directions with Kleisli or other systems. It is parameterized by the Kleisli or other systems to call. It returns a list consisting of a reference of CPLIO object and an input stream that the requests can be sent into. For example:

    ($a, $b) = CPLIO->Open2("./ssql");
    print $b $cmd1; flush $b; $res = $a->Parse;
    ...
    print $b $cmd2; flush $b; $res = $a->Parse;

• Parse is a function that reads all the data from an opened file until it can assemble a complete object or a semicolon (;) is found. The return value will be the reference of the parsed object in Perl. For example, printing the values of the field #accession of an opened file may be expressed by:

    $set = $hd->Parse;
    foreach $rec (@{$set}) {
      $n = $rec->Project("accession");
      print "$n\n"; }

• LazyRead is a function that reads data lazily from an opened file. This function is used when the data type in the opened file is a set, a bag, or a list. This function only reads one element into memory at a time. Thus, if the opened file is a very big set, LazyRead is just the right function to access the records and print accession numbers:

    while (1) {
      $rec = $hd->LazyRead; last if ($rec eq "");
      $n = $rec->Project("accession");
      print "$n\n"; }
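The contrast between the eager Parse and the lazy LazyRead access methods can also be illustrated outside Perl. The following is a minimal Python sketch and not the CPL2Perl API: it assumes a simplified, hypothetical line-oriented file with one accession number per line, whereas real Kleisli exchange files hold nested objects. The names parse_all and lazy_read and the sample data are illustrative assumptions.

```python
import io
from typing import Iterator, List, TextIO


def parse_all(stream: TextIO) -> List[dict]:
    """Eager access: read every record into memory before returning."""
    return [{"accession": line.strip()} for line in stream if line.strip()]


def lazy_read(stream: TextIO) -> Iterator[dict]:
    """Lazy access: yield one record at a time instead of building a full list."""
    for line in stream:
        if line.strip():
            yield {"accession": line.strip()}


# Hypothetical sample data standing in for a large set of sequence records.
sample = "A00001\nA00002\nA00003\n"

eager_records = parse_all(io.StringIO(sample))
lazy_accessions = [rec["accession"] for rec in lazy_read(io.StringIO(sample))]
```

Both calls visit the same records; the lazy version simply never materializes the whole set, which is the property that matters when the opened file is very large.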

The use of CPL2Perl for interfacing Kleisli to the Graphviz system is demonstrated in the context of the Protein Interaction Extraction System described in a paper by Wong [36]. Graphviz [37] is a system for automatic layout of directed graphs. It accepts a general directed graph specification, which is in essence a list of arcs of the form x → y, which specifies an arc is to be drawn from the node x to the node y.

Example 6.8.1 Assume that Kleisli produces a file $SPEC of type {(#actor: string, #interaction: string, #patient: string)} which describes a protein interaction pathway. The records express that an actor inhibits or activates a patient. The relevant parts of a Perl implementation of the module MkGif that accepts this file, converts it into a directed graph specification, invokes Graphviz to layout, and draws it as a GIF file $GIF using CPL2Perl is expressed as follows:

    use cpl2perl;
    $a = CPLIO->Open1("$SPEC");
    open(DOT, "| ./dot -Tgif > $GIF");
    print DOT "digraph aGraph {\n";
    while (1) {
      $rec = $a->LazyRead; last if ($rec eq "");
      $start = $rec->Project("actor");
      $end = $rec->Project("patient");
      $type = $rec->Project("interaction");
      if ($type eq "inhibit") { $edgecolor = "red"; }
      else { $edgecolor = "green"; }
      $edge = "[color = $edgecolor]";
      print DOT "$start -> $end $edge;\n"; }
    print DOT "};\n";
    $a->Close;
    close(DOT);


The first three lines establish the connections to the Kleisli file $SPEC and to the Graphviz program dot. The next few lines use the LazyRead and Project functions of CPL2Perl to extract each interaction record from the file and to format it for Graphviz to process. Upon finishing the layout computation, Graphviz draws the interaction pathway into the file $GIF. This example, though short, demonstrates how CPL2Perl smoothly integrates the Kleisli exchange format into Perl. This greatly facilitates both the development of data drivers for Kleisli and the development of downstream processing (such as pretty printing) of results produced by Kleisli.
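The record-to-graph conversion at the heart of this example can be sketched independently of Kleisli. The Python fragment below is an illustration under stated assumptions, not the MkGif module: the function name to_dot, the in-memory list of dictionaries, and the protein names are hypothetical, and node names are quoted so that arbitrary labels remain valid in the Graphviz DOT language.

```python
from typing import Dict, Iterable


def to_dot(records: Iterable[Dict[str, str]], name: str = "aGraph") -> str:
    """Build a Graphviz DOT digraph from actor/interaction/patient records."""
    lines = [f"digraph {name} {{"]
    for rec in records:
        # As in Example 6.8.1: inhibition is drawn red, anything else green.
        color = "red" if rec["interaction"] == "inhibit" else "green"
        lines.append(f'  "{rec["actor"]}" -> "{rec["patient"]}" [color = {color}];')
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical interaction records, for illustration only.
records = [
    {"actor": "proteinA", "interaction": "inhibit", "patient": "proteinB"},
    {"actor": "proteinB", "interaction": "activate", "patient": "proteinC"},
]

dot_text = to_dot(records)
```

The resulting text could then be written to the standard input of the dot program with the -Tgif option, as the Perl module does through its pipe.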

6.8.2 Graphical Interface


The Discovery Builder is a graphical interface to the Kleisli system designed for non-programmers by geneticXchange, Inc. This graphical interface facilitates the visualization of the source data as required to formulate the queries and generates the necessary sSQL codes. It allows users to see all available data sources and their associated meta-data and assists them in navigating and specifying their query on these sources with the following key functions:

• A graphical interface that can see all the relevant biological data sources, including meta-data (tables, columns, descriptions, etc.), and then construct a query as if the data were local
• Add new wrappers for any public or proprietary data sources, typically within hours, and then have them enjoined in any series of ad hoc queries that can be created
• Execute the queries, which may join many data sources that can be scattered all over the globe, and get fresh result data quickly

The Discovery Builder interface is presented in Figure 6.5.

6.9 OTHER DATA INTEGRATION TECHNOLOGIES

The brief description of several other approaches to bioinformatics data integration problems emphasizes Kleisli's characteristics. The alternatives include Sequence Retrieval System (SRS) [38], DiscoveryLink [23], and OPM [22].

6.9.1 SRS

SRS [38] (also presented in Chapter 5) is marketed by LION Bioscience and is arguably the most widely used database query and navigation system for the life science community. It provides easy-to-use graphical user interface access to a broad range of scientific databases, including biological sequences, metabolic pathways, and literature abstracts. SRS provides some functionalities to search across public, in-house and in-licensed databases. To add a new data source into SRS, the data source is generally required to be available as a flat file, and a description of the schema or structure of the data source must be available as an Icarus script, which is the special built-in wrapper programming language of SRS. The notable exception to this flat file requirement on the data source is when the data source is a relational database. SRS then indexes this data source on various fields parsed and described by the Icarus script. A biologist then accesses the data by supplying some keywords and constraints on them in the SRS query language, and all records matching those keywords and constraints are returned.

The SRS query language is primarily a navigational language. This query language has limited data joining capabilities based on indexed fields and has limited data restructuring capabilities. The results are returned as a simple aggregation of records that match the search constraints. In short, in terms of querying power, SRS is essentially an information retrieval system. It brings back records matching specified keywords and constraints. These records can contain embedded links a user can follow individually to obtain deeper information. However, it does not offer much help in organizing or transforming the retrieved results in a way that might be needed for setting up an analytical pipeline. There is also a Web browser-based interface for formulating SRS queries and viewing results. In fact, this interface of SRS is often used by biologists as a unified front end to access multiple data sources independently, rather than learning the idiosyncrasies of the original search interfaces of these data sources. For this reason, SRS is sometimes considered to serve "more of a user interface integration role rather than as a true data integration tool" [39].

FIGURE 6.5 The Discovery Builder graphical interface to Kleisli.

In summary, SRS has two main strengths.
First, because of the simplicity of flat file indexing, adding new data sources into the system with the Icarus scripting language is easy. In fact, several hundred data sources have been incorporated into SRS to date. Second, it has a nice user interface that greatly simplifies query formulation, making the system usable by a biologist without the assistance of a programmer. In addition, SRS has an extension known as Prisma designed for automating the process of maintaining an SRS warehouse. Prisma integrates the tasks of monitoring remote data sources for new data sets and downloading and indexing such data sets. On the other hand, SRS also has some weaknesses. First, it is basically a retrieval system that returns entries in a simple aggregation. To perform further operations or transformations on the results, a biologist has to do that by hand or write a separate post-processing program using some external scripting language like C or Perl, which is cumbersome. Second, its principally flat-file based indexing mechanism rules out the use of certain remote data sources (in particular, those that are not relational databases) and does not provide for straightforward integration with dynamic analysis tools. However, this latter shortcoming is mitigated by the Scout suite of applications marketed by LION Bioscience that are specifically designed to interact with SRS.
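The retrieval style described for SRS, in which fields of flat-file entries are indexed and the matching entries are returned as a simple aggregation, can be sketched as follows. This is a Python illustration only, not SRS or Icarus: the entries, field names, and the functions build_index and search are hypothetical assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical flat-file entries; field names are illustrative only.
entries = [
    {"id": "E1", "organism": "human", "keyword": "kinase"},
    {"id": "E2", "organism": "mouse", "keyword": "kinase"},
    {"id": "E3", "organism": "human", "keyword": "protease"},
]


def build_index(records: List[Dict[str, str]],
                fields: List[str]) -> Dict[Tuple[str, str], Set[int]]:
    """Map each (field, value) pair to the positions of the entries containing it."""
    index: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for pos, rec in enumerate(records):
        for field in fields:
            index[(field, rec[field])].add(pos)
    return index


def search(records, index, **constraints) -> List[Dict[str, str]]:
    """Return the entries matching all constraints, unchanged and unrestructured."""
    hits = set(range(len(records)))
    for field, value in constraints.items():
        hits &= index.get((field, value), set())
    return [records[pos] for pos in sorted(hits)]


index = build_index(entries, ["organism", "keyword"])
human_kinases = search(entries, index, organism="human", keyword="kinase")
```

Note that search hands back whole entries as they were stored. Any joining or restructuring of the results would have to happen in a separate post-processing step, which is the limitation discussed above.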

6.9.2 DiscoveryLink

DiscoveryLink [23] (also presented in Chapter 11) is an IBM product and, in principle, it goes one step beyond SRS as a general data integration system for biomedical data. The first thing that stands out when DiscoveryLink is compared to SRS and more specialized integration solutions like EnsEMBL and GenoMax is the presence of an explicit data model. This data model dictates the way DiscoveryLink users view the underlying data, the way they view results, as well as the way they query the data. The data model is the relational data model [17]. The relational data model is the de facto data model of most commercial database management systems, including IBM's DB2 database management system, upon which DiscoveryLink is based. As a result, DiscoveryLink comes with a high-level query language, SQL, that is a standard feature of all relational database management systems. This gives DiscoveryLink several advantages over SRS. First, not only can users easily express SQL queries that go across multiple data sources, which SRS users are able to do, but they can also perform further manipulations on the results, which SRS users are unable to do. Second, not only are the SQL queries more powerful and expressive than those of SRS, the SQL queries are also automatically optimized by DB2. Query optimization allows users to concentrate on getting their queries right without worrying about getting them fast.

However, DiscoveryLink still has to overcome difficulties. The first reason is that DiscoveryLink is tied to the relational data model.
This implies that every piece of data it handles must be a table of atomic objects, such as strings and numbers. Unfortunately, most of the data sources in biology are not that simple and are deeply nested. Therefore, there is some impedance mismatch between these sources and DiscoveryLink. Consequently, it is not straightforward to add new data sources or analysis tools into the system. For example, to put the Swiss-Prot [40] database into a relational database in the third normal form would require breaking every Swiss-Prot record into several pieces in a normalization process. Such a normalization process requires a certain amount of skill. Similarly, querying the normalized data in DiscoveryLink requires some mental and performance overhead, as the user needs to figure out which part of Swiss-Prot has gone to which of the pieces and to join some of the pieces back again to reconstruct the entry.
The second reason is that DiscoveryLink supports only wrappers written in C++, which is not the most suitable programming language for writing wrappers. In short, it is not straightforward to extend DiscoveryLink with new sources. In addition, DiscoveryLink does not store nested objects in a natural way and is very limited in its capability for handling long documents. It also has limitations as a tool for creating and managing data warehouses for biology.
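The impedance mismatch between nested biological records and flat relational tables can be made concrete with a small sketch. The Python fragment below is an illustration under assumptions, not DiscoveryLink or the actual Swiss-Prot schema: the nested entry, its field names, and the functions normalize and reconstruct are hypothetical, and the three tables are a simplified stand-in for a real third normal form design.

```python
from typing import Dict, List, Tuple

# A hypothetical nested entry, loosely in the spirit of a Swiss-Prot record.
entry = {
    "accession": "X00001",
    "description": "example protein",
    "references": [{"title": "first paper"}, {"title": "second paper"}],
    "features": [{"name": "DOMAIN", "start": 10, "end": 90}],
}


def normalize(rec: Dict) -> Tuple[List[tuple], List[tuple], List[tuple]]:
    """Break one nested entry into three flat tables keyed by accession."""
    acc = rec["accession"]
    entries = [(acc, rec["description"])]
    references = [(acc, ref["title"]) for ref in rec["references"]]
    features = [(acc, f["name"], f["start"], f["end"]) for f in rec["features"]]
    return entries, references, features


def reconstruct(acc, entries, references, features) -> Dict:
    """Join the pieces back together on the accession key."""
    description = next(d for a, d in entries if a == acc)
    return {
        "accession": acc,
        "description": description,
        "references": [{"title": t} for a, t in references if a == acc],
        "features": [{"name": n, "start": s, "end": e}
                     for a, n, s, e in features if a == acc],
    }


tables = normalize(entry)
rebuilt = reconstruct("X00001", *tables)
```

Even in this toy case, reading back a single entry requires knowing which table holds which part and joining all three on the key, which is the mental and performance overhead described above.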

6.9.3 Object-Protocol Model (OPM)

Developed at Lawrence-Berkeley National Labs, OPM [22] is a general data integration system. OPM was marketed by GeneLogic, but its sales were discontinued some time ago. It goes one step beyond DiscoveryLink in the sense that it has a more powerful data model, which is an enriched form of the entity-relationship data model [41]. This data model can deal with the deeply nested structure of biomedical data in a natural way. Thus, it removes the impedance mismatch. This data model is also supported by an SQL-like query language that allows data to be seen in terms of entities and relationships. Queries across multiple data sources, as well as transformation of results, can be easily and naturally expressed in this query language. Queries are also optimized. Furthermore, OPM comes with a number of data management tools that are useful for designing an integrated data warehouse on top of OPM.

However, OPM has several weaknesses. First, OPM requires the use of a global integrated schema. It requires significant skill and effort to design a global integrated schema well. If a new data source needs to be added, the effort needed to redesign the global integrated schema potentially goes up quadratically with respect to the number of data sources already integrated. If an underlying source evolves, the global integrated schema tends to be affected and significant re-design effort may be needed. Therefore, it may be costly to extend OPM with new sources. Second, OPM stores entities and relationships internally using a relational database management system. It achieves this by automatically converting the entities and relationships into a set of relational tables in the third normal form.
This conversion process breaks down entities into many pieces when stored. This process is transparent to OPM users, so they can continue to think and query in terms of entities and relationships. Nevertheless, the underlying fragmentation often causes performance problems, as many queries that do not involve joins at the conceptual level of entities and relations are mapped to queries that evoke many joins on the physical pieces to reconstruct broken entities. Third, OPM does not have a simple format to exchange data with external systems. At one stage, it interfaced to external sources using the Common Object Request Broker Architecture (CORBA). The effort required for developing CORBA-compliant wrappers is generally significant [42]. Furthermore, CORBA is not designed for data-intensive applications.

6.10 CONCLUSIONS

In the era of genome-enabled, large-scale biology, high-throughput technologies from DNA sequencing, microarray gene expression and mass spectroscopy, to combinatory chemistry and high-throughput screening have generated an unprecedented volume and diversity of data. These data are deposited in disparate, specialized, geographically dispersed databases that are heterogeneous in data formats and semantic representations. In parallel, there is a rapid proliferation of computational tools and scientific algorithms for data analysis and knowledge extraction. The challenge to life science today is how to process and integrate this massive amount of data and information for research and discovery. The heterogeneous and dynamic nature of biomedical data sources presents a continuing challenge to accessing, retrieving, and integrating information across multiple sources. Many features of the Kleisli system [2, 5, 43] are particularly suitable for automating the data integration process.
Kleisli employs a distributed and federated approach to access external data sources via the wrapper layer, and thus can access the most up-to-date data on demand. Kleisli provides a complex nested internal data model that encompasses most of the current popular data models including flat files, HTML, XML, and relational databases, and thus serves as a natural data exchanger for different data formats. Kleisli offers a robust query optimizer and a powerful and expressive query language to manipulate and transform data, and thus facilitates data integration. Finally, Kleisli has the capability of converting relational database management systems such as Sybase, MySQL, Oracle, DB2, and Informix into nested relational stores, thus enabling the creation of robust warehouses of complex biomedical data.
Leveraging the capabilities of Kleisli leads to the development of query scripts that give us a high-level abstraction beyond low-level codes to access a combination of the relevant data and the right tools to solve the right problem. Kleisli embodies many of the advances in database query languages and in functional programming. The first is its use of a complex object data model in which sets, bags, lists, records, and variants can be flexibly combined. The second is its use of a high-level query language that allows these objects to be easily manipulated. The third is its use of a self-describing data exchange format, which serves as a simple conduit to external data sources. The fourth is its query optimizer, which is capable of many powerful optimizations. It has had significant impact on data integration in bioinformatics. Indeed, since the early Kleisli prototype was applied to bioinformatics, it has been used efficiently to solve many bioinformatics data integration problems.
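The complex object data model summarized above can be illustrated with a minimal Python sketch. This is not Kleisli syntax; it only shows records that nest lists of records, a comprehension-style query that flattens them, and a bag (multiset) of values. The field names and sample values are invented for illustration.

```python
from collections import Counter

# Hypothetical nested records: each entry nests a list of feature records.
entries = [
    {"accession": "A1", "organism": "human",
     "features": [{"name": "domain", "start": 5, "end": 40},
                  {"name": "site", "start": 12, "end": 12}]},
    {"accession": "A2", "organism": "mouse",
     "features": [{"name": "domain", "start": 3, "end": 30}]},
]


def features_of(records, organism):
    """Flatten the nested feature lists of all entries from one organism."""
    return [(e["accession"], f["name"])
            for e in records if e["organism"] == organism
            for f in e["features"]]


def feature_bag(records):
    """Collect feature names into a bag, keeping duplicate counts."""
    return Counter(f["name"] for e in records for f in e["features"])


print(features_of(entries, "human"))
print(feature_bag(entries))
```

The comprehension walks two levels of nesting in one expression, which is the style of manipulation a nested data model makes convenient compared with flat tables.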

REFERENCES

[1] National Center for Biotechnology Information (NCBI). NCBI ASN.1 Specification, revision 2.0. Bethesda, MD: National Library of Medicine, 1992.
[2] L. Wong. "Kleisli: A Functional Query System." Journal of Functional Programming 10, no. 1 (2000): 19-56.
[3] J. Backus. "Can Programming Be Liberated from Von Neumann Style? A Functional Style and Its Algebra of Programs." Communications of the ACM 21, no. 8 (1978): 613-641.
[4] J. Darlington. "An Experimental Program Transformation and Synthesis System." Artificial Intelligence 16, no. 1 (1981): 1-46.
[5] S. Davidson, C. Overton, V. Tannen, et al. "BioKleisli: A Digital Library for Biomedical Researchers." International Journal of Digital Libraries 1, no. 1 (1997): 36-53.
[6] R. J. Robbins, ed. Report of the Invitational DOE Workshop on Genome Informatics, 26-27. Baltimore, MD: April 1993.
[7] P. Pearson, N. W. Matheson, D. L. Flescher, et al. "The GDB Human Genome Data Base Anno 1992." Nucleic Acids Research 20, supplement (1992): 2201-2206.
[8] C. Burks, M. J. Cinkosky, and W. M. Fischer. "GenBank." Nucleic Acids Research 20, supplement (1992): 2065-2069.
[9] G. D. Schuler, J. A. Epstein, H. Ohkawa, et al. "Entrez: Molecular Biology Database and Retrieval System." Methods in Enzymology 266 (1996): 141-162.
[10] L. C. Bailey Jr., S. Fischer, J. Schug, et al. "GAIA: Framework Annotation of Genomic Sequence." Genome Research 8, no. 3 (1998): 234-250.
[11] P. G. Baker, A. Brass, and S. Bechhofer. "TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources." Intelligent Systems for Molecular Biology 6 (1998): 25-34.
[12] C. A. Goble, R. Stevens, and G. Ng. "Transparent Access to Multiple Bioinformatics Information Sources." IBM Systems Journal 40, no. 2 (2001): 532-552.
[13] C. Schoenbach, J. Koh, and X. Sheng. "FIMM: A Database of Functional Molecular Immunology." Nucleic Acids Research 28, no. 1 (2000): 222-224.
[14] P. Buneman, L. Libkin, and D. Suciu. "Comprehension Syntax." SIGMOD Record 23, no. 1 (1994): 87-96.
[15] P. Wadler. "Comprehending Monads." Mathematical Structures in Computer Science 2, no. 4 (1992): 461-493.
[16] P. Buneman, S. Naqvi, V. Tannen, et al. "Principles of Programming With Complex Objects and Collection Types." Theoretical Computer Science 149, no. 1 (1995): 3-48.
[17] E. F. Codd. "A Relational Model for Large Shared Data Bank." Communications of the ACM 13, no. 6 (1970): 377-387.
[18] S. Walsh, M. Anderson, S. W. Cartinhour, et al. "ACEDB: A Database for Genome Information." Methods of Biochemical Analysis 39 (1998): 299-318.
[19] L. Wong. "Kleisli: Its Exchange Format, Supporting Tools, and an Application in Protein Interaction Extraction." In Proceedings of the First IEEE International Symposium on Bio-Informatics and Biomedical Engineering, 21-28. Los Alamitos, CA: IEEE Computer Society, 2000.
[20] International Standards Organization (ISO). Standard 8824: Information Processing Systems. Open Systems Interconnection. Specification of Abstract Syntax Notation One (ASN.1). Geneva, Switzerland: ISO, 1987.
[21] T. Hubbard, D. Barker, and E. Birney. "The ENSEMBL Genome Database Project." Nucleic Acids Research 30, no. 1 (2002): 38-41.
[22] I. M. A. Chen and V. M. Markowitz. "An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools." Information Systems 20, no. 5 (1995): 393-418.
[23] L. M. Haas, P. M. Schwarz, and P. Kodali. "DiscoveryLink: A System for Integrated Access to Life Sciences Data Sources." IBM Systems Journal 40, no. 2 (2001): 489-511.
[24] V. Tannen, P. Buneman, and S. Naqvi. "Structural Recursion as a Query Language." In Proceedings of the Third International Workshop on Database Programming Languages, 9-19. San Francisco: Morgan Kaufmann, 1991.
[25] V. Tannen and R. Subrahmanyam. "Logical and Computational Aspects of Programming with Sets/Bags/Lists." In Proceedings of the 18th International Colloquium on Automata, Languages, and Programming, Lecture Notes in Computer Science, vol. 510, 60-75. Berlin, Germany: Springer-Verlag, 1991.
[26] D. Suciu. "Bounded Fixpoints for Complex Objects." Theoretical Computer Science 176, no. 1-2 (1997): 283-328.
[27] G. Dong, L. Libkin, and L. Wong. "Local Properties of Query Languages." Theoretical Computer Science 239, no. 2 (2000): 277-308.
[28] L. Libkin and L. Wong. "Query Languages for Bags and Aggregate Functions." Journal of Computer and System Sciences 55, no. 2 (1997): 241-272.
[29] D. Suciu and L. Wong. "On Two Forms of Structural Recursion." In Proceedings of the Fifth International Conference on Database Theory, Lecture Notes in Computer Science, vol. 893, 111-124. Berlin, Germany: Springer-Verlag, 1995.
[30] S. F. Altschul and W. Gish. "Local Alignment Statistics." Methods in Enzymology 266 (1996): 460-480.
[31] S. F. Altschul, T. L. Madden, and A. A. Schaffer. "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs." Nucleic Acids Research 25, no. 17 (1997): 3389-3402.
[32] J. D. Thompson, D. G. Higgins, and T. J. Gibson. "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Positions-Specific Gap Penalties and Weight Matrix Choice." Nucleic Acids Research 22 (1994): 4673-4680.
[33] L. Falquet, M. Pagni, P. Bucher, et al. "The PROSITE Database, Its Status in 2002." In Nucleic Acids Research 30, no. 1 (2002): 235-238. http://hits.isb-sib.ch/cgibin/PFSCAN.
[34] A. Murzin, S. E. Brenner, T. Hubbard, et al. "SCOP: A Structural Classification of Protein Database for the Investigation of Sequences and Structures." Journal of Molecular Biology 247, no. 4 (1995): 536-540.
[35] A. Goldberg and R. Paige. "Stream Processing." In Proceedings of the ACM Symposium on LISP and Functional Programming, 53-62. New York: ACM, 1984.
[36] L. Wong. "PIES, A Protein Interaction Extraction System." In Proceedings of the Pacific Symposium in Biocomputing, 520-531. Singapore: World Scientific, 2001.
[37] E. R. Gansner, E. Koutsofios, S. C. North, et al. "A Technique for Drawing Directed Graphs." IEEE Transactions on Software Engineering 19, no. 3 (1993): 214-230.
[38] T. Etzold and P. Argos. "SRS: Information Retrieval System for Molecular Biology Data Banks." Methods in Enzymology 266 (1996): 114-128.
[39] 3rd Millennium Inc. "Practical Data Integration in Biopharmaceutical R&D: Strategies and Technologies." Cambridge, MA: 3rd Millennium, 2002.
[40] A. Bairoch and R. Apweiler. "The SWISS-PROT Protein Sequence Data Bank and Its Supplement TrEMBL in 1999." Nucleic Acids Research 27, no. 1 (1999): 49-54.
[41] P. P. S. Chen. "The Entity-Relationship Model: Toward a Unified View of Data." ACM Transactions on Database Systems 1, no. 1 (1976): 9-36.
[42] J. Selletin and B. Mitschang. "Data-Intensive Intra- & Internet Applications: Experiences Using Java and CORBA in the World Wide Web." In Proceedings of the Fourteenth IEEE International Conference on Data Engineering, 302-311. Los Alamitos, CA: IEEE Computer Society, 1998.
[43] L. Wong. "The Functional Guts of the Kleisli Query System." In Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, 1-10. New York: ACM, 2000.


CHAPTER 7

Complex Query Formulation Over Diverse Information Sources in TAMBIS

Robert Stevens, Carole Goble, Norman W. Paton, Sean Bechhofer, Gary Ng, Patricia Baker, and Andy Brass

Molecular biology is a data-rich discipline that has produced a vast quantity of sequence and other data. Most of the resulting data sets are held in independently developed databanks and are acted upon by separate analysis tools. These information sources and tools are autonomous, distributed, and have differing call interfaces. As such, they manifest classical syntactic and semantic heterogeneity problems [1].

Many bioinformatics tasks are supported by individual sources. However, biologists increasingly wish to ask complex questions that span a range of the available sources [2]. This places barriers between a biologist and the task to be accomplished; the biologist has to know what sources to use, the locations of the sources, how to use the sources (both syntactically and their semantics), and how to transfer data between the sources.

This chapter presents an approach to solving these problems called Transparent Access to Multiple Bioinformatics Information Sources (TAMBIS) [3]. This chapter reports on the first version of the TAMBIS system, which was developed between 1996 and 2000. A second version extends and develops this first version, addressing some of the problems recognized in the approach. This new version is introduced in Section 7.5.

The TAMBIS approach attempts to avoid the pitfalls described previously by using an ontology of molecular biology and bioinformatics to manage the presentation and usage of the sources. An ontology is a description of the concepts, and the relationships between those concepts, within a domain.


The ontology allows TAMBIS:

• to provide a homogenizing layer over the numerous databases and analysis tools
• to manage the heterogeneities between the data sources
• to provide a common, consistent query-forming user interface that allows queries across sources to be precisely expressed and progressively refined

This ontology is the backbone of the TAMBIS system; it is what the user interacts with to form questions. It allows the same style of query and terms to be used across diverse resources, and it also manages the answering of the query itself.

A concept is a description of a set of instances, so a concept or description can also be viewed as a query. The TAMBIS system is used for retrieving instances described by concepts in the model. This contrasts with queries phrased in terms of the structures used to store the data, as are used in conventional database query environments. This approach allows a biologist to ask complex questions that access and combine data from different sources. However, in TAMBIS, the user does not have to choose the sources, identify the location of the sources, express requests in the language of the source, or transfer data items between sources.
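The idea that a concept is a description of a set of instances, and can therefore be read as a query, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Concept` class, the record fields, and the sample data are hypothetical and are not part of TAMBIS.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Concept:
    """A named description; `describes` decides whether a record is an instance."""
    name: str
    describes: Callable[[dict], bool]

    def restricted_by(self, other: "Concept") -> "Concept":
        # Combining two descriptions yields a more specific description.
        return Concept(f"{self.name} and {other.name}",
                       lambda r: self.describes(r) and other.describes(r))

    def instances(self, records):
        # Reading the concept as a query: return everything it describes.
        return [r["id"] for r in records if self.describes(r)]


# Hypothetical records standing in for data pooled from several sources.
RECORDS = [
    {"id": "x1", "kind": "protein", "species": "human"},
    {"id": "x2", "kind": "protein", "species": "mouse"},
    {"id": "x3", "kind": "nucleic acid", "species": "human"},
]

protein = Concept("Protein", lambda r: r["kind"] == "protein")
from_human = Concept("from human", lambda r: r["species"] == "human")
human_protein = protein.restricted_by(from_human)

print(human_protein.name, human_protein.instances(RECORDS))
```

The query is phrased purely in terms of what the instances are, not in terms of which table or file stores them, which is the contrast the text draws with conventional database query environments.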

FIGURE 7.1 The flow of information through the TAMBIS architecture. [Diagram components: User; Conceptual Query Formulation Interface (with Ontology Server); source-independent conceptual query; Source Selection and Query Planning (with Sources and Services Model); source-dependent query plan; Query Plan Execution (over Wrapped Sources); results.]

Figure 7.1 shows how a query is constructed and processed through the TAMBIS system. The steps in processing a TAMBIS query are as follows:

1. A query is formulated in terms of the concepts and relationships in the ontology using the visual Conceptual Query Formulation Interface. This interface allows the ontology to be browsed by users and supports the construction of complex concept descriptions that serve as queries. The output of the query formulation process is a source-independent conceptual query. The query formulation interface makes extensive use of the TAMBIS Ontology Server, which not only stores the ontology but supports various reasoning services over the ontology. These reasoning services serve, for example, to ensure that queries constructed using the query formulation interface are biologically meaningful with respect to the TAMBIS ontology.

2. Given a query, TAMBIS must identify the sources that can be used to answer the query and construct valid and efficient plans for evaluating it given the facilities provided by the relevant sources. The source selection and query planning process makes extensive use of the Sources and Services Model (SSM), which associates concepts and relationships from the ontology with the services provided by the sources. The output of the source selection and query planning process is a source-dependent query plan that describes the sources to be used and the order in which calls should be made to the sources.

3. The query plan execution process takes the plan provided by the planner and executes that plan over the wrapped sources to yield an answer to the query. Sources are wrapped so they can be accessed in a syntactically consistent manner. In version one of TAMBIS, each source is represented as a collection of function calls, which are evaluated by the collection programming language (CPL) [4]. The sources used in TAMBIS 1.0 were Swiss-Prot, ENZYME, CATH (Classes, Architecture, Topology, Homology), Basic Local Alignment Search Tool (BLAST), and PROSITE.
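The three steps above can be sketched as a small Python pipeline. This is a toy illustration, not the TAMBIS implementation and not CPL: the wrapper functions, the term names in the `SSM` table (such as `Protein.hasSpecies`), and all data values are hypothetical, chosen only to show a source-independent query being turned into an ordered, source-dependent plan and then executed over wrapped sources.

```python
# Hypothetical wrapped sources: plain functions with a uniform call style.
def swissprot_proteins_by_species(species):
    data = {"human": ["P1", "P2"], "mouse": ["P3"]}  # invented sample data
    return data.get(species, [])


def prosite_motifs_in_protein(accession):
    data = {"P1": ["motif_a"], "P2": ["motif_a", "motif_b"], "P3": []}
    return data.get(accession, [])


# Hypothetical Sources and Services Model: ontology term -> (source, service).
SSM = {
    "Protein.hasSpecies": ("Swiss-Prot", swissprot_proteins_by_species),
    "Motif.isComponentOf.Protein": ("PROSITE", prosite_motifs_in_protein),
}


def plan(conceptual_query):
    """Step 2: map each ontology term to a source call, preserving call order."""
    return [SSM[term] for term in conceptual_query["terms"]]


def execute(query_plan, argument):
    """Step 3: run the calls in order, feeding each result into the next call."""
    values = [argument]
    for _source, call in query_plan:
        values = [out for value in values for out in call(value)]
    return values


# Step 1 (normally done through the query interface): a source-independent query.
query = {"terms": ["Protein.hasSpecies", "Motif.isComponentOf.Protein"],
         "argument": "human"}

query_plan = plan(query)
print([source for source, _ in query_plan])
print(execute(query_plan, query["argument"]))
```

Note that the query itself never names a source; the source names appear only once the plan has been built from the lookup table.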

The remainder of this chapter is organized as follows. Section 7.1 gives a brief overview of the TAMBIS ontology, describing its scope and the language in which it is implemented. Section 7.2 describes how users interact with TAMBIS, in particular how the ontology is explored and how queries are constructed using the interface. Section 7.3 describes how queries constructed using the interface from Section 7.2 are evaluated over the individual sources. Section 7.4 describes work in several areas related to TAMBIS and describes how TAMBIS compares to alternative or complementary proposals. Section 7.5 considers issues relating to query construction and source integration raised by experience in TAMBIS and how these are addressed in TAMBIS 2.0.

7.1 THE ONTOLOGY


An ontology is a description of the concepts and their relationships within a domain. An ontology is a mechanism by which knowledge about a domain can be captured in computational form and shared within a community [5]. The TAMBIS ontology describes both molecular biology and bioinformatics tasks. A concept represents a class of individuals within a domain. Concepts such as Protein and Nucleic acid are part of the world of molecular biology. An Accession number, which acts as a unique identifier for an entry in an information source, lies outside this domain but is essential for describing bioinformatics tasks in molecular biology. The TAMBIS ontology contains only concepts and the relationships between those concepts. Individuals that are members of concept classes (P21598 is an individual of the class Accession number) do not appear in the TAMBIS ontology. Such individuals are contained within the external resources over which TAMBIS answers queries.

The TAMBIS ontology has been designed to cover the standard range of bioinformatics retrieval and analysis tasks [2]. This means that a broad range of biology has been described. The model is, however, currently quite shallow, although the detail present is sufficient to allow descriptions of most retrieval tasks supportable using the integrated bioinformatics sources. In addition, precision can arise from the ability to combine concepts to create more specialized concepts (see Section 7.2.2).

The model is centered upon the biopolymers Protein and Nucleic acid and their children, such as Enzyme, DNA, and RNA. Biological functions and processes are also present, so it is possible to describe, for example, the kinds of reactions that are catalyzed by an enzyme. Many tasks in bioinformatics involve comparing or identifying patterns in sequences. As a result, sequence components such as protein motifs and structure classifications are described. For example, a motif is a pattern within a sequence that is generally associated with some biological function. The ontology thus supports the description of motifs and various different kinds of motifs. Such descriptions are facilitated by the presence of a rich collection of relationships between concepts in the ontology. These basic concepts are present in the isa hierarchy. Other relationships add richness to the model, so that a wide range of biological features can be described. For example, Motif (and its children) can be components (parts of) Protein or Nucleic acid. Other relationships capture associations to functions, processes, sub-cellular locations, similarities, and labels such as species name, gene names, protein names, and accession numbers. The model is described in more detail in an article in Bioinformatics [6] and can be browsed via an applet on the TAMBIS Web site.
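As a rough illustration of an ontology that holds only concepts and relationships, the following Python sketch encodes a tiny isa hierarchy and one partitive relationship. It is a simplification under stated assumptions: the parent assignments (for example, Enzyme under Protein) and the top-level name Biopolymer are guesses made for the example, and a plain dictionary lookup stands in for real description logic reasoning.

```python
# Assumed isa links (child -> parent); the exact TAMBIS hierarchy may differ.
ISA = {
    "Enzyme": "Protein",
    "DNA": "Nucleic acid",
    "RNA": "Nucleic acid",
    "Protein": "Biopolymer",
    "Nucleic acid": "Biopolymer",
}

# Non-isa relationships: (subject concept, relationship) -> allowed targets.
RELATIONS = {
    ("Motif", "isComponentOf"): {"Protein", "Nucleic acid"},
}


def ancestors(concept):
    """Walk the isa links upward and return every ancestor, nearest first."""
    chain = []
    while concept in ISA:
        concept = ISA[concept]
        chain.append(concept)
    return chain


def is_kind_of(child, parent):
    """True if `child` is `parent` or lies below it in the isa hierarchy."""
    return child == parent or parent in ancestors(child)


def can_relate(subject, relation, target):
    """True if the relationship is declared for the subject (or an ancestor)
    and the target is a kind of one of the allowed target concepts."""
    for s in [subject] + ancestors(subject):
        for allowed in RELATIONS.get((s, relation), ()):
            if is_kind_of(target, allowed):
                return True
    return False


print(is_kind_of("Enzyme", "Protein"))
print(can_relate("Motif", "isComponentOf", "Enzyme"))
```

No individual such as a particular accession number appears in these tables; as in the text, instances live in the external sources, and the ontology only says which concepts exist and how they may be related.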


The TAMBIS ontology is expressed in a description logic (DL) [7], a type of knowledge representation language for describing ontologies [8]. DLs are considered an important formalism for giving a logical underpinning to knowledge representation systems, but they also provide practical reasoning facilities for inferring properties of and relationships between concepts [9]. TAMBIS makes extensive use of these reasoning services.

As well as the traditional isa relationships (e.g., a Motif isa SequenceComponent), there are partitive (describing parts), locative (describing location), and nominative (describing names or labels) relationships. This means that the TAMBIS ontology can describe relationships such as: "Motifs are parts of proteins" and "Organelles are located inside cells." The ontology initially holds only asserted concepts, but these can be combined dynamically via relationships to form new, compositional concepts. These compositional concepts are automatically classified using the reasoning services of the ontology. Such compositional concepts can be made in a post-coordinated manner: that is, the ontology is not a static artifact; users can interact with the ontology to build new concepts, composed of those already in the ontology, and have them checked for consistency and placed at the correct position in the ontology's lattice of concepts. For example, Motif can be combined with Protein using the relationship isComponentOf to form a new concept Protein motif, which is placed as a kind of Motif.

The ontology is a dynamic model in that what is present in the model is the description of potential concepts that can be formed in the domain of molecular biology and bioinformatics. As these new, compositional concepts are described, they are placed automatically within the lattice of existing concepts by the DL reasoning services. For example, the compositional concept Protein motif (see above) is automatically classified as a kind of Motif. This new concept is then available to be re-used in further compositional concepts. Most of the other biological ontologies are static; the TAMBIS ontology is dynamic, built around a collection of concept descriptions and constraints on how they can be composed.

The TAMBIS ontology is described using the DL called Galen Representation and Integration Language (GRAIL) [10]. In GRAIL, a new concept can be defined as follows:
    Base which r_1 f_1 ... r_n f_n

where each r_i is a role name and each f_i a filler concept. Each r_i f_i pair is also known as a criterion. A role is a property of a concept, and the filler of a role is the name or description of the concept that can play the given role. For example, Motif which isComponentOf Protein is a description of a protein motif. Motif and Protein are names of existing concepts, which are acting here as the base concept and a role filler, respectively. The construct isComponentOf Protein is the criterion of Motif in this case.

Description logic ontologies are organized within a subsumption lattice, which captures the isa relationship between two concepts. The fact that one concept is a kind of another can either be asserted as part of the model or inferred by the reasoning system on the basis of the concept descriptions. Figure 7.2 illustrates both forms of subsumption relationship. For example, Motif has been asserted to be a kind of SequenceComponent, and PhosphorylationSite has been asserted to be a kind of Motif. In contrast with the asserted hierarchy, the notion of a Motif that can be found within a protein (Motif which isComponentOf Protein) is inferred to be a kind of Motif, as are the other concepts in the three boxes on the bottom in Figure 7.2. In these cases, the criteria describing the concept are used to infer the classification of these concepts. Wherever C_2 is subsumed by C_1, every instance of C_2 is guaranteed to be an instance of C_1 (e.g., every Motif is a SequenceComponent, and every Motif which isComponentOf Protein is a Motif).

[Figure 7.2: lattice diagram. Legible box labels include PhosphorylationSite; PhosphorylationSite which isComponentOf Protein; Motif which isComponentOf Protein which hasOrganismClassification Species:guppy.]

FIGURE 7.2 Example of subsumption relationships within the ontology. The concepts that have been inserted into the lattice are shaded in the three boxes at the top of the figure. The locations of the unshaded concepts in the lattice have been inferred.


In spite of its inexpressiveness compared with some other DLs [11], the GRAIL representation has a useful property in its ability to describe constraints about when relationships are allowed to be formed. For example, it is true that a Motif is a component of a Biopolymer, but not all motifs are components of all biopolymers. A PhosphorylationSite can be a component of a Protein, but not a component of a Nucleic acid, both of which are Biopolymers. The constraint mechanism allows the TAMBIS model to capture this distinction and thus only allow the description of concepts that are described as being biologically meaningful in terms of the model from which they are built. This allows general queries, such as "find all protein motifs," to be expressed as well as specific queries such as "find phosphorylation motifs upon this protein."

The TAMBIS ontology is supplied as a software component that acts as a server. Other components can ask questions of the knowledge in the ontology component. It is the backbone of the architecture, and other components either directly or indirectly use the ontology. These other components ask questions such as: "is this a concept;" "what are the parents, children, or siblings of this concept;" "which relationships are held by this concept;" and "what is the natural language version of this concept."
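The kinds of questions other components put to the ontology server can be illustrated with a small Python sketch. The class OntologyServer and its method names are hypothetical and do not reflect the actual TAMBIS interface; the natural language question is omitted, and the data cover only a few concepts named in this chapter.

```python
class OntologyServer:
    """Toy stand-in for the ontology component; not the real TAMBIS interface."""

    def __init__(self, parents, relationships):
        self._parents = parents              # concept -> list of parent concepts
        self._relationships = relationships  # concept -> {relationship: concept}

    def is_concept(self, name):
        return name in self._parents

    def parents(self, name):
        return list(self._parents[name])

    def children(self, name):
        return sorted(c for c, ps in self._parents.items() if name in ps)

    def siblings(self, name):
        # Siblings share at least one parent with `name`.
        shared = {s for p in self._parents[name] for s in self.children(p)}
        return sorted(shared - {name})

    def relationships(self, name):
        return dict(self._relationships.get(name, {}))


server = OntologyServer(
    parents={
        "sequence component": [],
        "motif": ["sequence component"],
        "site": ["motif"],
        "protein": [],
        "accession number": [],
    },
    relationships={
        "motif": {
            "isComponentOf": "protein",
            "hasAccessionNumber": "accession number",
        }
    },
)
```

Keeping every question behind one object mirrors the description of the ontology as the backbone that other components consult, although how TAMBIS exposes these operations is not stated in the text.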

7.2 THE USER INTERFACE

This section describes the user interface to TAMBIS. The interface supports users in carrying out two principal tasks: exploring the ontology and constructing queries, which are described in Sections 7.2.1 and 7.2.2, respectively.

7.2.1 Exploring the Ontology


Although the full TAMBIS ontology contains approximately 1800 concepts, the version of the ontology used in the online system for querying contains approximately 250 concepts. This model concentrates on proteins and enzymes; it describes features such as functions, processes, motifs, and structure. In this and following sections, examples are based on this smaller ontology.

The main window of the TAMBIS system is shown in Figure 7.3. The main window is used to launch exploration or query building tasks. A concept name is either typed into the find field directly or obtained from the list of Bookmarks. This concept can be used as the starting point for either model exploration or query building by selecting Explore or New query, respectively. If Explore is selected in Figure 7.3, the explorer window depicted in Figure 7.4 is launched.


FIGURE 7.3 The TAMBIS main window.

FIGURE 7.4 The explorer window showing motif with all types of relations it has with other concepts.


Relationship               Concept
hasAccessionNumber         accession number
isComponentOf              protein
indicatesFunction          biological function
isAssociatedWithProcess    biological process
hasModification            molecular modification

TABLE 7.1 The relationships from motif to other concepts in the TAMBIS model.

The window in Figure 7.4 shows the basic concept description facilities of the model browser. Concepts are shown as buttons; the buttons usually have a title that describes the relationship to the central or focus concept, which has no title itself. The button color also indicates the relationship the button has to the central concept, although this is not evident in the monochrome screenshot.

Figure 7.4 shows all the relationships of motif. The parent and child concepts are sequence component and site, respectively. The relationships other than is-a-kind-of are shown in the lighter area of the figure. The name of the relationship appears as the button title, and the name of the concept to which the relationship links is the button label. For example, a relationship button title is hasAccessionNumber and a concept button label is accession number. Table 7.1 shows these relationships for motif. The user can explore the is-a-kind-of hierarchy or the other relationships by clicking on the buttons representing the concepts to which motif is related.

The explorer uses a pie-chart view of the ontology, with different sectors showing the parents, children, definitions, and other relationships. In the TAMBIS ontology, some concepts have a large number of members in one sector, far more than can be shown at any one time. Rather than cramping the view of related concepts, the sectors are scrollable, allowing controlled viewing of the ontology's contents. Clicking on a concept button that is not the focus causes that button to become the new focus. Thus, a user can move up and down the taxonomy and across the taxonomies by following other relationships. Larger jumps may be made within the model by using a go to function.

7.2.2 Constructing Queries


Queries in TAMBIS are essentially concept descriptions. Thus, the task of query formulation involves the user in constructing a concept that describes the information of interest. An example query is illustrated in Figure 7.5, which is a screenshot of the query builder window containing a request for the motifs that are components of guppy proteins.

FIGURE 7.5 A query builder window containing the concept describing motifs in guppy proteins.

The equivalent GRAIL concept is:

    motif which isComponentOf
        protein which hasOrganismClassification species:guppy
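The nesting in this concept can be made explicit with a short Python sketch. The Concept class and its render method are assumptions introduced for illustration; the rendering simply reproduces the flat "which" notation used above and is not a GRAIL parser or serializer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    base: str
    criteria: tuple = ()  # (role, filler) pairs; each filler is a Concept

    def render(self):
        """Render as: base which role filler, recursing into each filler."""
        text = self.base
        for role, filler in self.criteria:
            text += f" which {role} {filler.render()}"
        return text


# The filler of isComponentOf is itself a composite concept (guppy proteins).
guppy = Concept("species:guppy")
guppy_protein = Concept("protein", (("hasOrganismClassification", guppy),))
query = Concept("motif", (("isComponentOf", guppy_protein),))
```

Here query.render() gives the one-line form of the concept shown above. The flat text hides the fact that the species restriction applies to the protein rather than to the motif, which the nested objects keep unambiguous.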

As its name indicates, the query builder window is used for building descriptions of biological concepts that act as queries. One of the buttons along the bottom of the window, Submit, is used to ask TAMBIS to process the query and collect the results. Part of a results page for this query is given in Figure 7.6. The results shown in Figure 7.6 are the values for the base concept of the query (i.e., the properties of the motifs that are components of guppy proteins). This set of results contains only the base concept (Motif); other concepts may be included in the results, with the relationships between the different instances maintained, via the query builder. The pop-up menu on a concept button contains an option include in results. Selecting this option causes the concept button to be highlighted in the query builder.

FIGURE 7.6 Part of the results page that fulfills the description shown in Figure 7.5.

Given that a query is of the form:
    Base which r_1 f_1 ... r_n f_n


where each r_i is a role name and each f_i a filler concept, the query builder essentially supports:

1. The specialization or generalization of the base or filler concepts
2. The addition or removal of criteria associated with a composite concept

This incremental concept construction and modification is possible because of the dynamic model supported by the ontology. In fact, the query interface is driven directly from the model, and the reasoning services are used extensively during query construction to present appropriate options to users and for validating the concepts constructed. The knowledge held in the ontology is used to guide the user through the query building process by offering only appropriate possibilities for modifying a query at each stage [12]. As new concepts are formed and classified, new criteria become available and others are lost as potential additions to the growing concept. Support for these query construction operations is illustrated in the following subsections.
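The two kinds of operation listed above can be sketched as functions on an immutable query value. This is a minimal illustration in Python; the Query class and its method names are assumptions, fillers are kept as plain names for brevity, and none of the ontology-driven validation that TAMBIS performs is included.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Query:
    base: str
    criteria: tuple = ()  # (role, filler) pairs; fillers are plain names here

    def with_base(self, new_base):
        """Specialize or generalize the base concept."""
        return replace(self, base=new_base)

    def add_criterion(self, role, filler):
        """Add a criterion, which narrows the concept."""
        return replace(self, criteria=self.criteria + ((role, filler),))

    def remove_criterion(self, role, filler):
        """Remove a criterion, which widens the concept."""
        kept = tuple(c for c in self.criteria if c != (role, filler))
        return replace(self, criteria=kept)


q0 = Query("motif")
q1 = q0.add_criterion("isComponentOf", "protein")
q2 = q1.with_base("phosphorylation site")
q3 = q2.remove_criterion("isComponentOf", "protein")
```

Returning a new value from each operation, rather than mutating the query, is a design choice made for this sketch; it keeps earlier versions available, which would suit features such as undo or bookmarked queries.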
Replacing Part of a Query

The query builder can be used either to construct a query from scratch or for modifying previous or bookmarked queries. One way of modifying the query is to replace the concept motif with one that is more specific. An example is replacing the concept motif with the more specialized concept phosphorylation site. Figure 7.7 shows a menu associated with concepts in the query builder. Selecting replace with a kind of this causes a new window to appear, as shown in Figure 7.8. This window is the replacer window, which allows a version of the explorer to be used to identify a concept that can be used in the query in place of motif.

When launched, the replacer is focused on the motif concept, on which the query builder had focus. Moving down the hierarchy, through site to modified site, yields the window shown in Figure 7.9. Selecting phosphorylation site and pressing replace it updates the query in the query builder so that the query has the same structure as that in Figure 7.5, but with motif replaced with phosphorylation site (Figure 7.10). The replacer limits the user to the isa hierarchy during replacement. This helps ensure that only valid concepts are created.
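The restriction of the replacer to the isa hierarchy can be sketched as follows. The PARENT table holds only the concepts named in this section, and the functions kinds_of and replace_with_kind are hypothetical names; the sketch assumes that a replacement must lie below the concept being replaced, as the menu option "replace with a kind of this" suggests.

```python
# Asserted is-a links (child -> parent) for the concepts named in the text.
PARENT = {
    "motif": "sequence component",
    "site": "motif",
    "modified site": "site",
    "phosphorylation site": "modified site",
}


def kinds_of(concept):
    """All concepts below `concept` in the is-a hierarchy, nearest first."""
    found, frontier = [], [concept]
    while frontier:
        current = frontier.pop(0)
        children = sorted(c for c, p in PARENT.items() if p == current)
        found.extend(children)
        frontier.extend(children)
    return found


def replace_with_kind(current, replacement):
    """Return the replacement, refusing anything that is not a kind of `current`."""
    if replacement not in kinds_of(current):
        raise ValueError(f"{replacement!r} is not a kind of {current!r}")
    return replacement
```

With this table, kinds_of("motif") lists site, modified site, and phosphorylation site in that order, matching the path followed in Figures 7.8 and 7.9.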
Restricting a Concept

When one concept is joined to another in the query builder with a relationship other than is-a-kind-of, the description of the original concept is restricted. In the example query, motifs are restricted to those that occur in proteins, rather than motifs that can occur in other kinds of molecules. This restriction was added to motif using the restrict by a relationship option illustrated in Figure 7.7. The type of motif to be retrieved can be further restricted by adding another concept to the description of motif. For example, selecting the restrict by a relationship option leads to the user being offered the restrict window shown in Figure 7.11. If the user then selects the hasModification post translational modification check box and accept, the query in the query builder is replaced with that in Figure 7.12. The query is now "retrieve all motifs that bring about post translational modifications in guppy proteins."

FIGURE 7.7 A query builder window showing the pop-up menu invoked by clicking on the topic concept motif.
Nonsensical Questions

The TAMBIS model is designed so that only biologically sensible questions are constructed. Because only is-a-kind-of relationships can be seen in the replacer, the tendency is to have only biologically sensible queries constructed. It is, however, possible to replace a valid concept with one that is biologically nonsense. In that case, the query builder detects the problem by consulting the ontology and informs the user of the error.

FIGURE 7.8 A replacer window centered on the concept motif.

For example, in the previously modified query it would be possible to replace the concept protein with the concept nucleic acid. However, if this replacement is made, TAMBIS notices that in the ontology nucleic acids cannot have phosphorylation sites and changes the color of the offending concept button (nucleic acid in this case) to yellow, indicating that the query is not consistent with the constraints in the ontology.
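The consistency check in this example can be illustrated with a small Python sketch. The table SANCTIONED_FILLERS and the functions offending_fillers and button_colors are hypothetical; the table encodes only the constraint stated in the text (phosphorylation sites occur in proteins, not nucleic acids), and the real check is performed by the ontology's reasoning services rather than by a lookup.

```python
# Hypothetical sanction table: (base concept, relationship) -> allowed fillers.
SANCTIONED_FILLERS = {
    ("motif", "isComponentOf"): {"protein", "nucleic acid"},
    ("phosphorylation site", "isComponentOf"): {"protein"},
}


def offending_fillers(base, criteria):
    """Return the fillers that the sanction table does not allow for `base`."""
    return [
        filler
        for relationship, filler in criteria
        if filler not in SANCTIONED_FILLERS.get((base, relationship), set())
    ]


def button_colors(base, criteria):
    """Map each filler to a display state; 'yellow' marks an inconsistency."""
    bad = set(offending_fillers(base, criteria))
    return {
        filler: ("yellow" if filler in bad else "normal")
        for _, filler in criteria
    }
```

For the query described above, button_colors("phosphorylation site", [("isComponentOf", "nucleic acid")]) marks nucleic acid as yellow, while the same criterion on the more general base motif is accepted.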

7.2.3 The Role of Reasoning in Query Formulation
GRAIL, like other DL implementations, provides a classification or reasoning service, which allows the organization of concept descriptions into subsumption (isa) hierarchies. In the case of DLs, this is most interesting when applied to composite descriptions. In standard taxonomies, the position of each concept is explicitly stated by the modeler. Within TAMBIS, through the use of the ontology server, the position of composite concept descriptions can be determined by the reasoner. This is of particular importance when new, previously unseen descriptions are introduced into the model, particularly when a user forms a new concept to ask a query.

FIGURE 7.9 A replacer window centered on the concept modified site with the pointer about to select the concept phosphorylation site.

The basic classification hierarchy can be used to navigate through the existing descriptions in the model (e.g., using the explorer). More interesting, however, is TAMBIS's ability to support the formation of new, composite query expressions (through the use of the query builder).

TAMBIS uses a constraint mechanism known as sanctioning to drive the query builder user interface [10]. Information included in the ontology specifies the compositions that may be formed, and this in turn determines the specialization options that may be applied to a query. This type of constraint mechanism is unusual in DLs, although such constraints naturally form part of many frame-based knowledge representation languages. These constraints are nevertheless important in describing what concepts are allowed to be formed within the ontology.

FIGURE 7.10 A query builder window showing the query with motif replaced by phosphorylation site.

For i f may For example, example, the the concept concept Mot Motif may be be restricted restricted or or specialized specialized through through issmC as sii fc ia cation or a number of a number of relationships relationships including including hasOrgan hasOrgani m Cl la ss fi tion or in inFor each of these relationships, the allowable values dicatesFunc t i on. d i c a t e s F u n c t i o n . For each of these relationships, the allowable values are are constrained model. For constrained by by the the values values of of the the sanctions sanctions in in the the model. For example, example, hasOr hasOrgan issmClas sii ia cation gani mClass ff ic t i o n can can only only be be filled filled with with the the concept concept kingdom kingdom (or (or one one of of its its subclasses) subclasses).. It It would would be be an an onerous onerous task task to to specify specify explicitly explicitly the the potential potential values values for for any any combination, combination, so so to to minimize minimize the the information information required, required, sanctions sanctions are are inherited inherited down down the the classification classification hierarchy hierarchy in in the the model. model. Thus, Thus, the the sanctioning sanctioning information information can added sparsely. up, its can be be added sparsely. As As a a query query is is gradually gradually built built up, its position position in in the the classifi classification cation hierarchy hierarchy will will change, change, leading leading to to changes changes in in the the restriction restriction options options offered. offered. The The reasoner reasoner is is key key to to this this process process because because it it is is used used to to determine determine the the appropriate appropriate

7.3 7.3

The Query Query Processor Processor

205 205
species pfOleln molif

R es l ricl by a felalionship . .
accession number

I!lIiI f3

biomolecular process

biological function

post translational modification B.ccept ancel !::!elp

M'arning. Applet Window


7 .1 1 7.11

FIGURE

The i f, showing The restrict restrict window window for for mot motif, showing the the relationships relationships to to other other concepts concepts that that description of lies on can can be be used used to to restrict restrict the the description of motif. motif. The The cursor cursor lies on the the hasModi hasModi- f i cat i on post lational modi f i cation check fication post trans translational modification check box. box.

position description and and thus, thus, the potential restrictions. position of of a a query query description the potential restrictions. As As the the query query is constructed, the interface communicates with the ontology server, updating is constructed, the interface communicates with the ontology server, updating the the restrictions offered also be restrictions offered to to the the user. user. The The constraints constraints or or sanctions sanctions can can also be viewed viewed through the explorer; the through the explorer; the relationships relationships shown shown are are exactly exactly those those that that can can be be used used for for specialization specialization or or restriction restriction of of the the concept concept in in a a query. query.
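The inheritance of sanctions down the classification hierarchy can be sketched in a few lines of Python. This is only an illustration, not the TAMBIS implementation: the hierarchy, the sanction table, and the function name `allowed_restrictions` are hypothetical, and the fillers other than kingdom for hasOrganismClassification are assumed for the example.

```python
# Hypothetical classification hierarchy: child concept -> parent concept.
PARENT = {
    "phosphorylation site": "modified site",
    "modified site": "motif",
    "motif": None,
}

# Sanctions are stated sparsely: concept -> {relationship: allowed filler}.
# Only the hasOrganismClassification/kingdom pair comes from the text;
# the other entries are assumptions made for illustration.
SANCTIONS = {
    "motif": {
        "hasOrganismClassification": "kingdom",
        "indicatesFunction": "biological function",
    },
    "modified site": {
        "hasModification": "post translational modification",
    },
}


def allowed_restrictions(concept):
    """Collect the sanctions stated on a concept and on all of its ancestors."""
    result = {}
    while concept is not None:
        for relationship, filler in SANCTIONS.get(concept, {}).items():
            # A sanction stated lower in the hierarchy takes precedence.
            result.setdefault(relationship, filler)
        concept = PARENT[concept]
    return result


# A specialized concept offers its own sanctions plus the inherited ones.
options = allowed_restrictions("phosphorylation site")
```

As a query is specialized from motif to phosphorylation site, the set of restriction options grows, even though no sanction is stated directly on phosphorylation site.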

7.3 THE QUERY PROCESSOR


The query processor converts a source-independent declarative GRAIL query into a source-specific execution plan expressed in CPL [4]. CPL allows the concise expression of retrieval requests over collections of data, with data types for representing arbitrarily nested sets, bags, lists, records, and variants. The principal components of the query processor are the wrappers, the sources and services model (SSM), and the planner.

FIGURE 7.12 The query builder containing the example query with an extra restriction on the topic concept motif. Note the lines indicating the relationship between the concepts.

7.3.1 The Sources and Services Model


The SSM stores the relationships between the concepts and roles in the ontology and the functions used to wrap sources in CPL. In the SSM, the ontology is used to index the CPL functions used to evaluate queries written in terms of the ontology.


The SSM contains descriptions of three broad categories of information: iterators that retrieve instances of concepts from sources, role evaluators that retrieve or compute values for the roles of instances, and filters that are used to discard instances not relevant to the query.

Each such description of a CPL function in the SSM includes its name, the types of its arguments, the type of its result, some information on the cost of computing the function, and the source accessed by the function.

There are seven categories of mapping information supported within the SSM, which are described in detail in a paper in the Proceedings of the 11th International Conference on Scientific and Statistical Data Management [13]. Four of these categories are described here to illustrate how the query processor works:
1. Iteration: Iteration allows the instances of a concept to be retrieved from a source. For example, the fact that the instances or individuals of Protein can be obtained from Swiss-Prot is represented by associating the concept Protein with the function get-all-sp-entries, which has no input arguments and which returns results of type protein_record. Given a query in which the instances of protein are required, this SSM entry could be used to retrieve proteins from Swiss-Prot using a function call such as:

\p <- get-all-sp-entries()

If an alternative, more specialized source of protein information is available, for example, from a database of enzymes (any protein that acts as a catalyst is an enzyme), then an additional SSM entry can be created to indicate this. In fact, there is a source called ENZYME that stores descriptions of enzymes, and thus, there is an SSM entry associating the concept Protein which hasFunction catalysis with a function get-all-enzyme-entries that supports iteration over the entries in the ENZYME database. During query processing, the planner uses the most specialized source of information available to answer a query. If there are several sources of the same information (e.g., if there is more than one protein source), this must be handled within the wrappers. This restriction within the first version of TAMBIS is to be relieved in future versions of TAMBIS (see Section 7.5).
2. Roles: Roles: Roles Roles allow allow the the evaluation evaluation of of a a role role in in an an instance instance to to obtain obtain a a value value

siionNumber for for its its filler. filler. For For example, example, it it is is possible possible to to obtain obtain the the Acces Access o n N u m b e r of of a a protein protein given given the the Protein. P r o t e i n . This This is is represented represented in in the the SSM SSM by by the the asso assoch hasAc c e s s ionNumber Acc es ciation ciation of of the the concept concept Protein Protein whi which hasAccessionNumber Access ionNumber with -ac - from- sp- entry, which sionNumber with the the function {unction get get-ac-from-sp-entry, which takes takes as as argument argument a a value value of of type type protein_record p r o t e i n r e c o r d and and returns returns a a value value of of type type ac ce ess does not directly access source, but acc ss i 2 on_number. on_number. This This does not itself itself directly access a a source, but rather rather it it

208

Complex Query Formulation Over Diverse Information Sources in TAMBIS

accesses a a data data structure structure retrieved retrieved from from a a source source by by some some other other function function (such (such accesses get -sp described previously). previously). This This SSM SSM entry entry could could be be as e t -a al l l -l sp e n-tentries ries described as g used number from using a used to to retrieve retrieve the the accession accession number from a a Swiss-Prot Swiss-Prot entry entry using a function function call such such as: as: call
\accno << - get-ac-from-sp-entry(p) get -a c - f rorn- sp - entry ( p ) \accno
3. Mapped Roles: Mapped roles are roles in which the concept provided as the role filler can be used to select instances of the base concept from a source. For example, instances of the concept Protein which hasOrganismClassification Species: guppy can be retrieved from Swiss-Prot by retrieving entries with guppy in their organism species field. This SSM entry could be used to retrieve Swiss-Prot entries using a function call such as:

\p <- get-sp-entries-by-os("guppy")

where p is a variable previously bound to a protein_record.

4. Filters: When instances of a concept have been retrieved, for example, by iteration, other criteria in the query may be used to discard some of the instances. For example, given an instance of Protein in the query Protein which hasFunction Hydrolase, the instance of Protein must be checked to see if it hasFunction Hydrolase. The relevant SSM entry could be used to generate code that tests a Swiss-Prot record for the function hydrolase using a function call such as:

check-sp-entry-for-hydrolase(p)

where p is a variable previously bound to a protein_record.

The filter entries in the SSM are used to select values with the required characteristics at the client (i.e., values are retrieved from sources and then checked to see if they meet the needs of the query). In general, it is desirable to have the sources retrieve only values that are relevant. Mapped roles provide one way of sending filters to the sources to be applied as early as possible in the retrieval process. Unfortunately, at the time of writing, many sources did not offer query interfaces that allowed all filtering to be carried out early in the query process. This left much client-side filtering to take place.
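The four categories of SSM entry described above can be pictured as a small registry that indexes functions by ontology expression. The following Python sketch is illustrative only: the `SSMEntry` class, the `lookup` helper, the cost figures, and the toy records standing in for Swiss-Prot are all assumptions; only the function names (with hyphens replaced by underscores) and the type names come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Toy stand-in for Swiss-Prot; accession numbers and contents are invented.
SWISS_PROT = [
    {"ac": "X00001", "os": "guppy", "function": "hydrolase"},
    {"ac": "X00002", "os": "guppy", "function": "kinase"},
    {"ac": "X00003", "os": "zebrafish", "function": "hydrolase"},
]


def get_all_sp_entries():
    return list(SWISS_PROT)


def get_ac_from_sp_entry(p):
    return p["ac"]


def get_sp_entries_by_os(species):
    return [p for p in SWISS_PROT if p["os"] == species]


def check_sp_entry_for_hydrolase(p):
    return p["function"] == "hydrolase"


@dataclass(frozen=True)
class SSMEntry:
    category: str               # "iteration", "role", "mapped_role" or "filter"
    concept: str                # ontology expression that indexes the function
    function: Callable          # stand-in for the wrapped CPL function
    arg_types: Tuple[str, ...]
    result_type: str
    cost: float                 # invented relative cost
    source: str


SSM = [
    SSMEntry("iteration", "Protein", get_all_sp_entries,
             (), "protein_record", 100.0, "Swiss-Prot"),
    SSMEntry("role", "Protein which hasAccessionNumber AccessionNumber",
             get_ac_from_sp_entry, ("protein_record",), "accession_number",
             1.0, "Swiss-Prot"),
    SSMEntry("mapped_role", "Protein which hasOrganismClassification Species",
             get_sp_entries_by_os, ("string",), "protein_record",
             10.0, "Swiss-Prot"),
    SSMEntry("filter", "Protein which hasFunction Hydrolase",
             check_sp_entry_for_hydrolase, ("protein_record",), "boolean",
             1.0, "Swiss-Prot"),
]


def lookup(category, concept):
    """Return the SSM entries of one category indexed under a concept."""
    return [e for e in SSM if e.category == category and e.concept == concept]


iterate = lookup("iteration", "Protein")[0].function
is_hydrolase = lookup("filter", "Protein which hasFunction Hydrolase")[0].function
by_species = lookup(
    "mapped_role", "Protein which hasOrganismClassification Species")[0].function
get_ac = lookup(
    "role", "Protein which hasAccessionNumber AccessionNumber")[0].function

# Client-side filtering: retrieve everything, then discard.
client_side = [p for p in iterate() if p["os"] == "guppy" and is_hydrolase(p)]

# Mapped role: the species restriction is applied at the source.
pushed_down = [p for p in by_species("guppy") if is_hydrolase(p)]

accessions = [get_ac(p) for p in pushed_down]
```

Both strategies return the same records; the mapped role simply moves one restriction to the source so that fewer records reach the client.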

7.3.2 The Query Planner


GRAIL queries are declarative, in that the meaning of a query is not dependent on the order of evaluation of its components. As a result, the TAMBIS system, and not the user, must take responsibility for identifying an efficient evaluation order for the components of a GRAIL query. This section describes how GRAIL queries are represented internally for the purposes of optimization and how this internal representation is generated.

GRAIL queries are intrinsically nested structures. The query internal form (QIF) used in TAMBIS can be seen as an un-nested representation of the original GRAIL query. This representation has been developed to allow easier reordering of the components of a query in the planner. The QIF is a list of query components, an example of which is given in Figure 7.13 for the running example GRAIL query:

Motif which isComponentOf Protein which hasOrganismClassification Species: guppy

The query is represented by two query components, one representing the Motif and the other representing the Protein. Each of the components stores the name of the base concept, a list of the criteria from the query, the name of the CPL variable used to hold values retrieved from sources, and details of the technique identified by the planner for retrieving instances of the concept and of the criteria used during retrieval. The values for theTechnique and theFetchCriterion are identified during planning. Generation of the QIF from a GRAIL query is straightforward and is carried out in a single pass over the query.

    <
      name: Motif
      theCriteria: <
        theCriterion: isComponentOf
        relatedComponent: component of protein-1
        userValue: ""
      >
      theVariable: motif-1
      theTechnique: ""
      theFetchCriterion: null
    >
    <
      name: Protein
      theCriteria: <
        theCriterion: hasOrganismClassification Species
        relatedComponent: null
        userValue: "guppy"
      >
      theVariable: Protein-1
      theTechnique: ""
      theFetchCriterion: null
    >

FIGURE 7.13 QIF for example query.

Given the QIF for a query, a search algorithm seeks to identify efficient ways to evaluate the query given the functions available in the SSM. The search algorithm exploits the augmentation heuristic [14], which was selected because it is straightforward to implement and provides a reasonable tradeoff between cost of optimization and quality of plan generated. The algorithm is given in Figure 7.14. The basic strategy is to generate a plan as an ordered list of query components in which the first component in the list is predicted to be the least costly component to evaluate from scratch, and the subsequent components are the least costly to evaluate given what has previously been evaluated.

    input:
      query: List of QueryComponent
      finalPlan: List of QueryComponent

    while query <> [] do
      bestQC := findBest(query)
      finalPlan := finalPlan ++ bestQC
      query := query -- bestQC
    end
    return finalPlan

FIGURE 7.14 The optimization algorithm.
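The greedy loop of Figure 7.14 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the TAMBIS planner: the figure passes only the query to findBest, whereas here `find_best` also receives the components planned so far and a cost function, and `example_cost` with its numbers is entirely invented to make the example concrete.

```python
def find_best(query, planned, cost):
    """Pick the remaining component that is cheapest given what is planned."""
    return min(query, key=lambda component: cost(component, planned))


def plan(query, cost):
    """Greedy ordering: repeatedly move the cheapest component to the plan."""
    query = list(query)          # work on a copy of the input list
    final_plan = []
    while query:
        best = find_best(query, final_plan, cost)
        final_plan.append(best)
        query.remove(best)
    return final_plan


def example_cost(component, planned):
    """Invented costs: motifs are cheap only once proteins are available."""
    if component == "Protein":
        return 10.0
    if component == "Motif":
        return 1.0 if "Protein" in planned else 1000.0
    raise ValueError(f"unknown component: {component}")


ordered = plan(["Motif", "Protein"], example_cost)
```

With these assumed costs, Protein is planned first because retrieving motifs from scratch is predicted to be expensive, after which Motif becomes cheap to evaluate from the bound proteins.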
The optimization algorithm in Figure 7.14 depends heavily on the definition of the findBest function. This function, given a query component, considers a variety of ways in which instances of the component can be retrieved from sources. Thus findBest considers the alternative ways of mapping the components of a QIF onto CPL functions, using the entries in the SSM. For example, the CPL generated for the example query is:

{motif-1 | \protein-1 <- get-sp-entry-by-os("guppy"),
           \motif-1 <- do-prosite-scan-by-entry-rec(protein-1)}

This query contains two query components, one for Protein and the other for Motif, as illustrated in Figure 7.13. The query component for Protein is chosen for evaluation first, and the SSM entry used to obtain instances of Protein is the role that is the inverse of hasOrganismClassification. Thus, the query processor accesses the ontology to find the inverse of hasOrganismClassification, which is the role hasProteins on Species. This role has a roles entry in the SSM, which is associated with the function get-sp-entry-by-os. This function, given the name of a Species, consults Swiss-Prot to find the proteins from the species. The second query component is evaluated in a similar manner, using the inverse of the role isComponentOf.

The output from the planner is a QIF annotated with details of how to retrieve its components. Generating the corresponding CPL program involves a single pass through the QIF. For each QIF component, the code generator writes out the CPL functions identified by the planner and iterates over the component's other criteria, writing out function calls associated with roles and filters as required.
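The single pass from an annotated QIF to CPL text can be sketched as below. The dictionary layout (`variable`, `technique`, `argument`, `other_calls`) and the function `generate_cpl` are hypothetical simplifications of the QIF fields, chosen only to show the shape of the pass; the CPL function names are those quoted in the text.

```python
def generate_cpl(projection, annotated_qif):
    """Emit one generator per component, then any extra role/filter calls."""
    generators = []
    for component in annotated_qif:
        call = component["technique"] + "(" + component["argument"] + ")"
        generators.append("\\" + component["variable"] + " <- " + call)
        # Remaining criteria become further calls, written out in order.
        generators.extend(component.get("other_calls", []))
    return "{" + projection + " | " + ", ".join(generators) + "}"


# The plan for the running example, in the order chosen by the planner.
annotated_qif = [
    {"variable": "protein-1",
     "technique": "get-sp-entry-by-os",
     "argument": '"guppy"'},
    {"variable": "motif-1",
     "technique": "do-prosite-scan-by-entry-rec",
     "argument": "protein-1"},
]

cpl_text = generate_cpl("motif-1", annotated_qif)
print(cpl_text)
```

For the two-component plan above, the generated text has the same structure as the CPL shown for the example query, written on one line.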

7.3.3 The Wrappers


The distribution and heterogeneity within bioinformatics resources mean that many applications need to employ wrappers. Wrappers incorporate external resources into a system, enabling each resource to adopt the same operating paradigms as the host system and transforming it to common syntactic and semantic conventions. Many applications perform this wrapping on an ad hoc basis, using the resources available within many programming languages. Kleisli (presented in Chapter 6) is one of the few systems to offer wrapper services together with a query language that is flexible enough to cope with bioinformatics resources. The output from the TAMBIS system is a query plan written in CPL using a modified version of the BioKleisli library of biological database wrappers [15]. An example CPL query, which "retrieves all motifs in guppy proteins," is as follows:

{m | \p <- get-sp-entry-by-os("guppy"),
     \m <- do-prosite-scan-by-entry-rec(p)}

In the query, the part before the | is the projection expression, which, in this case, indicates that only the motifs m are of interest. The two function calls in the body of the query to the right of the | are generators, which retrieve values from distinct, wrapped sources. The first line in the query body indicates that the new variable p is to be bound to each of the values that result from the evaluation of the function get-sp-entry-by-os with the parameter guppy. The function name can be read as get Swiss-Prot entry by organism species, although this is just a name; the structure of the name is not significant in itself. The second function call binds the variable m to each of the motifs of the proteins bound to p. The function name can be read as scan the prosite database for motifs in the given protein record.

The CPL system is supplied with function libraries that provide access to a range of bioinformatics sources of different types (e.g., databases, analysis tools [15]). TAMBIS uses these libraries and a number developed to provide a function-based view of the sources. The public release of TAMBIS 1.0 accessed five sources and used a total of approximately 300 CPL functions. CPL can be seen as providing syntactically consistent, but not source transparent, access to the sources, and thus, CPL can be viewed as a wrapping mechanism tightly coupled with convenient language facilities for accumulating and transmitting results from different sources.
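Readers more familiar with general-purpose languages may find it helpful to compare the CPL comprehension with a Python set comprehension, in which each generator becomes a `for` clause. The two wrapper functions below are stand-ins operating on invented in-memory data; they are not the BioKleisli wrappers, and the motif identifiers are made up.

```python
# Invented records standing in for wrapped Swiss-Prot and PROSITE access.
PROTEINS = [
    {"ac": "X00001", "os": "guppy", "motifs": ["MOTIF_A", "MOTIF_B"]},
    {"ac": "X00002", "os": "guppy", "motifs": ["MOTIF_B"]},
    {"ac": "X00003", "os": "zebrafish", "motifs": ["MOTIF_C"]},
]


def get_sp_entry_by_os(species):
    """Stand-in generator: protein records for one organism species."""
    return [p for p in PROTEINS if p["os"] == species]


def do_prosite_scan_by_entry_rec(protein_record):
    """Stand-in generator: motifs found in one protein record."""
    return list(protein_record["motifs"])


# {m | \p <- get-sp-entry-by-os("guppy"), \m <- do-prosite-scan-by-entry-rec(p)}
motifs = {
    m
    for p in get_sp_entry_by_os("guppy")
    for m in do_prosite_scan_by_entry_rec(p)
}
```

The projection expression corresponds to the leading `m`, and the second generator depends on the variable bound by the first, just as in the CPL query.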

Remarks on Handling Syntactic and Semantic Heterogeneity

The heterogeneity in the bioinformatics resources is handled within this wrapper layer and the SSM. The wrapper layer irons out much of the structural or syntactic heterogeneity, providing a consistent call interface in terms of level of abstraction and services to each of the resources. For example, all CPL functions return sets of data, regardless of the number of instances returned. This means only one operator ever needs to be used to manipulate the results of a query. Any heterogeneity in encoding, such as representation of amino acid sequences, can also be dealt with at this level. The wrapper layer also gives an opportunity for standardization of naming conventions for services available in the resources, though this is of no consequence to users, except that they may find the CPL query plan useful as a quality check on the task TAMBIS is performing.

The SSM affords the main opportunity for the reconciliation of semantic heterogeneity. The ontology gives the user a global schema against which to form queries. The SSM allows terms seen in this global schema to be mapped to the values used in the various resources. For instance, the concept PhosphorylationSite corresponds to the motif entry PS00001 in the PROSITE databank. Similarly, the concept Kinase maps to node 2.7.*.* in the ENZYME databank, but to the term kinase in the Swiss-Prot databank. The SSM can match filler mappings to the appropriate function mapping via the databank attribute in SSM objects. In this manner, the terms in the ontology may be mapped to different terms appearing in the resources.
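The two ideas in this passage, wrappers that always return sets and a per-databank mapping from ontology concepts to source terms, can be illustrated with a minimal Python sketch. It is not the TAMBIS implementation: the names `FILLER_MAPPINGS`, `map_concept`, `entries_with_keyword`, and the toy databank contents are all hypothetical, and only the three concept-to-term pairs come from the text.

```python
from typing import Dict, Set, Tuple

# Hypothetical stand-in for SSM filler mappings: (concept, databank) -> source term.
# The three pairs are the examples from the text; the dictionary layout is an assumption.
FILLER_MAPPINGS: Dict[Tuple[str, str], str] = {
    ("PhosphorylationSite", "PROSITE"): "PS00001",
    ("Kinase", "ENZYME"): "2.7.*.*",
    ("Kinase", "Swiss-Prot"): "kinase",
}


def map_concept(concept: str, databank: str) -> str:
    """Translate an ontology concept into the term used by one databank."""
    return FILLER_MAPPINGS[(concept, databank)]


# Invented in-memory data standing in for a wrapped resource.
TOY_SWISSPROT: Dict[str, Set[str]] = {
    "P00001": {"kinase", "transferase"},
    "P00002": {"hydrolase"},
    "P00003": {"kinase"},
}


def entries_with_keyword(keyword: str) -> Set[str]:
    """Wrapper-style call: the result is always a set, for 0, 1, or many matches."""
    return {acc for acc, keywords in TOY_SWISSPROT.items() if keyword in keywords}


if __name__ == "__main__":
    term = map_concept("Kinase", "Swiss-Prot")
    print(sorted(entries_with_keyword(term)))      # several matches
    print(sorted(entries_with_keyword("ligase")))  # no match, still a set
```

Because every call returns a set, the caller can combine results with the same set operators whether a query matched nothing, one entry, or many.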


7.4 RELATED WORK

7.4.1 Information Integration in Bioinformatics
The difficulties associated with obtaining effective access to multiple biological information resources have long been recognized, and several different approaches have been proposed, making use of widely varying underlying technologies.

Probably the most widely used source integration environment for bioinformatics resources is the Sequence Retrieval System (SRS) [16] (presented in Chapter 5). SRS is a system designed to integrate flat file databanks, which are the most common data storage form used for bioinformatics resources. SRS has its own proprietary data description and processing language. This is used to parse the flat file entries and create indices over fields and their contents. SRS has a query language for selecting entries or parts of entries via Boolean combinations of indexed fields and their values. The language contains operators that can take advantage of the heavy cross-linking between different databanks.
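The general technique of indexing flat file fields and selecting entries by Boolean combinations of field values can be sketched as follows. This is a generic Python illustration, not SRS's proprietary language: the line codes, entry contents, and function names are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Invented entries with two-letter line codes, loosely in the style of flat file databanks.
FLAT_FILE = """\
ID   ENTRY1
KW   kinase
OS   human
//
ID   ENTRY2
KW   kinase
OS   mouse
//
ID   ENTRY3
KW   protease
OS   human
//
"""


def parse_entries(text: str) -> List[Dict[str, Set[str]]]:
    """Split the flat file into entries; each entry maps a field code to its values."""
    entries: List[Dict[str, Set[str]]] = []
    current: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            entries.append(dict(current))
            current = defaultdict(set)
        elif line.strip():
            code, value = line[:2], line[2:].strip()
            current[code].add(value)
    return entries


def build_index(entries: List[Dict[str, Set[str]]]) -> Dict[Tuple[str, str], Set[str]]:
    """Build an index from (field, value) to the IDs of the entries containing it."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for entry in entries:
        (entry_id,) = entry["ID"]  # each toy entry has exactly one ID line
        for code, values in entry.items():
            for value in values:
                index[(code, value)].add(entry_id)
    return index


INDEX = build_index(parse_entries(FLAT_FILE))


def lookup(field: str, value: str) -> Set[str]:
    """Return the IDs of entries whose indexed field has the given value."""
    return set(INDEX.get((field, value), set()))


if __name__ == "__main__":
    # Boolean combinations map onto set operators: AND (&), OR (|), NOT (-).
    print(lookup("KW", "kinase") & lookup("OS", "human"))
    print(lookup("OS", "human") - lookup("KW", "kinase"))
```

Once each indexed field value resolves to a set of entry identifiers, Boolean queries reduce to set intersection, union, and difference.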
SRS is usually accessed via a Web-based interface behind which the construction of queries is hidden. The Web interface also offers supplementary analyses such as similarity and pattern scans over protein or nucleic acid sequences. SRS makes no attempt to reconcile any semantic heterogeneity between the different resources during query execution. Once results have been retrieved, the user can follow hyperlinks between entries, and much use is made of this query-by-navigation style.

Although SRS is successful at providing navigational access between diverse resources, it provides limited facilities to support querying or programming over diverse sources. Several proposals have been made in these directions. In terms of query-oriented access, Kleisli [15] provides both a query language for ranging over data types described using a rich hierarchical data model and a collection of wrappers (known in Kleisli as drivers) for accessing biological resources.
However, However, Kleisli has no schema providing model of thus can Kleisli has no global global schema providing a a model of the the available available data data and and thus can be be seen seen as as providing providing lower-level lower-level access access to to biological biological resources resources than than TAMBIS. TAMBIS. In In fact, fact, as as already already described, described, TAMBIS TAMBIS generates generates Kleisli Kleisli programs programs as as output. output. Another Another query-oriented OPM) [ 1 7] , in query-oriented approach approach is is provided provided by by the the Object Object Protocol Protocol Model Model ((OPM) [17], in which which queries queries can can be be written written over over an an object-oriented object-oriented global global model model using using an an object object query and tools have been been developed query language, language, and tools have developed to to assist assist in in the the creation creation of of OPM OPM views views over over heterogeneous heterogeneous sources. sources. The The main main factor factor that that differentiates differentiates TAMBIS TAMBIS from from OPM, OPM, from from a a users' users' point point of of view, view, is is that that in in TAMBIS TAMBIS queries queries are are constructed constructed over object model. model. The over an an ontology ontology rather rather than than over over an an object The impact impact of of the the ontol ontology and its reasoning services on query building in TAMBIS has been discussed ogy and its reasoning services on query building in TAMBIS has been discussed in in Section Section 7.2. 7.2. The The ontology ontology shields shields the the user user from from the the query query language language used, used, the the

214

7 7

Com plex Query u l ation Over nformation Sou rces in BIS Complex Query Form Formulation Over Diverse Diverse IInformation Sources in TAM TAMBIS

heterogeneity heterogeneity of of the the resources, resources, and and any any demand demand for for knowledge knowledge of of the the resources. resources. Such Such transparency transparency may may not not be be to to all all users' users' tastes tastes and and more more intricate intricate queries queries or or programs programs could could be be hand-crafted hand-crafted in in systems systems such such as as Kleisli Kleisli or or OPM. OPM. Other Other pro proposals posals describing describing query-based query-based access access from from object object models models to to biological biological data data in include 1 8], DiscoveryLink 1 9] and clude ISYS ISYS [ [18], DiscoveryLink [[19] and PIFDM P/FDM [20] [20].. Kleisli, Kleisli, DiscoveryLink, DiscoveryLink, and and P/FDM 1 , and P/FDM are are respectively respectively presented presented in in Chapters Chapters 6, 6, 1 11, and 9. 9. Considerable Considerable attention attention has has been been given given in in bioinformatics bioinformatics to to wrapping wrapping sources, sources, thereby thereby providing providing syntactically syntactically consistent consistent access access from from programming programming languages languages to to ! diverse diverse resources. resources. The The bioPerl bioPerl initiative initiative I offers offers a a collection collection of of Perl Perl modules modules that that provide provide access access to to computational computational techniques techniques and and data data commonly commonly found found within within bioinformatics resources. In bioinformatics resources. In the the early early stages stages of of the the first first TAMBIS TAMBIS version, version, however, however, there was much interest in using the Common Object Request Broker Architec there was much interest in using the Common Object Request Broker Architec] . CORBA ture CORBA) to resources [21 ture ((CORBA) to wrap wrap bioinformatics bioinformatics resources [21]. 
CORBA allows development of object views of heterogeneous and distributed resources, regardless of their host platform, operating system, or storage paradigm. The use of CORBA within bioinformatics is promoted by the Life Sciences Research (LSR) group of the Object Management Group.² The LSR aims to promote standard descriptions of object interfaces that enable interoperation between distributed bioinformatics resources. Among others, the European Bioinformatics Institute has provided CORBA servers for some of their databases [22]. Recently, access to SRS [16] has been provided through CORBA [23]. This service allows objects representing databank entries to be retrieved through the SRS query language. This should allow remote access to a large number of databanks and analysis programs, along with a rudimentary query facility.
TAMBIS has a very different emphasis from the middleware approaches in that interactive user access is the main emphasis and in that individual sources are essentially hidden from the user in TAMBIS.

Unfortunately, the required large number of consistent, CORBA-wrapped sources did not arrive to be taken advantage of by TAMBIS. The ability to download a description of a service's interface and automatically generate a client that could act as a wrapper was desirable, but not delivered. Many providers balked at the effort needed to provide a CORBA solution to delivering services. Simple Object Access Protocol (SOAP) servers and Web services³ offer a lighter weight solution to delivering bioinformatics services. A SOAP server for a resource is relatively cheap to set up because an object model does not have to be designed and implemented, as in CORBA. The operations available through that server can be described in the Web services description language (WSDL),⁴ and this description can be compiled into a client for the SOAP server. The idea is much the same as that for CORBA, but as a lighter weight solution it relies on simple message passing, not on a heavyweight object approach. These services transfer their data in extensible markup language (XML) and thus can take advantage of widely adopted XML data formats such as the biopolymer markup language [24] and the Bioinformatics Sequence Markup Language (BSML).⁵ XML is also seen as the data format of choice by the Interoperable Informatics Infrastructure Consortium (I3C),⁶ which aims to promote standards for protocols and exchange formats. The distributed annotation system (DAS)⁷ uses many of these ideas to manage sequence annotations distributed around the network, and delivered by SOAP servers providing an XML description of sequence annotations that allows many annotators to form an integrated, yet varied, view on the biological sequence.

These technologies offer a middleware solution to the integration of bioinformatics resources. Vital though such technologies undoubtedly are, they can be seen as plumbing resources together. Choice of resources, locating those resources, knowing how to reconcile their view of the data, and the order in which to use them are still left up to the user of these technologies. TAMBIS, on the other hand, sits upon these middleware technologies and uses the ontology to offer full transparency in query management to the user.

1. Information about the bioPerl initiative is available at http://bioperl.org.
2. Go to http://www.omg.org/lsr for information on the Life Sciences Research effort of the Object Management Group.
3. The Simple Object Access Protocol (SOAP) and Web service protocols are available at http://www.w3c.org/soap.
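The idea that several servers each deliver an XML description of annotations, which a client then combines into one view of a sequence, can be sketched with Python's standard XML parser. The element and attribute names below are invented for illustration and are not the actual DAS or BSML formats; the two documents stand in for responses that would normally be fetched over the network.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Invented XML documents standing in for the responses of two annotation servers.
SERVER_A = """<annotations sequence="seq1">
  <feature type="exon" start="100" end="250"/>
  <feature type="exon" start="400" end="520"/>
</annotations>"""

SERVER_B = """<annotations sequence="seq1">
  <feature type="binding_site" start="180" end="190"/>
</annotations>"""

# (start, end, feature type, originating server)
Feature = Tuple[int, int, str, str]


def parse_features(xml_text: str, source: str) -> List[Feature]:
    """Extract the features from one server's XML response."""
    root = ET.fromstring(xml_text)
    return [
        (int(f.get("start")), int(f.get("end")), f.get("type"), source)
        for f in root.iter("feature")
    ]


def integrated_view(responses: Dict[str, str]) -> List[Feature]:
    """Merge features from all servers, ordered by position on the sequence."""
    merged: List[Feature] = []
    for source, xml_text in responses.items():
        merged.extend(parse_features(xml_text, source))
    return sorted(merged)


view = integrated_view({"server_a": SERVER_A, "server_b": SERVER_B})

if __name__ == "__main__":
    for start, end, kind, source in view:
        print(f"{start}-{end}\t{kind}\t{source}")
```

Keeping the originating server with each feature preserves the "integrated, yet varied" character of the view: the annotations are merged, but the user can still see who asserted what.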

7.4.2 Knowledge-Based Information Integration
TAMBIS is one of several systems that use a knowledge base as a central component in information integration, although it is the first such system to be used in bioinformatics. A survey of knowledge-based information integration is given in Paton et al.'s article in Information and Software Technology [25]. In common with single interface to multiple sources (SIMS) [26], Information Manifold [27], and Observer [28], TAMBIS uses a description logic [7] to describe the concepts over which queries are to be expressed. A description logic is a modeling notation that supports reasoning over descriptions of concepts and their relationships. Two principal approaches are used in information integration systems to relate concepts in a global schema to the schemas of individual sources, namely global as view and local as view [29]. In the former, the global schema is defined as a view over the constructs in the schemas of the individual sources; in the latter, the constructs in the schemas of the individual sources are defined as a view of those in the global schema. SIMS and Observer essentially use global as view techniques for processing queries, whereas Information Manifold is local as view. TAMBIS follows the global as view approach, but it generally differs from other such approaches in that very few assumptions are made of the query processing capabilities of the individual sources. In fact, as is generally true in bioinformatics, TAMBIS assumes that individual sources lack declarative query interfaces and instead provide rather limited call interfaces, supporting tasks such as iterating through the data items of a particular type or retrieving all data items with a given value for a particular attribute.

A further important feature of TAMBIS is that it supports a distinctive user interface driven from the ontology, which guides the user through the query formulation process in a way that makes it difficult to construct biologically meaningless queries. Other knowledge-based information integration systems lack such sophisticated query formulation interfaces.

4. For more information about the WSDL, refer to http://www.w3.org/TR/wsdl.
5. The BSML is available at http://www.bsml.org.
6. Refer to http://www.i3c.org for information about I3C.
7. Go to http://www.biodas.org for information on biological DAS.
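The global as view approach over sources that offer only limited call interfaces can be illustrated with a small Python sketch. Everything here is hypothetical (the `Source` class, the attribute names, and the data); it only shows a global concept defined as a view over two sources whose interfaces are restricted to iteration and retrieval by attribute value.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


class Source:
    """A source offering only the two call-style operations described in the text."""

    def __init__(self, name: str, records: List[Record]) -> None:
        self.name = name
        self._records = records

    def iterate(self) -> Iterator[Record]:
        """Iterate through all data items held by the source."""
        return iter(self._records)

    def retrieve(self, attribute: str, value: str) -> List[Record]:
        """Return all data items with a given value for one attribute."""
        return [r for r in self._records if r.get(attribute) == value]


# Invented sources that name the same information differently.
SOURCE_A = Source("A", [{"acc": "P1", "organism": "human"},
                        {"acc": "P2", "organism": "mouse"}])
SOURCE_B = Source("B", [{"id": "Q9", "species": "human"}])

# Global as view: the global concept "Protein" is defined as a view over the
# sources, by mapping each global attribute to the local attribute of each source.
GLOBAL_VIEWS: Dict[str, List[Tuple[Source, Dict[str, str]]]] = {
    "Protein": [
        (SOURCE_A, {"accession": "acc", "species": "organism"}),
        (SOURCE_B, {"accession": "id", "species": "species"}),
    ],
}


def to_global(record: Record, mapping: Dict[str, str], source: Source) -> Record:
    """Rename local attributes to global ones and note the originating source."""
    row = {g: record[local] for g, local in mapping.items()}
    row["source"] = source.name
    return row


def query(concept: str, attribute: str, value: str) -> List[Record]:
    """Answer a global selection using only each source's retrieve() call."""
    results: List[Record] = []
    for source, mapping in GLOBAL_VIEWS[concept]:
        for record in source.retrieve(mapping[attribute], value):
            results.append(to_global(record, mapping, source))
    return results


def all_items(concept: str) -> List[Record]:
    """List every instance of a global concept using each source's iterate() call."""
    return [to_global(r, m, s) for s, m in GLOBAL_VIEWS[concept] for r in s.iterate()]
```

For example, `query("Protein", "species", "human")` unfolds the global selection into one `retrieve` call per source and returns records expressed in global attribute names. No declarative query capability is assumed of the sources themselves.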

7.4.3 Biological Ontologies


The number of ontologies used in bioinformatics applications is still quite small, but it is growing. However, where ontologies have been used, they span a wide range of purposes, subject areas, and representation styles [5]. The uses of bio-ontologies fall into two distinct areas: database schema definition (e.g., EcoCyc and RiboWeb) and annotation and communication (e.g., GO and OMB). The TAMBIS ontology adds a third use, ontology-based search and query formulation, to this list. The version of TAMBIS described in this chapter was the first ontology solution based on Description Logic of its type in the bioinformatics arena.

RiboWeb [30] is an ontology of ribosome structure, components, and experimental methods used to drive a Web interface that supports the analysis of ribosomal data. The ontology acts as a schema, driving the acquisition of instances that create the knowledge base. The knowledge held in the ontology also drives the analysis of new data, guiding the user as to which analysis methods are appropriate for the data in hand and indicating results that contradict current knowledge.

EcoCyc [31] uses an ontology to create an encyclopedia of E. coli metabolism, regulation, and signal transduction. As with RiboWeb, this ontology acts as a schema for the knowledge base, capturing the domain knowledge with high fidelity. Both these systems use a frame-based knowledge representation language in which a frame represents a concept and slots within frames represent attributes or roles and their fillers. Such representations can be expressive, hence the richness of the models.


The Ontology for Molecular Biology (OMB) [32] provides a framework for describing computational methods, database representations, and core molecular biological concepts. The OMB is aimed at providing a reference ontology to improve community-wide communication. Data resources would use the OMB to define their classes, relationships, and terms. The OMB uses an object-like structure with an is-a-kind-of hierarchy and extensive use of other relationship types.

The Gene Ontology (GO) [33] is a structured, controlled vocabulary used to annotate gene products for their function, ultimate cellular location, and the processes in which they take part. GO is used in several genomic databases and thus adds consistency across these resources. As a consequence, querying these resources becomes more reliable. The ontology has a simple structure, relying on an is-a-kind-of hierarchy and a sparse partonomy to relate natural language phrases.
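One way an is-a-kind-of hierarchy makes querying annotated resources more reliable is that a query for a general term can also find entries annotated with more specific terms. The Python sketch below illustrates this under stated assumptions: the hierarchy, the gene names, and the annotations are invented for the example and are not copied from GO.

```python
from typing import Dict, Set

# Invented miniature is-a-kind-of hierarchy: term -> its direct parent terms.
IS_A: Dict[str, Set[str]] = {
    "protein kinase activity": {"kinase activity"},
    "kinase activity": {"transferase activity"},
    "transferase activity": {"catalytic activity"},
    "catalytic activity": set(),
}

# Invented annotations: gene product -> terms it is directly annotated with.
ANNOTATIONS: Dict[str, Set[str]] = {
    "geneA": {"protein kinase activity"},
    "geneB": {"transferase activity"},
    "geneC": {"catalytic activity"},
}


def ancestors(term: str) -> Set[str]:
    """Return every term reachable from `term` by following is-a-kind-of links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        for parent in IS_A.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_with(term: str) -> Set[str]:
    """Gene products annotated with `term` or with any more specific term."""
    return {
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    }


if __name__ == "__main__":
    print(sorted(annotated_with("transferase activity")))
```

Here a query for the general term also returns the gene product annotated only with the more specific kinase term, which a plain string match on the annotation would miss.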
The ImMunoGeneTics ontology holds terminology on the areas of immunoglobulins and their genetics. Again, this acts as a controlled vocabulary, but it has a less well-defined structure than GO and appears more like a glossary with inter-related entries.

Although all these resources can be termed ontologies, they fall into a spectrum of expressivity and formality. The frame-based systems are relatively rich, expressive, and formal, whereas the phrase-based terminologies are simpler and less expressive. The first TAMBIS ontology was the first bio-ontology to use a description logic as its representation and, as a consequence, has a more well-defined semantics than the other representations used in bio-ontologies.

In contrast to the narrow range of ontology use, however, the scope and detail of the content of these ontologies vary enormously. The ImMunoGeneTics, RiboWeb, and EcoCyc ontologies are highly detailed but highly specialized to one subject area, leaving only some commonality for core areas such as gene and protein. The OMB is wide ranging and high level, whereas GO lacks any high-level conceptualization but becomes very detailed, starting its conceptualization where the OMB finishes. As has been seen, the TAMBIS ontology is broad in its conceptualization, using an upper-level ontology in which to place these concepts. The ontology is relatively shallow, but detail may be added as the user dynamically creates new concepts as compositions of pre-existing concepts and has them automatically checked and classified by the ontology's reasoning service.

7.5 CURRENT AND FUTURE DEVELOPMENTS IN TAMBIS
The The first first version version of of TAMBIS TAMBIS was was successful. successful. It It is is possible possible to to use use an an ontology ontology de describing scribing a a complex complex domain, domain, such such as as molecular molecular biology biology and and bioinformatics, bioinformatics, and and use use it it to to give give the the illusion illusion of of a a common common query query interface interface to to multiple, multiple, diverse, diverse, and and

218

===== == ==== == ==;:=

7 7

u l ation Over Diverse nformation Sources Complex Query Query Form Formulation Diverse IInformation Sources in TAMBIS TAMBIS

heterogeneous information sources. The ontology drove a query formulation interface that allowed users to create complex queries over those multiple sources, queries that would usually need a program written by a trained bioinformatician. Usage of TAMBIS did, however, reveal some issues that needed to be addressed in further work.

Total transparency is not always desirable. The level of transparency offered by the first version of TAMBIS was appreciated by less skilled users who were happy to have decisions on which resources to use taken out of their hands. However, some users, usually those well versed in using bioinformatics resources, wished to express preferences about which resources to use, given that some sources may be more trusted than others. As the number of resources available within TAMBIS increases, such preferences will be able to be expressed. In addition, users may wish to record when and where data they retrieved arose [34]. In version one, the CPL query plan implicitly recorded some such information in the names of functions used in the query plan. It is more desirable, however, to record query provenance directly and explicitly.

The TAMBIS user survey [2] revealed that user intervention during query execution and inspection of intermediate results was desirable. Users often wish to monitor the progress of a complex, multi-source query, inspecting results to evaluate validity of the query so far and to edit data before it proceeds into subsequent parts of the query. Code for managing the execution of a query will have to be included into the code generated by the updated query processor in TAMBIS.

In the first version of TAMBIS, both the SSM and wrappers were hand-crafted. It is an aim of future versions of TAMBIS to build tools to support this process. Concepts in the ontology have to be related to methods or functions in the wrappers, and information about argument and return types, and costs, recorded. The ontology itself, through the ontology server, could drive such a tool and also help to check that the content of the ontology is covered within the SSM.

In the new version of TAMBIS, the TAMBIS ontology has been remodeled using the DL language DAML+OIL⁸ and classified using the FaCT reasoner [35], which is considerably more powerful than GRAIL used in the original TAMBIS ontology.⁹ This allows the biological domain to be described more precisely in the ontology and allows more precise questions to be asked by the users. In addition, the reasoning services of the DL are used extensively during query processing to support semantic query optimization based on axioms within the ontology.

8. Information on DAML+OIL can be found at http://www.daml.org.

9. Versions of this ontology represented in DAML+OIL may be found at http://img.cs.man.ac.uk/stevens/tambis-oil.html.
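The bookkeeping that such an SSM tool would have to maintain can be pictured with a small sketch. It is not the TAMBIS implementation: the class, the wrapper function names, the cost figures, and the choice to allow several sources per concept are all assumptions made for illustration. It records, for each ontology concept, the wrapper methods that can answer for it (with argument type, return type, and cost), checks that every concept is covered, and lets a user's source preferences take priority over cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WrapperMethod:
    """One hypothetical SSM entry: how a single source answers for a concept."""
    source: str       # e.g. "Swiss-Prot"
    function: str     # wrapper function name (invented for this sketch)
    arg_type: str
    return_type: str
    cost: float       # relative cost estimate; lower is cheaper


# Hypothetical SSM: ontology concept -> candidate wrapper methods.
SSM = {
    "Protein": [
        WrapperMethod("Swiss-Prot", "get_protein_by_accession",
                      "AccessionNumber", "Protein", 1.0),
        WrapperMethod("PIR", "fetch_pir_entry",
                      "AccessionNumber", "Protein", 2.5),
    ],
    "Motif": [
        WrapperMethod("PROSITE", "scan_motifs", "Protein", "Motif", 4.0),
    ],
}


def uncovered_concepts(ontology_concepts, ssm):
    """Return the ontology concepts that no wrapper method covers."""
    return sorted(c for c in ontology_concepts if not ssm.get(c))


def choose_method(concept, ssm, preferred_sources=()):
    """Pick a wrapper method, honoring source preferences before cost."""
    candidates = ssm.get(concept, [])
    if not candidates:
        raise KeyError(f"no wrapper method covers concept {concept!r}")

    def rank(method):
        if method.source in preferred_sources:
            preference = preferred_sources.index(method.source)
        else:
            preference = len(preferred_sources)
        return (preference, method.cost)

    return min(candidates, key=rank)
```

With this toy registry, `uncovered_concepts({"Protein", "Motif", "Enzyme"}, SSM)` reports `Enzyme` as missing from the SSM, and `choose_method("Protein", SSM, preferred_sources=("PIR",))` selects the PIR wrapper even though the Swiss-Prot one is cheaper.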

7.5 Current and Future Developments in TAMBIS

The query processing in version one of TAMBIS was somewhat limited, and these limitations include:
1. The ontology is represented using a relatively inexpressive DL in which certain features of the biological domain are difficult to express.
2. The mapping between concepts in the ontology and collections in the sources is quite restrictive. For example, TAMBIS did not allow multiple sources for the same kind of data (e.g., both Swiss-Prot and the Protein Information Resource (PIR) as protein sources).
3. Although queries are optimized [13], there is no semantic query optimization making use of axioms from the ontology.
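To make the third limitation concrete, the following sketch shows the kind of rewriting that semantic query optimization performs. It is a toy stand-in for a description logic reasoner, not the FaCT reasoner or the TAMBIS query processor, and the concept names and axioms are invented. Given superclass and disjointness axioms, it detects a conjunctive query that can never return results and drops conjuncts that a more specific conjunct already implies.

```python
# Hypothetical axioms: direct superclass links and pairwise disjointness.
SUPERCLASS = {
    "Enzyme": "Protein",
    "Receptor": "Protein",
    "Protein": "Biopolymer",
    "NucleicAcid": "Biopolymer",
}
DISJOINT = {frozenset({"Protein", "NucleicAcid"})}


def ancestors(concept):
    """Return the concept together with all of its superclasses."""
    result = {concept}
    while concept in SUPERCLASS:
        concept = SUPERCLASS[concept]
        result.add(concept)
    return result


def is_unsatisfiable(concepts):
    """True if two conjuncts fall under classes declared disjoint."""
    concepts = list(concepts)
    for i, first in enumerate(concepts):
        for second in concepts[i + 1:]:
            for x in ancestors(first):
                for y in ancestors(second):
                    if frozenset({x, y}) in DISJOINT:
                        return True
    return False


def simplify(concepts):
    """Drop conjuncts that are implied by a more specific conjunct."""
    concepts = set(concepts)
    return sorted(
        c for c in concepts
        if not any(c != d and c in ancestors(d) for d in concepts)
    )
```

Under these axioms a query for things that are both `Enzyme` and `NucleicAcid` is rejected before any source is contacted, while a query for `Enzyme` and `Protein` is reduced to `Enzyme` alone.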

In the second version of TAMBIS, an object-oriented wrapper layer has been adopted to replace that provided by CPL and BioKleisli. Instead of a CPL query plan, a Java program is written. The use of an object-oriented wrapper layer will make TAMBIS compatible with mainstream middleware proposals such as that standardized by the Object Management Group (OMG), which in turn is associated with an important standardization activity in bioinformatics.¹⁰

10. Information about the effort of standardization of bioinformatics by the OMG is available at http://www.omg.org/homepages/lsr/.

7.5.1 Summary

This chapter has provided an overview of the first TAMBIS system for querying distributed bioinformatics sources. The key contributions of TAMBIS are:
1. It is the first ontology-based information integration system to be used in bioinformatics. Although ontologies are becoming important in bioinformatics for annotating databases [33] and for managing complex information resources [30], TAMBIS is the first project to use ontologies to support the important task of integrating bioinformatics resources.
2. TAMBIS is centered on the first description logic-based ontology in bioinformatics. Other ontologies in bioinformatics have made use of frame-based representations or structured terminologies, but they are not amenable to subsumption reasoning as in TAMBIS.
3. The user interface in TAMBIS is driven directly from the ontology, and as such, it both guides the user in constructing well-formed requests and detects when biologically nonsensical questions have been asked. Other knowledge-based information integration systems have paid less attention to user interaction issues.
4. The TAMBIS query processor has been integrated with existing wrapping software, allowing re-use of established middleware techniques and existing wrappers. The query processor makes minimal assumptions on the query interfaces made available by sources, reflecting the limited public query interfaces generally available in bioinformatics.

TAMBIS seeks to provide correct answers to precisely formed queries. Queries can be expressed precisely, at a level of detail corresponding to that of the underlying resources, by using the ontology to constrain what it is valid to ask. Answers should be correct because the sources and services model makes explicit how queries expressed over the ontology can be answered using the available sources. However, such quality of service is achieved at some cost; the development of ontologies that describe a domain is a skilled and time-consuming process (the 1800-concept TAMBIS ontology took 2 person-years to write), and incorporating a wrapped source into the SSM is itself a manual and time-consuming task. However, these two tasks involve (1) describing what it is valid to ask of a collection of bioinformatics sources and (2) describing how to obtain answers from a collection of sources. Although the developers and maintainers of a TAMBIS installation must undertake these tasks, the users of the TAMBIS system need not, and thus they can benefit from the knowledge encoded in the ontology and in the SSM.
ACKNOWLEDGMENTS

This work is funded by AstraZeneca pharmaceuticals, the BBSRC/EPSRC Bioinformatics Initiative (grant number BIF/05344), and the EPSRC Distributed Information Management Initiative (grant number GR/M76607), whose support we are pleased to acknowledge. We are also grateful to Alex Jacoby and Martin Peim for their contributions to the implementation of the TAMBIS system.

REFERENCES
[1] V. M. Markowitz and O. Ritter. "Characterizing Heterogeneous Molecular Biology Database Systems." Journal of Computational Biology 2, no. 4 (1995): 547-556.
[2] R. D. Stevens, C. A. Goble, P. Baker, et al. "A Classification of Tasks in Bioinformatics." Bioinformatics 17, no. 2 (2001): 180-188.
[3] C. A. Goble, R. Stevens, G. Ng, et al. "Transparent Access to Multiple Bioinformatics Information Sources." IBM Systems Journal 40, no. 2 (2001): 532-552.
[4] P. Buneman, S. B. Davidson, K. Hart, et al. "A Data Transformation System for Biological Data Sources." In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), 158-169. San Francisco: Morgan Kaufmann, 1995.
[5] R. Stevens, C. A. Goble, and S. Bechhofer. "Ontology-Based Knowledge Representation for Bioinformatics." Briefings in Bioinformatics 1, no. 4 (November 2000): 398-416.
[6] P. G. Baker, C. A. Goble, S. Bechhofer, et al. "An Ontology for Bioinformatics Applications." Bioinformatics 15, no. 6 (1999): 510-520.
[7] A. Borgida. "Description Logics in Data Management." IEEE Transactions on Knowledge and Data Engineering 7, no. 5 (1995): 785-798.
[8] G. A. Ringland and D. A. Duce. Approaches to Knowledge Representation: An Introduction. New York: John Wiley, 1988.
[9] F. Baader, D. McGuinness, D. Nardi, et al., eds. The Description Logic Handbook: Theory, Implementation and Applications. Cambridge, UK: Cambridge University Press, 2003.
[10] A. L. Rector, S. K. Bechhofer, C. A. Goble, et al. "The GRAIL Concept Modelling Language for Medical Terminology." Artificial Intelligence in Medicine 9, no. 2 (1997): 139-171.
[11] F. M. Donini, M. Lenzerini, D. Nardi, et al. "Reasoning in Description Logics." In Foundations of Knowledge Representation, 191-236. Stanford, CA: Center for the Study of Language and Information (CSLI) Publications, 1996.
[12] S. Bechhofer, R. Stevens, G. Ng, et al. "Guiding the User: An Ontology Driven Interface." In Proceedings of User Interfaces to Data Intensive Systems (UIDIS99), edited by N. W. Paton and T. Griffiths, 158-161. New York: IEEE Press, 1999.
[13] N. W. Paton, R. Stevens, P. Baker, et al. "Query Processing in the TAMBIS Bioinformatics Source Integration System." In Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM), 138-147. New York: IEEE Press, 1999.
[14] A. N. Swami. "Optimization of Large Join Queries." In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, 367-376. New York: ACM Press, 1989.
[15] S. B. Davidson, C. Overton, V. Tannen, et al. "BioKleisli: A Digital Library for Biomedical Researchers." Journal of Digital Libraries 1, no. 1 (November 1997): 36-53.


[16] T. Ezold, A. Ulyanov, D. Nardi, et al. "SRS: Information Retrieval System for Molecular Biology Data Banks." Methods in Enzymology 266 (1996): 114-128.
[17] I-M. A. Chen, A. S. Kosky, V. M. Markowitz, et al. "Constructing and Maintaining Scientific Database Views in the Framework of the Object Protocol Model." In Proceedings of the 9th International Conference on SSDBM, 237-248. New York: IEEE Press, 1997.
[18] A. C. Siepel, A. N. Tolopko, A. D. Farmer, et al. "An Integration Platform for Heterogeneous Bioinformatics Software Components." IBM Systems Journal 40, no. 2 (2001): 570-591.
[19] L. M. Haas, P. Kodali, J. E. Rice, et al. "Integrating Life Sciences Data with a Little Garlic." In Proceedings of the International Symposium on Bio-Informatics and Biomedical Engineering (BIBE), 5-12. New York: IEEE Press, 2000.
[20] G. J. L. Kemp, N. Angelopoulos, and P. M. D. Gray. "A Schema-Based Approach to Building a Bioinformatics Database Federation." In Proceedings of the International Symposium on Bio-Informatics and Biomedical Engineering (BIBE), 13-20. New York: IEEE Press, 2000.
[21] R. Stevens and C. Miller. "Wrapping and Interoperating Bioinformatics Resources Using CORBA." Briefings in Bioinformatics 1, no. 1 (2000): 9-21.
[22] P. Rodriguez-Tomé, C. Helgesen, P. Lijnzaad, et al. "A CORBA Server for the Radiation Hybrid DataBase." In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology, 250-253. Menlo Park, CA: AAAI Press, 1997.
[23] T. Coupaye. "Wrapping SRS With CORBA: From Textual Data to Distributed Objects." Bioinformatics 15, no. 4 (1999): 333-338.
[24] D. Fenyo. "The Biopolymer Markup Language." Bioinformatics 15, no. 4 (1999): 339-340.
[25] N. W. Paton, C. A. Goble, and S. Bechhofer. "Knowledge Based Information Integration Systems." Information and Software Technology 42, no. 5 (2000): 299-312.
[26] Y. Arens, C. A. Knoblock, and W-M. Shen. "Query Reformulation for Dynamic Information Integration." Journal of Intelligent Information Systems 6, no. 2-3 (1996): 99-130.
[27] A. Y. Levy, D. Srivastava, and T. Kirk. "Data Model and Query Evaluation in Global Information Systems." Journal of Intelligent Information Systems 5, no. 2 (1995): 121-143.
[28] E. Mena, A. Illarramendi, V. Kashyap, et al. "Observer: An Approach for Query Processing in Global Information Systems Based on Interoperation Across Pre-Existing Ontologies." Distributed and Parallel Databases 8, no. 2 (2000): 223-271.

[29] J. D. Ullman. "Information Integration Using Logical Views." In Proceedings of ICDT '97: 6th International Conference on Database Theory, 19-40. Heidelberg, Germany: Springer-Verlag, 1997.
[30] R. Altman, M. Bada, X. J. Chai, et al. "RiboWeb: An Ontology-Based System for Collaborative Molecular Biology." IEEE Intelligent Systems 14, no. 5 (1999): 68-76.
[31] P. Karp, M. Riley, S. Paley, et al. "EcoCyc: Electronic Encyclopedia of E. coli Genes and Metabolism." Nucleic Acids Research 27, no. 1 (1999): 55-58.
[32] S. Schulze-Kremer. "Ontologies for Molecular Biology." In Proceedings of the Third Pacific Symposium on Biocomputing, 693-704. Singapore: World Scientific, 1998.
[33] M. Ashburner, C. A. Ball, J. A. Blake, et al. "Gene Ontology: Tool for the Unification of Biology." Nature Genetics 25, no. 1 (2000): 25-29.
[34] P. Buneman, S. Khanna, and W-C. Tan. "Data Provenance: Some Basic Issues." In FSTTCS 2000: 20th Conference on Foundations of Software Technology and Theoretical Computer Science, New Delhi, India. Lecture Notes in Computer Science, vol. 1974, 87-93. Heidelberg, Germany: Springer-Verlag, 2000.
[35] I. Horrocks. "Using an Expressive Description Logic: FaCT or Fiction?" In Principles of Knowledge Representation and Reasoning: Proceedings of the Sixth International Conference (KR '98), edited by A. G. Cohn, L. K. Schubert, and S. C. Shapiro, 636-647. San Francisco: Morgan Kaufmann, 1998.


CHAPTER 8

The Information Integration System K2

Val Tannen, Susan B. Davidson, and Scott Harker

In 1993, the invitational Department of Energy (DOE) workshop on genome informatics published a report that claimed that until all sequence data is gathered in a standard relational database, none of the queries in the appendix to the report could be answered (see Figure 8.1 for a listing of the queries) [1]. While the motivation for the statement was largely political, the gauntlet had been laid in plain view for database researchers: The data to answer the queries in the appendix were (by and large) available, but they were stored in a number of physically distributed databases. The databases represented their data in a variety of formats using different query interfaces. The challenge was, therefore, one of integrating heterogeneous, distributed databases and software programs in which the type of data was complex, extending well beyond the capabilities of relational technology.

As an example of the type of genomic data that is available online, consider the EMBL-format Swiss-Prot entry shown in Figure 8.2. Each line begins with a two-character code, which indicates the type of data contained in the line. For example, each entry is identified by an accession number (AC) and is timestamped by up to three dates (DT). The create date is mandatory, while the sequence update and annotation update dates only appear if the sequence or annotation has been modified since the entry was created. The sequence (SQ), a list of amino acids, appears at the end of the entry; the rest of the core data includes citation information (bibliographical references, lines beginning with R), taxonomic data (OC), a description of the biological source of the protein, and database references (DR), explicit links to entries in other databases: EMBL (annotated nucleotide sequence database); HSSP (homology derived secondary structure of proteins); Wormpep (predicted proteins from the Caenorhabditis elegans genome sequencing project); InterPro, Pfam, PRINTS, PROSITE (databases of protein families and domains, among other things). Annotation information, which is obtained from publications reporting new sequence data, review articles, and external experts, is mainly found in the feature table (FT), keyword lines (KW), and comment lines (CC), which do not appear in this example due to lack of space. Note that the bibliographical


The following "unanswerable queries" were taken from Appendix 1 of the 1993 Invitational DOE Workshop on Genome Informatics report, available at http://www.ornl.gov/hgmis/publicat/miscpubs/bioinfo/contents.html. Rather than saying that all sequence databases must be relationalized, the wording of the report has now been modified to say "until a fully atomized sequence database is available (i.e., no data stored in ASCII text fields), none of the queries in this appendix can be answered."

1. Return all sequences that map "close" to marker M on human chromosome 19, are putative members of the olfactory receptor family, and have been mapped on a contig map of the region; return also the contig descriptions. (This is nominally a link between GenBank, GDB, and LLNL's databases.)
2. Return all genomic sequences for which alu elements are located internal to a gene domain.
3. Return the map location, where known, of all alu elements having homology greater than "h" with the alu sequence "S".
4. Return all human gene sequences, with annotation information, for which a putative functional homologue has been identified in a nonvertebrate organism; return also the GenBank accession number of the homologue sequence where available.
5. Return all mammalian gene sequences for proteins identified as being involved in intracellular signal transduction; return annotation information and literature citations.
6. Return any annotation added to my sequence number #### since I last updated it.
7. Return the genes for zinc-finger proteins on chromosome 19 that have been sequenced. (Note that answering this requires either query by sequence similarity or uniformity of nomenclature.)
8. Return the number and a list of the distinct human genes that have been sequenced.
9. Return all the human contigs greater than 150 kb.
10. Return all sequences, for which at least two sequence variants are known, from regions of the genome within +/- one chromosome band of DS14###.
11. Return all publications from the last 2 years about my favorite gene, accession number ####.
12. Return all G1/S serine/threonine kinase genes (and their translated proteins) that are known (experimentally) or are thought (by similarity) to also exhibit tyrosine phosphorylation activity. Keep clear the distinction in the output.

FIGURE 8.1 The 1993 DOE Report's "unanswerable" queries.

references are nested structures; there are two references, and the RP (Reference Position), RC (Reference Comment), RA (Reference Author), and RL (Reference Location) fields are specific to each reference. Similarly, the FT (Feature Table) is a nested structure in which each line contains a start and end position (e.g., 14 to 21), a type of feature (e.g., NP_BIND), and a description. The entry is designed to be read easily by a human being and structured enough to be machine parsed. However, several lines still contain a certain amount of structure that could be separated out during parsing. For example, the author list is a string, which could be parsed into a list of strings so as to be able to index into the individual authors. Similarly, the taxonomic data is also a string spread over several lines and could again be parsed into a list.
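The parsing steps just described can be sketched in a few lines. This is an illustrative reading of the line format, not K2's parser: the function names are invented, the sample is a handful of lines in the style of the entry in Figure 8.2, and the assumption that author lists are comma-separated and end with a semicolon comes from general Swiss-Prot flat-file practice, not from this chapter. The sketch groups lines by their two-character code, nests the R lines under the reference they belong to, and turns the taxonomy and author strings into lists.

```python
from collections import defaultdict

SAMPLE = """\
AC   P53013;
OC   Eukaryota; Metazoa; Nematoda; Chromadorea; Rhabditida; Rhabditoidea;
OC   Rhabditidae; Peloderinae; Caenorhabditis.
RN   [1]
RP   SEQUENCE FROM N.A. (EFT-3).
RA   Favello A.;
RN   [2]
RP   SEQUENCE FROM N.A. (R03G5.1).
RA   Waterston R.;
"""


def parse_entry(text):
    """Group lines by two-character code; nest R lines per reference."""
    fields = defaultdict(list)
    references = []
    for line in text.splitlines():
        if len(line) < 2 or not line.strip():
            continue
        code, value = line[:2], line[2:].strip()
        if code == "RN":
            # An RN line opens a new nested reference.
            references.append(defaultdict(list))
        elif code.startswith("R") and references:
            references[-1][code].append(value)
        else:
            fields[code].append(value)
    return fields, references


def split_list(values, sep, terminator):
    """Join continuation lines, drop the terminator, split into items."""
    joined = " ".join(values).strip()
    if joined.endswith(terminator):
        joined = joined[:-len(terminator)]
    return [item.strip() for item in joined.split(sep) if item.strip()]


fields, references = parse_entry(SAMPLE)
taxonomy = split_list(fields["OC"], sep=";", terminator=".")
authors = [split_list(ref["RA"], sep=",", terminator=";") for ref in references]
```

After parsing, `taxonomy` is a list running from `Eukaryota` to `Caenorhabditis`, and `authors` holds one list of names per reference, so individual authors can be indexed directly.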

ID   EF1A_CAEEL     STANDARD;      PRT;   463 AA.
AC   P53013;
DT   01-OCT-1996 (Rel. 34, Created)
DT   01-OCT-1996 (Rel. 34, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   ELONGATION FACTOR 1-ALPHA (EF-1-ALPHA).
GN   (EFT-3 OR F31E3.5) AND R03G5.1.
OS   Caenorhabditis elegans.
OC   Eukaryota; Metazoa; Nematoda; Chromadorea; Rhabditida; Rhabditoidea;
OC   Rhabditidae; Peloderinae; Caenorhabditis.
RN   [1]
RP   SEQUENCE FROM N.A. (EFT-3).
RC   STRAIN=BRISTOL N2;
RA   Favello A.;
RL   Submitted (NOV-1995) to the EMBL/GenBank/DDBJ databases.
RN   [2]
RP   SEQUENCE FROM N.A. (R03G5.1).
RC   STRAIN=BRISTOL N2;
RA   Waterston R.;
RL   Submitted (MAR-1996) to the EMBL/GenBank/DDBJ databases.
DR   EMBL; U51994; AAA96068.1; -.
DR   EMBL; U40935; AAA81688.1; -.
DR   HSSP; P07157; 1AIP.
DR   WORMPEP; F31E3.5; CE01270.
DR   WORMPEP; R03G5.1; CE01270.
DR   INTERPRO; IPR000795; -.
DR   PFAM; PF00009; GTP_EFTU; 1.
DR   PRINTS; PR00315; ELONGATNFCT.
DR   PROSITE; PS00301; EFACTOR_GTP; 1.
KW   Elongation factor; Protein biosynthesis; GTP-binding;
KW   Multigene family.
FT   NP_BIND      14     21       GTP (BY SIMILARITY).
FT   NP_BIND      91     95       GTP (BY SIMILARITY).
FT   NP_BIND     153    156       GTP (BY SIMILARITY).
SQ   SEQUENCE   463 AA;  50668 MW;  12544AF1F17E15B7 CRC64;
     MGKEKVHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAQEMG KGSFKYAWVL
     DKLKAERERG ITIDIALWKF ETAKYYITII DAPGHRDFIK NMITGTSQAD CAVLVVACGT
     GEFEAGISKN GQTREHALLA QTLGVKQLIV ACNKMDSTEP PFSEARFTEI TNEVSGFIKK
     IGYNPKAVPF VPISGFNGDN MLEVSSNMPW FKGWAVERKE GNASGKTLLE ALDSIIPPQR
     PTDRPLRLPL QDVYKIGGIG TVPVGRVETG IIKPGMVVTF APQNVTTEVK SVEMHHESLP
     EAVPGDNVGF NVKNVSVKDI RRGSVCSDSK QDPAKEARTF HAQVIIMNHP GQISNGYTPV
     LDCHTAHIAC KFNELKEKVD RRTGKKVEDF PKFLKSGDAG IVELIPTKPL CVESFTDYAP
     LGRFAVRDMR QTVAVGVIKS VEKSDGSSGK VTKSAQKAAP KKK

FIGURE 8.2  Sample Swiss-Prot entry.

As shown in Figure 8.2, the type system for genomic data naturally goes beyond the sets of records of relational databases and includes sequential data (lists), deeply nested record structures, and union types (variants). As an example of a union type, the format of the RL line depends on the type of publication: An unpublished entry contains a brief comment; a journal citation includes the journal abbreviation, the volume number, the page range, and the year; the format of a book citation includes the set of editor names, the name of the book, an optional volume, the page range, the publisher, city, and year. The structure of this Swiss-Prot entry can be described precisely in a data definition language with sufficiently rich types. Such a description will be shown later on in Figure 8.5.

The database group at the University of Pennsylvania responded to the DOE challenge by developing a view integration (or integration on-the-fly) environment. In such an environment, the schemas of a collection of underlying data sources are merged to form a global schema in some common model (e.g., relational, complex value, or object-oriented). Users query this global schema using a high-level query language, such as Structured Query Language (SQL) [2], Object Query Language (OQL) [3], or Collection Programming Language (CPL) [4]; the system then determines what portion of the global query can be answered by which underlying data source, ships local queries off to the underlying data sources, and then combines answers from the underlying data sources to produce an answer to the global query. The initial view-integration environment developed by our group was called Kleisli, and it was designed and implemented by Limsoon Wong. Wong later re-designed and re-implemented the Kleisli system at Singapore's Institute of Systems Science; this new version of Kleisli is described in Chapter 6 of this book. About the same time, other information integration projects were also developed [5, 6, 7], including the system based on the Object Protocol Model (OPM) [8].
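The decompose, ship, and combine cycle of view integration described above can be illustrated with a minimal Python sketch. It is not how Kleisli or K2 is implemented; the two in-memory "sources", their contents, and every function name are hypothetical stand-ins for autonomous remote databases.

```python
# Hypothetical stand-ins for two autonomous underlying data sources.
GENE_SOURCE = [
    {"gene": "geneA", "chromosome": "22"},
    {"gene": "geneB", "chromosome": "19"},
    {"gene": "geneC", "chromosome": "22"},
]
PUBLICATION_SOURCE = [
    {"gene": "geneA", "title": "Paper 1", "year": 2001},
    {"gene": "geneA", "title": "Paper 2", "year": 1995},
    {"gene": "geneC", "title": "Paper 3", "year": 2002},
    {"gene": "geneB", "title": "Paper 4", "year": 2002},
]


def local_gene_query(chromosome):
    """Local query answered entirely by the gene source."""
    return [row["gene"] for row in GENE_SOURCE if row["chromosome"] == chromosome]


def local_publication_query(genes, since):
    """Local query answered entirely by the publication source."""
    wanted = set(genes)
    return [row for row in PUBLICATION_SOURCE
            if row["gene"] in wanted and row["year"] >= since]


def global_query(chromosome, since):
    """Global query: recent publications for every gene on a chromosome."""
    genes = local_gene_query(chromosome)                  # shipped to source 1
    publications = local_publication_query(genes, since)  # shipped to source 2
    titles = {gene: [] for gene in genes}                 # combine the answers
    for publication in publications:
        titles[publication["gene"]].append(publication["title"])
    return [{"gene": gene, "titles": titles[gene]} for gene in genes]


result = global_query("22", 2000)
```

The user only sees `global_query`; which source answers which part, and how the partial answers are joined, is decided inside the integration layer.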
K2 vs. Kleisli

K2 is a successor system to Kleisli that was designed and implemented at the University of Pennsylvania by Jonathan Crabtree, Scott Harker, and Val Tannen. Like Kleisli, K2 uses a complex value model of data and is based on the so-called monad approach (see Section 8.4). However, the design of K2 also contains a number of new ideas and redirections: First, the model incorporates a notion of dictionaries, which allows a natural representation of object-oriented classes [9] as well as Web-based data. Second, the internal language features a new approach to aggregate and collection conversion operations [10]. Third, the syntax of the language follows a mainstream query language for object-oriented databases called OQL [3] rather than the elegant but less familiar comprehension-style syntax originally used in CPL [4] (Kleisli now uses an adapted SQL syntax). Fourth, a separation is made between the mediator (global schema) level and the query level by introducing a mediator definition language for K2, K2MDL. K2MDL combines an Object Definition Language (ODL) specification of the global schema with OQL statements that describe the data mapping. The ability to specify intermediate mediators allows a large integration environment to be created in layers and componentized.


Finally, to improve its portability, K2 is implemented in Java and makes use of several of the standard protocols and application programming interfaces (APIs) that are part of the Java platform,¹ including Remote Method Invocation (RMI)² and Java Data Base Connectivity (JDBC).³ Thus, K2 is not an extension of Kleisli, but rather a system implemented from scratch that shares with Kleisli some of its design principles, while featuring a number of distinct developments just outlined. Overall, the goal of K2 is to provide a generic and flexible view integration environment appropriate for the complex data sources and software systems found throughout genomics, which is portable and appeals to common practices and standards.

8.1 APPROACH
A number of other techniques have also been developed over the past 11 years in response to the DOE challenge, including link-driven federations and warehouses. In a link-driven federation, users start by extracting entries of interest at one data source and then hop to other related data sources via Web links that have been explicitly created by the developers of the system. The Sequence Retrieval System (SRS) [11] presented in Chapter 5, LinkDB [12], and GeneCards [13] are examples of this approach. While the federation approach is easy to use, especially for novices, it does not scale well: When a new data source is added to the federation, connections between its entries and entries of all existing federation data sources must be added; this is commonly referred to as the N² problem. Furthermore, if users are interested in a join between two data sources in the federation, they must manually perform the join by clicking on each entry in the first data source and following all connections to the second data source.⁴ In contrast, a join can be expressed in a single high-level query in a view or warehouse integration strategy. In general, the query languages supporting view or warehouse integration approaches are much more powerful and allow arbitrary restructuring of the retrieved data.

A warehouse strategy creates a central repository of information and annotations. One such example is the Genomics Unified Schema (GUS) [14], which integrates and adds value to data obtained from GenBank/EMBL/DDBJ, dbEST, and Swiss-Prot (and others) and contains annotated nucleotide (deoxyribonucleic acid [DNA], ribonucleic acid [RNA]) and amino acid (protein) sequences. Note that view integration systems can also be used to create warehouses, which are instantiations of the global schema. The advantage of a warehouse approach over a view integration is one of speed and reliability; because all data are local, delays and failures associated with networks can be avoided. Furthermore, there is greater control over the data. However, a warehouse is not dynamic: Not only must it be kept up-to-date with respect to the underlying data sources, but including a new data source or algorithm is time-consuming. A more extended discussion of the problems and benefits of the link, warehouse, and view integration approaches can be found in articles in the IBM Systems Journal and the Journal of Digital Libraries [14, 15].

1. See http://www.javasoft.com/j2se/.
2. See http://www.javasoft.com/products/rmi-iiop/.
3. See http://java.sun.com/products/jdbc/.
4. A counterexample to this is SRS, in which a linking operator is provided to retrieve linked entries to a set of entries.

K2 is a system for generating mediators. Mediators are middleware components that integrate domain-specific data from multiple sources, reducing and restructuring data to an appropriate virtual view [16]. A major benefit of mediation is the scalability and long-term maintenance of the integration system's structure.

Figure 8.3 is an example of how mediators can help the data integration task. In this example, each of the boxes represents a machine on which a copy of K2 is used to provide a mediator for some local, as well as external, data sources and views. Mediator 1 resides behind a company firewall and was built to integrate data from a local database and local copies of Swiss-Prot and GenBank, accessed through SRS, with the application program BLAST. Mediator 2 was then built outside the firewall to integrate data from some external data sources: PubMed and a patent database, which can only be accessed through a Web interface. An external copy of PubMed was used due to its size and the fact that the most recent version was always needed. Mediator 3 was then built within the company firewall to integrate data from Mediators 1 and 2, and Mediator 1 was enlarged to integrate data from Mediator 3.

[Diagram boxes: K2 Mediator 1, K2 Mediator 2, K2 Mediator 3, Local DB (two boxes), BLAST, SRS (access to Swiss-Prot, GenBank)]

FIGURE 8.3  Mediator example.

K2 (together with Tsimmis [17]) distinguishes itself among approaches based on mediation in that it generates mediators starting from a concise, high-level description.
This makes K2 especially appropriate for configurations in which many mediators are needed or in which mediators must be frequently changed due to instability in the data sources or in the client needs. Some of the salient features of K2's mediation environment are:

• K2 has a universal internal data model with an external data exchange format for interoperation with similar components.
• It has interfaces based on the Object Data Management Group (ODMG) standard [3] for both data definition and queries.
• It integrates nested data, while offering a Java-based interface (JDBC) to relational database systems and an ODMG interface to object-oriented database systems.
• It offers a new way to program integration/transformation/mediation in a very high-level declarative language (K2MDL) that extends ODMG.
• It has an extensible rule-based and cost-based optimizer.
• External or internal decision-support systems can easily be included.
• It is written entirely in Java, with corresponding consequences about portability.

The basic functionality of a K2-generated mediator is to implement a data transformation from one or more data sources to one data target. The component contains a high-level (in ODMG/ODL) description of the schemas (for sources and the target) and of the transformation (in K2MDL). From the target's perspective, the mediator offers a view that, in turn, can become a data source for another mediator.

An overview of the K2 architecture is given in Figure 8.4. In this diagram, clients can issue OQL queries or other commands against an integration schema constructed using K2MDL. The queries are then translated to the K2 internal language using the K2MDL and query translators. The resulting internal-language expression is then optimized and executed, using data drivers to ship sub-queries to external data sources and return results.

[Diagram: clients (Web clients, local users, admin, remote K2 clients) reach the K2 server through OQL, commands, and method calls. The K2 server contains the K2MDL translator, query translator, query optimizer, execution engine, and data drivers (JDBC driver, Pipe driver, W4F, RMI). Data sources: relational databases, USPTO, PubMed, K2.]

FIGURE 8.4  K2 system architecture.

The remainder of this chapter walks through the architecture by describing the data model, illustrating K2MDL and OQL, and briefly discussing the internal language, data drivers, query optimization, and user interfaces. The chapter closes with a discussion of scalability and impact.
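The idea from this section that a mediator maps one or more sources to a single virtual target, and that this target can in turn serve as a source for another mediator, can be sketched as follows. This is a toy Python model under stated assumptions, not K2's Java implementation or K2MDL: the `Mediator` class, the source functions, and all data are hypothetical.

```python
class Mediator:
    """Toy mediator: a transformation from several sources to one virtual view."""

    def __init__(self, sources, transform):
        self.sources = sources      # zero-argument callables that return data
        self.transform = transform  # combines the source results into the target

    def view(self):
        # The target is virtual: it is recomputed from the sources on demand.
        return self.transform(*(source() for source in self.sources))


# Hypothetical data sources.
def local_db():
    return [{"gene": "geneA", "length": 463}, {"gene": "geneB", "length": 120}]


def literature():
    return [{"gene": "geneA", "title": "Paper 1"}]


# Two first-level mediators, each over one source.
mediator_1 = Mediator([local_db], lambda rows: [r for r in rows if r["length"] > 200])
mediator_2 = Mediator([literature], lambda rows: {r["gene"]: r["title"] for r in rows})


def join(genes, titles):
    """Attach the title (if any) found by the second view to each gene record."""
    return [{**gene, "title": titles.get(gene["gene"])} for gene in genes]


# A second-level mediator whose sources are the views of the other two.
mediator_3 = Mediator([mediator_1.view, mediator_2.view], join)
combined = mediator_3.view()
```

Because `mediator_3` only depends on the `view` callables, either lower-level mediator can be replaced without touching the others, which is the layering benefit the text attributes to mediation.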

8.2 DATA MODEL AND LANGUAGES


ODMG was was founded founded by by vendors vendors of of object-oriented object-oriented database database management management systems systems ODMG and is is affiliated affiliated with with the the Object Object Management Management Group Group (OMG), (OMG), who who created created the the and Common Object Object Request Request Broker Broker Architecture Architecture (CORBA). ( CORBA). The The ODMG ODMG standard standard Common has two two main main components: components: The The first first is is ODL, ODL, a a data data definition definition language language that that is is has

8.2 Data 8.2 M ode Data / a n , ,Model , ~ d and Lang,~_~, uages ~ ..................................................................................................................................................................... 233 233

used to to define define data data elements. elements. ODL ODL is is an an extension extension of of CORBA's CORBA's Interface Interface Definition Definition used Language (IDL). (IDL). The The second second is is OQL, OQL, an an enhanced enhanced SQL92-1ike SQL92-like language language that that is is Language used for querying. By By building building on on these these standards, standards, K2 K2 leverages leverages the following used for querying. the following features: features: Rich modeling modeling capabilities capabilities 9 Rich Seamless interoperability interoperability with with relational, relational, object-oriented, object-oriented, information information rere 9 Seamless trieval (dictionaries), (dictionaries), and and electronic electronic data data interchange interchange (EDI) (EDI) formats formats (e.g., (e.g., trieval ASN.l ) ASN.1) Compatibility with with the the Universal Universal Modeling Modeling Language Language (UML) (UML) 9 Compatibility Integration with extensible markup markup language language (XML) (XML) documents documents with a given given 9 Integration with extensible with a Document Type Definition (DTD) Document Type Definition (DTD) Official bindings bindings to to Java, C++, and Smalltalk 9 Official Java, C++, and Smalltalk Industrial support ODMG members (Ardent, Poet, Object Design) Design) 9 Industrial support from from ODMG members (Ardent, Poet, Object use in in building building ontologies 9 Increasing Increasing use K2 uses uses ODMG's ODMG's ODL to represent represent the the data data sources turns out out K2 ODL to sources to to be be integrated. integrated. It It turns that many biological biological data data sources be described whose keys that many sources can can be described as as dictionaries dictionaries whose keys are are simple strings and whose entries entries are values. simple strings and whose are complex complex values. is simply simply a a finite finite function. function. 
Therefore, Therefore, it it has has a a domain domain that that is is a a A dictionary is value (called finite element of a value (called finite set set and and associates associates to to each each element of the the domain domain (called (called a a key) a ti io onary<Tl 2> an entry).. The The type type of of dictionary dictionary in in ODL ODL is is denoted denoted by by dic dict n a r y < T l ,, T T2> an entry) where s the f the 2 iis s the f the n OQL, where Tl T1 i is the type type o of the keys keys and and T T2 the type type o of the entries. entries. I In OQL, the the entry entry in Because OQL in the the dictionary, dictionary, L, corresponding corresponding to to the the key, key, k, is is denoted denoted by by L [ rk 1 ].. Because OQL has ( L ) is L has no no syntax syntax for for the the domain domain of of a a dictionary dictionary L L,, dam dom(L) is an an addition addition to to OQ OQL for this purpose. has type > then L ) has for this purpose. Note Note that that if if L has type dictionary<Tl d i c t i o n a r y < T 1 , , T2 T2> then dam dom ((L) has type set<Tl>. type set<Tl>. Complex value value data data are are built built by by arbitrarily arbitrarily nesting nesting records records (tuples) (tuples);; collections-such s sets, Variants are are collections--such a as sets, bags bags (multisets), (multisets), and and lists; lists; and and variants. variants. Variants tagged unions). pieces of data representing tagged alternatives (also known as pieces of data representing tagged alternatives (also known as tagged To To illustrate illustrate complex complex values values (including (including variants) variants),, Figure Figure 8.5 8.5 presents presents an an ODL ODL declaration declaration for for a a class class whose whose objects objects correspond correspond to to (parts (parts of) of) Swiss-Prot Swiss-Prot en enRe f returns complex values obtained by nesting sets, lists, tries. The attribute tries. 
The attribute me returns complex values obtained by nesting sets, lists, records records ((s st t rruct u c t in in ODL), ODL), and and variants, variants, the the latter latter identified identified by by the the keyword keyword choice. choice. K2's K2's approach approach to to data data integration integration consists consists of of two two stages. stages. In In the the first first stage stage users users specify specify data data transformations transformations between between multiple multiple sources sources and and a a single single target. target. The The target target is is virtual virtual (unmaterialized) (unmaterialized) and and is, is, in in effect, effect, a a new new view. view. The The sources sources

    class Entry
    (extent Entries)
    {
      attribute string ID;
      attribute string AC;
      attribute struct Dates
        { date Create;
          date SeqUpdate;
          date AnnotUpdate;
        } DT;
      ...
      attribute list<string> OC;
      attribute list<struct
        { string RP;
          string RC;
          list<string> RA;
          choice { string present; bool absent; } RT;
          choice
            { string Unpublished;
              struct
                { string JAbbrev;
                  short Volume;
                  short Year;
                  struct { short from; short to; } Pages;
                } Journal;
              struct
                { set<string> Editors;
                  short Volume;
                  string Title;
                  string Publisher;
                  string City;
                  short Year;
                  struct { short from; short to; } Pages;
                } Book;
              ...
            } RL;
          ...
        }> Ref;
      attribute string KW;
      ...
      attribute struct
        { string KeyName; long From; long To; string Desc; } FT;
      attribute string SQ;
    }

FIGURE 8.5 ODL description of a class of Swiss-Prot entries (partial).

may consist of materialized data or virtual views that have been defined previously through similar data transformations.

In the second stage users formulate queries against the virtual views. OQL is an excellent vehicle for the second stage, but because it does not construct new classes as output, it is not expressive enough for the first stage. Hence, for defining sources-target transformations, K2 uses a new language, K2MDL, which combines the syntax of ODL and OQL to express high-level specifications of middleware components called mediators, as explained previously. Some examples of K2MDL syntax are in the next section.

The rich type system used in K2 has allowed us to model a large range of practical data sources in a transparent and friendly manner.

8.3 AN EXAMPLE

The K2 approach is illustrated with an example where the target data could be called an ontology, that is, a schema agreed upon by a class of users. It shows how a mediator generated by K2 could implement it in terms of standard data sources. Consider the following data description, which is given in ODL syntax (see footnote 5):

TARGET DATA DESCRIPTION

    class Protein (extent proteins)
    {
      attribute string SwissprotAccession;
      attribute string recommendedName;
      attribute set<string> alternateNames;
      attribute string sequence;
      attribute int seqLength;
      relationship set<Gene> hasSource
        inverse Gene::hasProduct;
    }

    class Gene (extent genes)
    {
      struct Range {long from; long to;};
      attribute string name;
      attribute string organism;
      attribute Range location;
      relationship Protein hasProduct
        inverse Protein::hasSource;
    }

5. This is, of course, a very simplified model of proteins and genes, but the intent of this example is to demonstrate K2MDL, not to develop a scientifically viable model.

Data about proteins is recorded using a Swiss-Prot accession number, a recommended name, a set of alternate names, the protein sequence and its length, and a set of references to the genes that code for the protein. Data about genes consists of the name of the gene, the name of the organism from which it comes, the location of the gene in the genome, and a reference to the protein for which it codes. While most of our attributes have simple values (strings and integers), alternateNames is a set of strings and location is a record. The schema also specifies that hasSource in Protein and hasProduct in Gene are more than just attributes; they form a relationship between the extents of the two classes and are inverses. This means that the following two statements hold:

• Given a protein P, for each of the genes G in the set P.hasSource it is the case that G.hasProduct = P.

• Given a gene G, it is the case that G belongs to the set (G.hasProduct).hasSource.
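
The two statements above can be checked mechanically. The following Python sketch is only an illustration and is not part of K2: it assumes the relationship is stored as two plain mappings (protein identifier to set of gene identifiers, and gene identifier to protein identifier), and the function name and the sample identifiers are hypothetical.

```python
def relationships_are_inverse(has_source, has_product):
    """Check that hasSource and hasProduct are inverses of each other.

    has_source:  mapping protein id -> set of gene ids (Protein.hasSource)
    has_product: mapping gene id -> protein id        (Gene.hasProduct)
    Both mappings are an assumed in-memory encoding, not K2's own.
    """
    # Statement 1: every gene G in P.hasSource satisfies G.hasProduct = P.
    forward = all(
        has_product.get(g) == p
        for p, genes in has_source.items()
        for g in genes
    )
    # Statement 2: every gene G belongs to (G.hasProduct).hasSource.
    backward = all(
        g in has_source.get(p, set())
        for g, p in has_product.items()
    )
    return forward and backward


# Hypothetical identifiers, used only to exercise the check.
has_source = {"P1": {"g1", "g2"}, "P2": {"g3"}}
has_product = {"g1": "P1", "g2": "P1", "g3": "P2"}

consistent = relationships_are_inverse(has_source, has_product)
# Redirecting g3 to P1 violates both statements.
broken = relationships_are_inverse(has_source, {**has_product, "g3": "P1"})
```

With the sample data, the first call reports a consistent pair of relationships and the second reports a violation.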

Now assume that the data about proteins and genes reside in (for illustration purposes) four materialized data sources: Swiss-Prot, Orgs, Genes, and ProteinSynonyms. Swiss-Prot contains some protein data, which can be accessed through an SRS driver that presents an object-oriented schema (i.e., a class). Orgs and Genes contain organism and gene data, respectively, in two relations (in the same or in separate relational databases). The SQL data description is given here for these relations, but in fact K2 uses an equivalent description in ODL syntax, based on the observation that relations are simply sets of records. Finally, a Web-based data source contains protein name synonyms and is modeled as a dictionary.

SOURCE DATA DESCRIPTION

    class Swissprot
    (extent swissprots key Accession)
    {
      attribute string ID;
      attribute string Accession;
      attribute string Description;
      attribute list<string> GeneNames;
      attribute string Sequence;
      attribute int Sequence_Length;
    }

    CREATE TABLE Orgs
      (name   string,
       orgid  string
      );

    CREATE TABLE Genes
      (name      string,
       geneid    string,
       orgid     string,
       startpos  long,
       length    long
      );

    ProteinSynonyms : dictionary< string,
                                  set<struct{syn string, lang string}> >;

What follows is the K2MDL description of the integration and transformation that is performed when the sources Swiss-Prot, Orgs, Genes, and ProteinSynonyms are mapped into the ontology view. K2MDL descriptions look like the ODL definition in the ontology, enhanced with OQL expressions that compute the class extents, the attribute values, and the relationship connections (a related idea appears in a paper from the 1991 International Conference on Management of Data [18]).

The definition of the class Protein as a K2MDL classdef starts with an OQL statement that shows how to compute the extent proteins of this class by collecting the accession numbers from Swiss-Prot. The elements of the extent are used as object identifiers (OIDs) for the objects in the class. The rest of the definition shows how to compute the value of each attribute for a generic object identified by the OID (denoted by the keyword self). The OQL function element(c) extracts the unique element of the collection, c, when c is a singleton, and raises an exception otherwise.

MEDIATOR DESCRIPTION I

    classdef Protein
    (extent proteins { select distinct s.Accession
                       from swissprots s; })
    {
      attribute string SwissprotAccession { self; };
      attribute string recommendedName {
        element(select s.Description
                from swissprots s
                where s.Accession=self);
      };
      attribute set<string> alternateNames {
        select distinct ps.syn
        from swissprots s, ProteinSynonyms[s.Description] ps
        where s.Accession=self and ps.lang="eng";
      };
      attribute string sequence {
        element(select s.Sequence
                from swissprots s
                where s.Accession=self);
      };
      attribute int seqLength {
        element(select s.Sequence_Length
                from swissprots s
                where s.Accession=self);
      };
      relationship set<Gene> hasSource {
        select distinct gn
        from swissprots s, s.GeneNames gn
        where s.Accession=self;
      } inverse Gene::hasProduct;
    }

Note, for example, the computation of the value of the attribute alternateNames. For an object identified by self, find the Swiss-Prot entry, s, whose accession number is self, then use the description of s as a key in the dictionary ProteinSynonyms. The entry retrieved from the dictionary, ProteinSynonyms[s.Description], is a set of records. Select from this set the records with names in English and collect those names into the answer. The value of the attribute is a set of strings. A further query posed against the class Protein may, for example, select objects whose alternateNames attribute contains a given synonym.
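
To make the evaluation of alternateNames concrete, here is a minimal Python sketch of the same steps. It is not K2 code: the sources are modeled as in-memory lists and dictionaries, the accession numbers, descriptions, and synonyms are invented sample data, and the sets of records are held as lists because Python dictionaries cannot be set members. The helper element mirrors the behavior described for the OQL function of the same name.

```python
def element(collection):
    """Return the only member of a singleton collection; raise otherwise."""
    items = list(collection)
    if len(items) != 1:
        raise ValueError(f"expected exactly one element, got {len(items)}")
    return items[0]


# Invented sample data standing in for the Swiss-Prot source.
swissprots = [
    {"Accession": "A0001", "Description": "Example kinase"},
    {"Accession": "A0002", "Description": "Example transporter"},
]

# Invented sample data standing in for the ProteinSynonyms dictionary.
protein_synonyms = {
    "Example kinase": [
        {"syn": "EK", "lang": "eng"},
        {"syn": "Beispielkinase", "lang": "ger"},
        {"syn": "Sample kinase", "lang": "eng"},
    ],
    "Example transporter": [],
}


def recommended_name(self_oid):
    # element(select s.Description from swissprots s where s.Accession=self)
    return element(
        s["Description"] for s in swissprots if s["Accession"] == self_oid
    )


def alternate_names(self_oid):
    # select distinct ps.syn
    # from swissprots s, ProteinSynonyms[s.Description] ps
    # where s.Accession=self and ps.lang="eng"
    return {
        ps["syn"]
        for s in swissprots
        if s["Accession"] == self_oid
        for ps in protein_synonyms[s["Description"]]
        if ps["lang"] == "eng"
    }


def proteins_with_synonym(synonym):
    # A further query against the view: objects whose alternateNames
    # attribute contains the given synonym.
    return {
        s["Accession"]
        for s in swissprots
        if synonym in alternate_names(s["Accession"])
    }
```

The set comprehension plays the role of select distinct: duplicates collapse automatically, and the German synonym is filtered out by the language test.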
MEDIATOR DESCRIPTION II

    classdef Gene
    (extent genes { select distinct g.geneid from Genes g; })
    {
      struct Range
      {
        long from;
        long to;
      };
      attribute string name {
        element(select distinct g.name
                from Genes g
                where g.geneid=self);
      };
      attribute string organism {
        element(select o.name
                from Orgs o, Genes g
                where g.geneid=self and o.orgid=g.orgid);
      };
      attribute Range location {
        element(select struct(from: g.startpos, to: g.startpos+g.length-1)
                from Genes g
                where g.geneid=self);
      };
      relationship Protein hasProduct {
        element(select distinct s.Accession
                from swissprots s, s.GeneNames gn, Genes g
                where g.geneid=self and gn=g.name);
      } inverse Protein::hasSource;
    }

This example illustrates that relatively complex integrations and transformations can be expressed concisely and clearly, and can be easily modified.
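
As a small worked illustration of two attributes of the Gene mediator, the Python sketch below evaluates organism (a join between Orgs and Genes on orgid) and location (a record whose upper bound is startpos + length - 1). It is only a sketch under stated assumptions: the two relations are modeled as lists of dictionaries, all row values are invented, and element is a stand-in for the OQL function described earlier.

```python
def element(collection):
    """Return the only member of a singleton collection; raise otherwise."""
    items = list(collection)
    if len(items) != 1:
        raise ValueError(f"expected exactly one element, got {len(items)}")
    return items[0]


# Invented rows standing in for the Orgs and Genes relations.
orgs = [
    {"name": "Example organism", "orgid": "o1"},
    {"name": "Other organism", "orgid": "o2"},
]
genes = [
    {"name": "geneA", "geneid": "g1", "orgid": "o1",
     "startpos": 100, "length": 50},
]


def organism(self_oid):
    # select o.name from Orgs o, Genes g
    # where g.geneid=self and o.orgid=g.orgid
    return element(
        o["name"]
        for o in orgs
        for g in genes
        if g["geneid"] == self_oid and o["orgid"] == g["orgid"]
    )


def location(self_oid):
    # struct(from: g.startpos, to: g.startpos+g.length-1)
    return element(
        {"from": g["startpos"], "to": g["startpos"] + g["length"] - 1}
        for g in genes
        if g["geneid"] == self_oid
    )
```

For the sample gene, which starts at position 100 and has length 50, the computed range runs from 100 to 149 inclusive, which is why the mediator subtracts 1.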

8.4 INTERNAL LANGUAGE


The key to making ODL, OQL, and K2MDL work well together is the expressiveness of the internal framework of K2, which is based on complex values and dictionaries. ODL classes with extents are represented internally as dictionaries with abstract keys (the object identities). This framework opens the door to interesting optimizations that make the approach feasible.

The K2 internal language is organized by its type structure. There are base types such as string and number; record and variant types; collection types, namely sets, bags, and lists; and dictionary types. For each type construction there are two classes of operations: constructors, such as empty set and set union, and deconstructors, such as record field selection. The operations for collection types are

inspired by the theory of monads [19] and are outlined in an article in Theoretical Computer Science [20]. For details specific to aggregates and collection conversions see Proceedings of the 7th International Conference on Category Theory and Computer Science [10]; the operations on dictionaries are described in the Proceedings of the 1999 International Conference on Database Theory [9].

The internal language derives its expressiveness from its flexibility. The primitives are chosen according to the principle of orthogonality, which says that their meaning should not overlap and that one should not be able to simulate one primitive through a combination of the others. This produces a language with fewer, but better understood, primitives. As an example, consider the following basic query statement:

    select E(x)
    from x in R
    where P(x)

This translates internally into

    SetU(x in R) if P(x) then sngset(E(x)) else emptyset

where SetU(x in S) T(x) is the set deconstructor suggested by the theory of monads and sngset(e) is a singleton set (i.e., the set with just one element, e). The semantics of the set deconstructor is that of the union of a family of sets. For example, if S = {a_1, ..., a_n} then

    SetU(x in S) T(x) = T(a_1) ∪ ... ∪ T(a_n)

This approach increases the overall language expressiveness by allowing any type-correct combination of the primitives (the language is fully compositional). At the same time it provides a systematic approach to the identification of optimization transformations by yielding an equational theory for the internal language. The formulas of such a theory are equalities between equivalent parts of queries, and they are used for rewriting queries in several of the stages of the optimizer. Finally, K2 exploits known efficient physical algorithms for operations such as joins by automatically identifying within queries the groups of primitives that compute these operations.
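
The translation of a select statement into the set deconstructor can be mimicked in a few lines of Python. This is a sketch for intuition only, not K2's internal representation: set_union, sngset, EMPTYSET, and select are names chosen here to echo SetU, sngset, and emptyset, and Python sets stand in for K2 sets.

```python
def set_union(source, body):
    """SetU(x in source) body(x): the union of the family of sets body(x)."""
    result = set()
    for x in source:
        result |= body(x)
    return result


def sngset(e):
    """The singleton set containing just e."""
    return {e}


EMPTYSET = frozenset()


def select(E, R, P):
    """select E(x) from x in R where P(x), via its internal translation:
    SetU(x in R) if P(x) then sngset(E(x)) else emptyset
    """
    return set_union(R, lambda x: sngset(E(x)) if P(x) else EMPTYSET)


# Squares of the even members of a small set.
squares_of_evens = select(lambda x: x * x, {1, 2, 3, 4}, lambda x: x % 2 == 0)
```

Here squares_of_evens equals {4, 16}, the same result as the comprehension {x * x for x in {1, 2, 3, 4} if x % 2 == 0}. The point of the encoding is that filtering and projection need no primitives of their own; both fall out of the union of a family of sets combined with a conditional.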

8.5 DATA SOURCES


K2 maps data from external sources into its internal language, as described previously. K2 also has a notion of functions, which are used to provide access both to stand-alone applications such as BLAST (a sequence similarity package) and to

pre-defined and user-defined data conversion routines. This flexibility allows K2 to represent most data sources faithfully and usefully.

K2 accesses external information through data drivers. This is an intermediate layer between the K2 system proper and the actual data sources. There are two kinds of drivers in K2: those that are tightly integrated with the server and those that are more loosely connected.

Integrated data drivers (IDDs) are created by extending the two abstract Java classes that form K2's driver API. The IDDs export a set of entry points to K2, connect to the data source, send queries to it, receive results from it, and package the results for use in the rest of the K2 system. The tight coupling of IDDs with the K2 system minimizes the overhead associated with connecting to the data source and allows for additional optimizations. K2 can also cache results of queries sent to the IDDs to improve overall speed.

K2 comes with an IDD that can connect to any relational database system that implements Sun's JDBC API. A Sybase-specific IDD is also available, which takes advantage of some features of Sybase that are not available through JDBC. Oracle- and MySQL-specific versions are currently under development. To connect to a new relational database that supports JDBC, one merely needs to add the connection information to a configuration file, and K2 will automatically expose the underlying schema for querying. When the underlying schema changes, the K2 administrator must restart the IDD so it can rediscover the new schema. A procedure for automatically detecting and rediscovering schema changes is planned as a future enhancement.

Another IDD provided with K2 makes use of the World Wide Web Wrapper Factory (W4F), also developed at the University of Pennsylvania (see footnote 6). W4F is a toolkit for the generation of wrappers for Web sources. New wrappers can easily be generated using W4F's interface and can then be converted automatically to K2 drivers. Changes to the format of Web sources are partly handled by W4F's declarative wrapper specification language. Large format changes require human intervention to re-define and re-generate wrappers.

A very powerful feature of K2 is its ability to distribute query execution using its IDD for Java RMI. This IDD can make an RMI connection to a remote K2 server and send it part of the local query for processing. Therefore, all that is required to connect to the remote K2 server and start distributing queries is a change to the local K2 configuration file.

Sometimes it is difficult to develop an IDD for a new type of data source. For example, to treat a group of flat files as a data source it is often easier to write a

6. Information about W4F is available at http://db.cis.upenn.edu/Research/w4f.html.

Perl script to handle the string manipulations involved rather than implementing them in Java, as would be required in an IDD. In fact, some data sources cannot be accessed at all from Java but only through APIs in other languages. To handle this, K2 has an IDD called the PipeDriver that does not connect to the data source directly, but to a decoupled data driver (DDD).

A DDD is a simple, stand-alone application, written in any language at all, that accepts queries through its standard input stream and writes results to its standard output. The PipeDriver takes care of sending queries to the DDD and converting the results into K2's internal representation. It can run multiple DDDs at once to take advantage of parallelism, and it can make use of the caching mechanism built into IDDs, all of which simplifies the job of the DDD writer.

The DDD is responsible for establishing a connection to the data source (often nothing is required in this step), telling the PipeDriver it has made the connection, and waiting for a query to come in. When the DDD receives a query, it extracts the appropriate result from the data source and writes it out in a simple data exchange format. It then returns to waiting for the next query. This loop continues until the K2 server is brought down, at which point the PipeDriver tells the DDD to terminate.

DDDs have been written to connect with SRS, KEGG, and BLAST, as well as a number of Web-based sources. The time it takes to create a new DDD depends greatly on the capabilities of the data source for which it is being written and on how much intelligence is to be built into the DDD. For example, it only takes an hour or two to write a Perl script to connect to a simple document storage system; the script must be written to receive an ID, retrieve the document, and print it out in K2's exchange format, taking into account any error conditions that might occur.

However, the DDD writer has the flexibility to create special-purpose DDDs of any complexity. One DDD has been written that performs queries over a collection of documents that come from a remote Web site. This DDD maintains a local disk cache of the documents to speed access. It is responsible for downloading new versions of out-of-date documents, taking concurrency issues into account, and parsing and indexing documents on the fly. It also supports a complex language for querying the documents and for retrieving structured subsets of their components. This DDD was written over the course of two weeks and has been expanded periodically since.
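
The query loop of a simple document-retrieval DDD, as described above, might look roughly like the following Python sketch. Everything protocol-specific here is an assumption made for illustration: the text does not specify K2's exchange format or the messages exchanged with the PipeDriver, so the READY handshake line, the TERMINATE command, the OK and ERROR prefixes, and the function name serve are all hypothetical. The streams are passed in as parameters so the loop can be exercised without a real PipeDriver.

```python
import io
import sys

READY = "READY"          # hypothetical "connection made" handshake line
TERMINATE = "TERMINATE"  # hypothetical shutdown command


def serve(documents, inp=sys.stdin, out=sys.stdout):
    """Answer one document ID per input line until told to terminate.

    documents is an in-memory mapping standing in for the data source.
    """
    # Step 1: the connection is trivial here; announce that it is made.
    print(READY, file=out, flush=True)
    # Step 2: wait for queries, answer each one, and go back to waiting.
    for line in inp:
        query = line.strip()
        if query == TERMINATE:
            break
        if not query:
            continue
        doc = documents.get(query)
        if doc is None:
            # Error conditions are reported instead of crashing the loop.
            print(f"ERROR unknown id {query}", file=out, flush=True)
        else:
            print(f"OK {doc}", file=out, flush=True)


# In-process demonstration with string buffers in place of real pipes.
demo_in = io.StringIO("doc1\nmissing\nTERMINATE\nignored\n")
demo_out = io.StringIO()
serve({"doc1": "first example document"}, demo_in, demo_out)
```

A stand-alone script would call serve with its real document store and the default standard streams. Flushing after every line matters in that setting, because the process on the other end of the pipe is waiting for each answer before it sends the next query.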

8.6 QUERY OPTIMIZATION

K2 has a flexible, extensible query optimizer that uses both rewrite rules and a cost model. The rewrite rules are used to transform queries into structurally minimal forms whose execution is always faster than the original query; this is independent of the nature of the data. On the other hand, a cost model uses information about the nature of the data, such as the size of the data sets, selectivity of joins, available bandwidth, and latency of the data sources. The cost information is used to choose between minimal queries that are incomparable with respect to the rewrite rules.

After the query is translated into an abstract syntax tree, which K2 uses to represent queries internally, the tree is manipulated by applying a series of rewrite rules. This is where the bulk of K2's optimization work is done. First, a collection of rules is applied that simplifies the query by taking pieces expressed using certain kinds of tree nodes and replacing them with others. This reduces the number of node types that must be handled, thus reducing the number and complexity of the rewrite rules that follow.

Next, the query is normalized. Normalization rules include steps such as taking a function applied to a conditional structure and rewriting it so the function is applied to each of the expressions in the condition. Another normalization rule removes loops that range over collections known to be empty. Repeated application of the normalization rules, which currently number more than 20, reduces the query to a minimal, or normal, form.

A final set of rewrite rules is then applied to the normalized query. These rules include parallelizing the scanning of external data sources and pushing selections, projections, and joins down to the drivers where possible.

Even after all the rewrite rules have been applied, there may still be room for further optimization. In particular, a query may have a family of minimal forms rather than a single one. To choose between the minimal forms, the expected execution time of each version of the query is estimated using a cost model, and the fastest query form is chosen. The current cost model is still in the development stage. While it is functional and works well most of the time, it does not always choose the optimal form of the query.
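The normalization step described in this section can be illustrated with a small sketch. It assumes a toy abstract syntax tree made of nested Python tuples and implements only the two rules mentioned in the text (pushing a function into a conditional and removing loops over empty collections). The node shapes and function names are hypothetical and are not K2's internal representation.

```python
# Toy AST node shapes (all hypothetical):
#   ("apply", func, expr)             apply a function to an expression
#   ("if", cond, then_expr, else_expr)
#   ("for", var, collection, body)    loop over a collection
#   ("empty",)                        a collection known to be empty
EMPTY = ("empty",)

def push_function_into_conditional(node):
    """apply(f, if c then a else b)  ->  if c then apply(f, a) else apply(f, b)"""
    if node[0] == "apply" and isinstance(node[2], tuple) and node[2][0] == "if":
        _, func, (_, cond, then_e, else_e) = node
        return ("if", cond, ("apply", func, then_e), ("apply", func, else_e))
    return node

def drop_loop_over_empty(node):
    """for x in <empty> do body  ->  <empty>"""
    if node[0] == "for" and node[2] == EMPTY:
        return EMPTY
    return node

RULES = [push_function_into_conditional, drop_loop_over_empty]

def rewrite_once(node):
    """Rewrite the children first, then try each rule at this node."""
    if not isinstance(node, tuple):
        return node  # leaves (names, constants) are left unchanged
    node = tuple(rewrite_once(child) for child in node)
    for rule in RULES:
        node = rule(node)
    return node

def normalize(query):
    """Apply the rules repeatedly until the tree stops changing."""
    while True:
        rewritten = rewrite_once(query)
        if rewritten == query:
            return query
        query = rewritten
```

Running the rules to a fixed point is what makes the result a normal form with respect to this rule set: no rule can change the tree any further. Choosing among several such forms would still require a cost estimate, which this sketch does not attempt.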

8.7 USER INTERFACES


K2 has been developed using a client-server model. The K2 server listens for connections either through a socket or through Java RMI. It is easy to develop a client that can connect to K2 through one of these paths, issue queries, and receive results. Three basic clients come with K2: a text-based client, an RMI client, and one that runs as a servlet.

The interactive, text-based client connects to a K2 server through a socket connection. It accepts a query in OQL through a command-line-style interface, sends it to the server, gets the result back as formatted text, and displays it; then it waits for the next query to be entered. This simple client generally is used to test the socket connection to K2 and to issue simple queries during the development process. It is not intended to be an interface for end users.

The other type of user connection is through RMI. These connections are capable of executing multiple queries at once and can halt execution of queries in progress. This is the connection method that K2 servers use to connect to other K2 servers to distribute the execution of a query.

There is a client that makes an RMI connection to a K2 server with administrator privileges. The server restricts these connections to certain usernames connecting from certain IP addresses and requires a password. Currently, an administrator can examine the state of the server, add and remove connections to individual drivers, stop currently running queries, disconnect clients, and bring the server to a state where it can be stopped safely. More functionality is planned for administrators in the future.

A client that runs as a servlet is also included with K2. Using code developed at the Computational Biology and Informatics Laboratory (at the University of Pennsylvania), this servlet allows entry of ad hoc K2 queries and maintains the results for each username individually.

A major component of any user interface is the representation of the data to the user. As exemplified previously, a user (perhaps one serving a larger group) can define in K2MDL a transformed/integrated schema for a class of users and applications and can specify how the objects of this schema map to the underlying data sources. Users of this schema (called an ontology by some) can issue vastly simplified queries against it, without knowledge of the data sources themselves.

8.8 SCALABILITY


In theory, the K2 system can be used to interconnect an arbitrarily large number of data sources. In practice, the system has been configured with up to 30 data sources and software packages, including GUS, PubMed, MacOStat, GenBank, Swiss-Prot, BLAST, KEGG, and several relational databases maintained in Oracle, Sybase, and MySQL. Even when querying using this configuration, however, it has been rare to access more than five data sources and software packages in the same query.

The primary obstacles to scaling K2 to a larger system are (1) writing data drivers to connect to new data sources and (2) peak memory consumption. The difficulty of writing data drivers to external data sources has been mitigated to some extent by the fact that they are type specific rather than instance specific. For example, once an Oracle driver is written, it can be used for any Oracle data source. On the other hand, an AceDB driver must take into account the schema of the AceDB source and is therefore not generic. Drivers are also relatively simple in K2 because they merely perform data translation. Any complex, semantic transformations are performed by K2MDL code, which is high-level and therefore more maintainable.

While peak memory consumption has not as yet been an issue for the queries handled by K2 in the past, it could become a problem as applications become larger. K2 provides a means of limiting the number of queries it will run simultaneously and the number of data source connections it will maintain. This allows an administrator to tune the system to the capabilities of the machine on which it is running.

Because K2 is a view integration environment, it does not store any data locally; however, it may need to store intermediate results for operations that cannot be streamed (i.e., processed on the fly). At present, intermediate results are stored in main memory.

Examples of operations that cannot be streamed include sorting, set difference, nesting, and join. For example, when the difference of two data sets is taken, no output can be issued until both data sets are read; one of the data sets may have to be cached while the other is streamed and the difference is calculated. Similarly, although a join output can be produced as soon as a match is found between elements of the two data sets, data cannot be discarded until it is known not to match any future input from the other data set (see two International Conference on the Management of Data (SIGMOD) articles [21, 22] for discussions of implementations of operators in streaming environments). Thus, when data sets are large, these operations may need to cache temporary results in persistent memory. Such techniques are not currently part of the K2 system and require further development.

Another difficulty of scaling to an arbitrarily large environment is the sheer complexity of understanding what information is available. To mitigate this, smaller mediated components can be composed to form larger mediated components. Thus, users need not be aware of the numerous underlying data sources and software systems, but they can interact with the system through a high-level interface representing the ontology of data.
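The distinction drawn in this section between operations that stream and operations that must hold intermediate results can be illustrated with two small Python generators. They are sketches under stated assumptions, not K2 code: the first caches one input in memory before streaming the other, and the second uses one common technique for streaming joins (a symmetric hash join), which is not necessarily what K2 or the cited systems implement. All function names and the `(side, key, value)` event layout are invented for the example.

```python
from collections import defaultdict

def streamed_difference(left, right):
    """Yield the items of `left` that do not occur in `right`.

    `right` must be read completely and cached before the first result can
    be produced; `left` is then streamed one item at a time.
    """
    cached = set(right)  # blocking step: the whole right input sits in memory
    for item in left:    # streaming step: nothing from `left` is retained
        if item not in cached:
            yield item

def symmetric_hash_join(events):
    """Join two interleaved inputs on a key, emitting matches as they appear.

    `events` yields (side, key, value) with side "L" or "R". Every tuple seen
    so far is kept, because a later tuple from the other side may match it.
    """
    seen = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, value in events:
        other = "R" if side == "L" else "L"
        seen[side][key].append(value)
        for match in seen[other].get(key, []):
            # Always report pairs as (left value, right value).
            yield (value, match) if side == "L" else (match, value)
```

In both cases memory use grows with the size of the inputs (`cached` in the first function, `seen` in the second), which is why large data sets would eventually require spilling temporary results to disk.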

8.9 IMPACT
As with Kleisli, a tremendous enhancement in productivity is gained by expressing complicated integrations in a few lines of K2MDL code as opposed to much larger programs written in Perl or C++. What this means for the system integrator is the ability to build central client-server or mix-and-match components that interoperate with other technologies. Among other things, K2 provides:

• Enhanced productivity (50 lines of K2MDL correspond to thousands of lines of C++)
• Maintainability and easy transitions (e.g., warehousing)
• Re-usability (structural changes easy to make at the mediation language level)
• Compliance with ODMG standards

K2 has been used extensively in applications within the pharmaceutical company GlaxoSmithKline. Some of the major benefits of the system exploited in these applications have been the ease with which ontologies can be represented in K2MDL and the ability to conveniently compose small mediators into larger mediators.

K2 was also used to implement a distributed genomic-neuroanatomical database. The system combines databases and software developed at the Center for Bioinformatics at the University of Pennsylvania, including databases of genetic and physical maps, genomic sequences, transcribed sequences, and gene expression data, all linked to external biology databases and internal project data (GUS [14]), with mouse brain atlas data and visualization packages developed at the Computer Vision Laboratory and Brain Mapping Center at Drexel University. The biological and medical value of the activity lay in the ability to correlate specific brain structures with molecular and physiological processes. The technological value of the activity was that K2 was ported to a Macintosh operating system environment (MacOSX), visualization packages were tightly integrated into the environment, and both an ethernet and a gigabit network were used in the application. The work was significantly facilitated by the fact that K2 is implemented in Java. K2 now runs on Linux, Sun Solaris, Microsoft Windows, and Apple MacOS platforms.

8.10 SUMMARY

This chapter presented the K2 system for integrating heterogeneous data sources. K2 is general purpose, written in Java, and includes JDBC interfaces to relational sources as well as interfaces for a variety of special-format bioinformatics sources. The K2 system has a universal internal data model that allows for the direct representation of relational, nested complex value, object-oriented, information retrieval (dictionaries), and a variety of electronic data interchange (EDI) formats, including XML-based sources. The internal language features a set of equivalence laws on which an extensible optimizer is based. The system has user interfaces based on the ODMG standard, including a novel ODL-OQL combined design for high-level specifications of mediators.

ACKNOWLEDGMENTS


The authors would like to acknowledge the contributions of David Benton, Howard Bilofsky, Peter Buneman, Jonathan Crabtree, Carl Gustaufson, Kazem Lellahi, Yoni Nissanov, G. Christian Overton, Lucian Popa, and Limsoon Wong.

REFERENCES

[1] R. J. Robbins, ed. Report of the Invitational DOE Workshop on Genome Informatics, April 26-27, 1993. Baltimore, MD: DOE, 1993.

[2] J. Melton and A. Simon. Understanding the New SQL. San Francisco: Morgan Kaufmann, 1993.

[3] R. G. G. Cattell and D. Barry, eds. The Object Database Standard: ODMG 3.0. San Francisco: Morgan Kaufmann, 2000.

[4] P. Buneman, L. Libkin, D. Suciu, et al. "Comprehension Syntax." Special Interest Group on the Management of Data (SIGMOD) Record 23, no. 1 (March 1994): 87-96.

[5] S. Chawathe, H. Garcia-Molina, J. Hammer, et al. "The TSIMMIS Project: Integration of Heterogeneous Information Sources." In Proceedings of the 10th Meeting of the Information Processing Society of Japan Conference. Tokyo, Japan: 1994.

[6] A. Levy, D. Srivastava, and T. Kirk. "Data Model and Query Evaluation in Global Information Systems." Journal of Intelligent Information Systems 5, no. 2 (1995): 121-143.

[7] L. M. Haas, P. M. Schwarz, P. Kodali, et al. "DiscoveryLink: A System for Integrated Access to Life Sciences Data Sources." IBM Systems Journal 40, no. 2 (2001): 489-511.

[8] I-M. A. Chen and V. M. Markowitz. "An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools." Information Systems 20, no. 5 (1995): 393-418.

[9] L. Popa and V. Tannen. "An Equational Chase for Path Conjunctive Queries, Constraints, and Views." In Proceedings of the International Conference on Database Theory (ICDT), Lecture Notes in Computer Science, 39-57. Heidelberg, Germany: Springer Verlag, 1999.

[10] K. Lellahi and V. Tannen. "A Calculus for Collections and Aggregates." In Proceedings of the 7th International Conference on Category Theory and Computer Science, CTCS'97, Lecture Notes in Computer Science, vol. 1290, edited by E. Moggi and G. Rosolini, 261-280. Heidelberg, Germany: Springer-Verlag, 1997.

[11] T. Etzold and P. Argos. "SRS: An Indexing and Retrieval Tool for Flat File Data Libraries." Computer Applications of Biosciences 9, no. 1 (1993): 49-57.

[12] W. Fujibuchi, S. Goto, H. Migimatsu, et al. "DBGET/LinkDB: An Integrated Database Retrieval System." In Pacific Symposium on Biocomputing, 683-694. 1998.

[13] M. Rebhan, V. Chalifa-Caspi, J. Prilusky, et al. GeneCards: Encyclopedia for Genes, Proteins and Diseases. Rehovot, Israel: Weizmann Institute of Science, Bioinformatics Unit and Genome Center, 1997. http://bioinformatics.weizmann.ac.il/cards.

[14] S. Davidson, J. Crabtree, B. Brunk, et al. "K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources." IBM Systems Journal 40, no. 2 (2001): 512-531.

[15] S. Davidson, C. Overton, V. Tannen, et al. "BioKleisli: A Digital Library for Biomedical Researchers." Journal of Digital Libraries 1, no. 1 (November 1996): 36-53.

[16] G. Wiederhold. "Mediators in the Architecture of Future Information Systems." IEEE Computer 25, no. 3 (March 1992): 38-49.

[17] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, et al. "The TSIMMIS Approach to Mediation: Data Models and Languages." In Proceedings of the Second International Workshop on Next Generation Information Technologies and Systems, 185-193. 1995.

[18] S. Abiteboul and A. Bonner. "Objects and Views." In Proceedings of the 1991 ACM SIGMOD International Conference on the Management of Data, 238-247. San Francisco: ACM Press, 1991.

[19] S. MacLane. Categories for the Working Mathematician. Berlin, Germany: Springer-Verlag, 1971.

[20] P. Buneman, S. Naqvi, V. Tannen, et al. "Principles of Programming with Collection Types." Theoretical Computer Science 149, no. 1 (1995): 3-48.

[21] Z. Ives, D. Florescu, M. Friedman, et al. "An Adaptive Query Execution System for Data Integration." In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 299-310. San Francisco: ACM Press, 1999.

[22] J. Chen, D. DeWitt, F. Tian, et al. "NiagaraCQ: A Scalable Continuous Query System for Internet Databases." In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 379-390. San Francisco: ACM Press, 2000.

CHAPTER 9

P/FDM Mediator for a Bioinformatics Database Federation

Graham J. L. Kemp and Peter M. D. Gray

The Internet is an increasingly important research tool for scientists working in biotechnology and the biological sciences. Many collections of biological data can be accessed via the World Wide Web, including data on protein and genome sequences and structure, expression data, biological pathways, and molecular interactions. Scientists' ability to use these data resources effectively to explore hypotheses in silico is enhanced if it is easy to ask precise and complex questions that span several different kinds of data resources to find the answer. Some online data resources provide search facilities to enable scientists to find items of interest in a particular database more easily. However, working interactively with an Internet browser is extremely limited when one wants to ask complex questions involving related data held at different locations and in different formats, as one must formulate a series of data access requests, run these against the various databanks and databases, and then combine the results retrieved from the different sources. This is both awkward and time-consuming for the user.

To streamline this process, a federated architecture and the P/FDM Mediator have been developed to integrate access to heterogeneous, distributed biological databases. The spectrum of choices for data integration is summarized in Figure 9.1. As advocated by Robbins, the approach presented in this chapter does not require that a common hardware platform or vendor database management system (DBMS) [1] be adopted by the participating sites. The approach presented here needs a "shared data model across participating sites," but does not require that the participating sites all use the same data model internally. Rather, it is sufficient for the mediator to hold descriptions of the participating sites that are expressed in a common data model; in this system, the P/FDM Mediator, the functional data model [2] is used for this purpose. Tasks performed by the P/FDM Mediator include determining which external databases are relevant in answering users' queries, dividing queries into parts that will be sent to different external databases, translating these subqueries into the language(s) of the external databases, and combining the results for presentation.

[Figure 9.1: Continuum from tightly coupled to loosely coupled distributed systems involving multiple databases [1]. From tightly coupled to loosely coupled: a single organizational entity overseeing information resources relevant to genome research; adoption of common DBMSs at participating sites; a shared data model across participating sites; common semantics for data publishing; common syntax for data publishing.]

9.1 APPROACH

9.1.1 Alternative Architectures for Integrating Databases
The aim is to develop a system that will provide uniform access to heterogeneous databases via a single high-level query language or graphical interface and will enable multi-database queries. This objective is illustrated in Figure 9.2. Data replication and multi-databases are two alternative approaches that could help to meet this objective.
Data Replication Approach

In a data replication architecture, all data from the various databases and databanks of interest would be copied to a single local data repository, under a single DBMS. This approach is taken by Rieche and Dittrich [3], who propose an architecture in which the contents of biological databanks including the EMBL nucleotide sequence databank and Swiss-Prot are imported into a central repository. However, a data replication approach may not be appropriate for this application domain for several reasons. Significantly, by adopting a data repository approach, the advantages of the individual heterogeneous systems are lost. For example, many biological data resources have their own customized search capabilities tailored to the particular physical representation that best suits that data set. Rieche and Dittrich [3] acknowledge the need to use existing software and propose implementing exporters to export and convert data from the data repository into files that can be used as input to software tools.

Another disadvantage of a data replication approach is the time and effort required to maintain an up-to-date repository. Scientists want access to the most recent data as soon as they have been deposited in a databank. Therefore, whenever one of the contributing databases is updated, the same update should be made to the data repository.

Figure 9.2: Users should be able to access heterogeneous, distributed bioinformatics resources via a single query language or graphical user interface. [Diagram: ad hoc queries and a graphical user interface sit above the federated resources, which include SRS (Swiss-Prot, BRENDA, etc.), MySQL (Ensembl), ACEDB (Wormbase), AMOS II, P/FDM, and POET.]
Multi-Database Approach

A multi-database approach is favored, one that makes use of existing remote data sources, with data described in terms of entities, attributes, and relationships in a high-level schema. The schema is designed without regard to the physical storage format(s). Queries are expressed in terms of the conceptual schema, and it is the role of a complex software component called a mediator [4] to decide what component data sources need to be accessed to answer a particular query, organize the computation, and combine the results. Robbins [1] and Karp [5] have also advocated a federated, multi-database approach.

In contrast to a data replication approach, a multi-database approach takes advantage of the customized search capabilities of the component data sources in the federation by sending requests to these from the mediator. The component resources keep their autonomy, and users can continue to use them exactly as before. There is no local mirroring, and updates to the remote component databases are available immediately. A multi-database approach does not require that large data sets be imported from a variety of sources, and it is not necessary to convert all data for use with a single physical storage schema. However, extra effort is needed to achieve a mapping from the component databases onto the conceptual model.
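The mediator's tasks described above (choosing the relevant sources, sending each a subquery, and combining the results) can be illustrated with a minimal Python sketch. This is not the P/FDM Mediator's implementation; the two source functions, their data, and the attribute names are hypothetical stand-ins for remote resources.

```python
# Hypothetical stand-ins for two remote component sources.
# Each answers requests in its own way and keeps its own data.
def query_structures(codes):
    """Pretend structure database: resolution per entry code."""
    data = {"1ABC": 1.8, "2XYZ": 2.5}
    return {c: data[c] for c in codes if c in data}


def query_sequences(codes):
    """Pretend sequence databank: chain length per entry code."""
    data = {"1ABC": 214, "2XYZ": 107, "3DEF": 330}
    return {c: data[c] for c in codes if c in data}


# The mediator's description of which source supplies which attribute.
SOURCES = {
    "resolution": query_structures,
    "chain_length": query_sequences,
}


def mediate(codes, attributes):
    """Send one subquery per requested attribute, then combine the results."""
    partial = {attr: SOURCES[attr](codes) for attr in attributes}
    combined = []
    for code in codes:
        # Keep only entries for which every source returned a value.
        if all(code in partial[attr] for attr in attributes):
            row = {"code": code}
            for attr in attributes:
                row[attr] = partial[attr][code]
            combined.append(row)
    return combined


rows = mediate(["1ABC", "3DEF"], ["resolution", "chain_length"])
```

Nothing is copied into a local repository in this sketch: each call reaches the (simulated) source at query time, which is the property that distinguishes the multi-database approach from data replication.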

9.1.2 The Functional Data Model


The system described in this chapter is based on the P/FDM object database system [6], which has been developed for storing and integrating protein structure data. P/FDM is itself based on the functional data model (FDM) [2], whose basic concepts are entities and functions. Entities are used to represent conceptual objects, while functions represent the properties of an object. Functions are used to model both scalar attributes, such as a protein structure's resolution and the number of amino acid residues in a protein chain, and relationships, such as the relationship between chains and the residues they contain. Functions may be single-valued or multi-valued, and their values can either be stored or computed on demand. Entity classes can be arranged in subtype hierarchies, with subclasses inheriting the properties of their superclass, as well as having their own specialized properties.
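As a rough illustration of these concepts, the following Python sketch models entities as objects and FDM functions as methods. The class and method names are hypothetical and chosen only to mirror the examples in the text; P/FDM itself is not implemented this way.

```python
class Entity:
    """An entity instance, identified by an object identifier."""

    def __init__(self, oid):
        self.oid = oid


class Chain(Entity):
    """Entity class for protein chains."""

    def __init__(self, oid, residues):
        super().__init__(oid)
        self._residues = list(residues)  # stored data

    def residues(self):
        """Multi-valued relationship function: the residues in this chain."""
        return list(self._residues)

    def num_residues(self):
        """Single-valued scalar function, computed on demand."""
        return len(self._residues)


class LightChain(Chain):
    """Subtype: inherits Chain's functions and adds its own."""

    def chain_type(self):
        return "light"


chain = LightChain("chain-1", ["ALA", "GLY", "SER"])
```

A caller of `num_residues()` cannot tell whether the value is stored or computed, which is the uniformity the functional data model aims for.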
Contrast the FDM with the more widely used relational data model, whose basic concept is the relation (a rectangular table of data). Unlike the FDM, the relational data model does not directly support class hierarchies or many-to-many relationships.

Daplex is the query language associated with the FDM and, to illustrate the syntax of the language, Figure 9.3 shows two Daplex queries expressed against an antibody database [7]. Query A prints "the names of the amino acid residues found at the position identified by Kabat code number 88 in variable domains of antibody light chains (VL domains)." This residue is located in the core of the VL domain and is spatially adjacent to the residue at Kabat position 23. Query B prints "the names of the residues at these two positions together with the computed distance between the centers of their alpha-carbon (CA) atoms." Thus, one can explore a structural hypothesis about the spatial separation of these residues being related to the residue types occurring at these positions. In Query A, ig_domain is an entity class representing immunoglobulin domains, and the values that the variable d takes are the object identifiers of instances of that class. The functions domain_type and chain_type are string-valued functions defined on the class ig_domain. The function domain_structure is a relationship function that returns the object identifier of the instance of the class structure that contains the ig_domain instance identified by the value of d. The expression protein_code(domain_structure(d)) illustrates an example of function composition. Query B shows nested loops in Daplex.

Query A:

    for each d in ig_domain such that
        domain_type(d) = "variable" and chain_type(d) = "light"
    print(protein_code(domain_structure(d)),
          name(d),
          name(kabat_residue(d, "88")));

Query B:

    for each s in structure
      for each vl in domain_structure_inv(s) such that
          domain_type(vl) = "variable" and chain_type(vl) = "light"
      print(protein_code(s),
            name(kabat_residue(vl, "88")),
            name(kabat_residue(vl, "23")),
            distance(atom(kabat_residue(vl, "23"), "CA"),
                     atom(kabat_residue(vl, "88"), "CA")));

Figure 9.3: Daplex queries against an antibody database. Query A: "For each light chain variable domain, print the PDB entry code, the domain name, and the name of the residue occurring at Kabat position 88." Query B: "For each VL domain, print the PDB entry code, the names of the residues at Kabat positions 23 and 88, and the distance between their alpha-carbon atoms."

FDM had its origins in early work [2, 8], done before relational databases were a commercial product and before object-oriented programming (OOP) and windows, icons, menus, and pointers (WIMP) interfaces had caught on. Although it is an old model, it has adapted well to developments in computing because it was based on good principles. First, it was based on the use of values denoting persistent identifiers for instances of entity classes, as noted by Kulkarni and Atkinson [9].


This later became central to the object database manifesto [10]. Also, it had the notion of a subtype hierarchy, and it was not difficult to adapt this to include methods with overriding, as in OOP [11]. Second, the notions of an entity, a property, and a relationship (as represented by a function and its inverse) corresponded closely to the entity-relationship (ER) model and ER diagrams, which have stood the test of time. Third, it used a query language based on applicative expressions, which combined data extraction with computation. Thus, it was a mathematically well-formed language, based on the functional languages [12, 13], and it avoided the syntactic oddities of structured query language (SQL).

In developing the language since early work on the extended functional data model (EFDM) [9], Prolog was used as the implementation language because it is so good for pattern matching, program transformation, and code generation. Also, the data independence of the FDM, with its original roots in Multibase [14], makes it possible to interface to a variety of kinds of data storage, instead of using a persistent programming language with its own data storage. Thus, unlike the relational or object-relational models, FDM does not have a particular notion of storage (row or tuple) built into it. Nor does it have fine details of arrays or record structures, as used in programming languages. Instead, it uses a mathematical notion of mapping from entity identifier to associated objects or values. Another change has been to strengthen the referential transparency of the original Daplex language by making it correspond more closely to Zermelo-Fraenkel set expressions (ZF-expressions), a name taken from the Miranda functional language [13, 15], and also called list comprehensions.
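To make the correspondence with list comprehensions concrete, Query A of Figure 9.3 can be transcribed almost word for word into a Python list comprehension. This is only an illustrative sketch: the toy data and the dictionary-based encoding of entities are assumptions, and the function names simply mirror those used in the Daplex query.

```python
# Toy data (hypothetical): two structures and three immunoglobulin domains.
structures = {
    "s1": {"protein_code": "1ABC"},
    "s2": {"protein_code": "2XYZ"},
}
ig_domains = [
    {"name": "VL", "domain_type": "variable", "chain_type": "light",
     "structure": "s1",
     "kabat": {"23": {"name": "CYS"}, "88": {"name": "CYS"}}},
    {"name": "CL", "domain_type": "constant", "chain_type": "light",
     "structure": "s1", "kabat": {}},
    {"name": "VH", "domain_type": "variable", "chain_type": "heavy",
     "structure": "s2", "kabat": {"88": {"name": "ALA"}}},
]


# FDM-style functions over the toy entities.
def domain_type(d):
    return d["domain_type"]


def chain_type(d):
    return d["chain_type"]


def name(entity):
    return entity["name"]


def domain_structure(d):
    return structures[d["structure"]]


def protein_code(s):
    return s["protein_code"]


def kabat_residue(d, position):
    return d["kabat"][position]


# Query A as a list comprehension: generator, filter, and result expression.
query_a = [
    (protein_code(domain_structure(d)), name(d), name(kabat_residue(d, "88")))
    for d in ig_domains
    if domain_type(d) == "variable" and chain_type(d) == "light"
]
```

The `for` clause plays the role of `for each d in ig_domain`, the `if` clause the role of `such that`, and the tuple the role of the `print` arguments, including the function composition `protein_code(domain_structure(d))`.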

9.1.3 Schemas in the Federation


The design philosophy of the P/FDM Mediator can be illustrated with reference to the three-schema architecture proposed by the ANSI Standards Planning And Requirements Committee (SPARC) [16]. This consists of the internal level, which describes the physical structure of the database; the conceptual level, which describes the database at a higher level and hides details of the physical storage; and the external level, which includes a number of external schemas or user views.

This three-schema architecture promotes data independence by demanding that database systems be constructed so they provide both logical and physical data independence. Logical data independence means that the conceptual data model must be able to evolve without changing external application programs. Only view definitions and mappings may need changing (e.g., to replace access to a stored field by access to a derived field calculated from others in the revised schema). Physical data independence allows the internal schema to be refined for improved performance without needing to alter the way queries are formulated.
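The example of replacing a stored field with a derived one can be sketched in Python as follows. The class names and the `report` function are hypothetical; the point is only that the application code is identical before and after the schema revision.

```python
class ChainV1:
    """Original schema: the residue count is a stored field."""

    def __init__(self, sequence, num_residues):
        self.sequence = sequence
        self._num_residues = num_residues

    def num_residues(self):
        return self._num_residues  # extracted from storage


class ChainV2:
    """Revised schema: the residue count is derived from the sequence."""

    def __init__(self, sequence):
        self.sequence = sequence

    def num_residues(self):
        return len(self.sequence)  # calculated on demand


def report(chain):
    """External application program: unchanged across both schema versions."""
    return f"{chain.num_residues()} residues"


before = report(ChainV1("ACDE", 4))
after = report(ChainV2("ACDE"))
```

Only the mapping behind `num_residues` changes between the two versions, which corresponds to the "view definitions and mappings" that may need revising.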

Figure 9.4: ANSI-SPARC schema architecture describing the mediator (left) and an external data resource (right). Each side shows three levels: external schema (EM for the mediator, ER for the resource), conceptual schema (CM, CR), and internal schema (IM, IR).

The clear separation between schemas at different levels helps in building a database federation in a modular fashion. In Figure 9.4 the ANSI-SPARC three-schema architecture is shown in two situations: in each of the individual data resources and in the mediator itself.

First, consider an external data resource. The resource's conceptual schema (which is called CR) describes the logical structure of the data contained in that resource. If the resource is a relational database, it will include information about table names, column names, and type information about stored values. With SRS [17], it is the databank names and field names. These systems also provide a mechanism for querying the data resource in terms of the table/class/databank names and column/tag/attribute/field names that are presented in the conceptual schema.


The internal schema (or storage schema, which is called IR) contains details of allocation of data records to storage areas, placement strategy, use of indexes, set ordering, and internal data structures that impact efficiency and implementation details [6]. This chapter is not concerned with the internal schemas of individual data resources. The mapping from the conceptual schema to the internal schema has already been implemented by others within each of the individual resources, and it is assumed this has been done to make best use of the resources' internal organization.

A resource's external schema (which is called ER) describes a view onto the data resource's conceptual schema. At its simplest, the external schema could be identical to the conceptual schema.
However, the ANSI-SPARC model allows for differences between the schemas at these layers so different users and application programmers can each be presented with a view that best suits their individual requirements and access privileges. Thus, there can be many external schemas, each providing users with a different view onto the resource's conceptual schema. A resource's external, conceptual, and internal schemas are represented on the right side of Figure 9.4.

The ANSI-SPARC three-layer model can also be used to describe the mediator central to the database federation, and this is shown on the left side of Figure 9.4. The mediator's conceptual schema (CM), also referred to as the federation's integration schema, describes the content of the (virtual) data resources that are members of the federation, including the semantic relationships that hold between data items in these resources. This schema is expressed using the FDM because, in the FDM, computed data in a virtual resource (both the derived results of arithmetic expressions and derived relationships) look no different from stored data. Both are the result of functions: one calculates and the other extracts from storage. As far as possible, the CM is designed based on the semantics of the domain, rather than consideration of the actual partitioning and organization of data in the external resources. Thus, through functional mappings, different attributes of the same conceptual entity can be spread across different external data resources, and subclass-superclass relationships between entities in the conceptual model of the domain might not be present explicitly in the external resources [18].

No one can expect scientists to agree on a single schema. Different scientists are interested in different aspects of the data and will want to see data structured in a way that matches the concepts, attributes, and relationships in their own personal model. This is made possible by following the ANSI-SPARC model; the principle of logical data independence means the system can provide different users with different views onto the integration schema. EM is used to refer to an external schema presented to a user of the mediator. In this chapter, queries are expressed directly against an integration schema (CM), but these could alternatively be expressed against an external schema (EM). If so, an additional layer of mapping functions would be required to translate the query from EM to CM.

A vital task performed by the mediator is to map between CM and the union of the different CR. To facilitate this process, another schema layer is introduced that, in contrast to CM, is based on the structure and content of the external data resources. This schema is internal to the mediator and is referred to as IM. The mediator needs to have a view onto the data resource that matches this internal schema; thus, IM and ER should be the same. FDM is used to represent IM/ER. Having the same data model for CM and IM/ER brings advantages in processing multi-database queries, as will be seen in Section 9.1.4.

Redrawing Figure 9.4 for the situation where there are u different external schemas presented to users and r data resources, the relationship between schemas in the federation is as shown in Figure 9.5.
The The five-level five-level schema schema shown shown there there is is in similar to to that that described described by by Sheth Sheth and and Larson Larson [ [19]. similar 1 9] . From n building From past past experience experience iin building the the prototype prototype system, system, designing designing the the IMIER IM/ER schema schema in in a a way way that that most most directly directly describes describes the the structure structure of of a a particular particular external external resource resource adds adds practical practical benefits. benefits. In In adding adding a a remote remote resource resource to to CM, CM, one one may may focus focus mainly mainly on on those those attributes attributes in in the the remote remote resource resource that that are are important important for for making making joins joins across across to to other other resources. resources. This This clarifies clarifies the the role role of of each each schema schema level level and and the the purpose purpose of of the the query query transformation transformation task task in in transforming transforming queries queries from from one one schema level level to to the the next next level level down. down. schema

9.1.4 Mediator Architecture


The role of the mediator is to process queries expressed against the federation's integration schema (CM). The mediator holds meta-data describing the integration schema and also the external schemas of each of the federation's data resources (ER). In P/FDM, these meta-data are held, for convenience of pattern matching, as Prolog clauses compiled from high-level schema descriptions.

The architecture of the P/FDM Mediator is shown in Figure 9.6. The main components of the mediator are described in the following paragraphs.

The parser module reads a Daplex query (Daplex is the query language for the FDM), checks it for consistency against a schema (in this case the integration schema), and produces a list comprehension containing the essential elements of the query in a form that is easier to process than Daplex text (this internal form is called ICode).
The simplifier's role is to produce shorter, more elegant, and more consistent ICode, mainly through removing redundant variables and expressions (e.g., if the ICode contains an expression equating two variables, that expression can be eliminated, provided that all references to one variable are replaced by references to the other), and flattening out nested expressions where this does not change the meaning of the query. Essentially, simplifying the ICode form of a query makes the subsequent query processing steps more efficient by reducing the number of equivalent ICode combinations that need to be checked.

FIGURE 9.5  Schemas in a database federation. [Diagram: the external schemas (EM) presented to users sit above the integration schema (CM); below CM, each data resource has its own ER, CR, and IR schema levels.]

The rule-based rewriter matches expressions in the query with patterns present on the left-hand side of declarative rewrite rules and replaces these with the right-hand side of the rewrite rule after making appropriate variable substitutions. Rewrite rules can be used to perform semantic query optimization. This capability is important because graphical interfaces make it easy for users to express inefficient queries that cannot always be optimized using general purpose query optimization strategies. This is because transforming the original query to a more efficient one may require domain knowledge (e.g., two or more alternative navigation paths may exist between distantly related object classes but domain knowledge is needed to recognize that these are indeed equivalent).

FIGURE 9.6  Mediator architecture. The components of the mediator are shown inside the dashed line. [Diagram labels: Graphical User Interface, User Queries, Query Results, Integration Schema, Remote Schemas & Conditions, Schema Compiler, Condition Compiler, Meta-Data, Mapping Functions, Simplifier, Rewriter, Optimizer, Reordering, ICode Rewriter, Enhanced ICode, Splitter, ICode 1 to 3, Code Generators, Query 1 to 3, Wrappers, Results 1 to 3, Result Fuser.]

A recent enhancement to the mediator is an extension to the Daplex compiler that allows generic rewrite rules to be expressed using a declarative high-level syntax [20]. This makes it easy to add new query optimization strategies to the mediator.

The optimizer module performs generic query optimization.

The reordering module reorders expressions in the ICode to ensure that all variable dependencies are observed.

The condition compiler reads declarative statements about conditions that must hold between data items in different external data resources so these values can be mapped onto the integration schema.
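The job of the reordering module can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `Expr` record, its `binds` and `needs` fields, and the greedy strategy are hypothetical and are not taken from the P/FDM implementation. The expressions are borrowed from the ICode of Figure 9.9 and deliberately shuffled.

```python
from typing import NamedTuple


class Expr(NamedTuple):
    """Hypothetical stand-in for one ICode expression."""
    text: str          # printable form of the expression
    binds: frozenset   # variables this expression gives a value to
    needs: frozenset   # variables that must already have a value


def reorder(exprs):
    """Return exprs in an order where every variable is bound before use.

    Greedy and stable: at each step, take the first remaining expression
    whose needed variables are all bound already.
    """
    remaining = list(exprs)
    bound = set()
    ordered = []
    while remaining:
        for i, expr in enumerate(remaining):
            if expr.needs <= bound:
                ordered.append(remaining.pop(i))
                bound |= expr.binds
                break
        else:
            raise ValueError("variable dependencies cannot be satisfied")
    return ordered


# Expressions from Figure 9.9, in a deliberately unusable order.
shuffled = [
    Expr('V5 = "1.1.1.1"', frozenset(), frozenset({"V5"})),
    Expr("V3 = acc(V2)", frozenset({"V3"}), frozenset({"V2"})),
    Expr("V2 <- swissprot_entries(V1)", frozenset({"V2"}), frozenset({"V1"})),
    Expr("V5 = ec_number(V1)", frozenset({"V5"}), frozenset({"V1"})),
    Expr("V1 <- enzyme", frozenset({"V1"}), frozenset()),
]
ordered_texts = [expr.text for expr in reorder(shuffled)]
```

In this sketch the generator for V1 is moved to the front, and the test on V5 ends up after the expression that assigns V5. A real reordering module would presumably also weigh cost, which this sketch ignores.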
The ICode rewriter expands the original ICode by applying mapping functions that transform references to the integration schema into references to the federation's component databases. Essentially the same rewriter mentioned previously is used here, but with a different set of rewrite rules. These rewrite rules enhance the ICode by adding tags to indicate the actual data sources that contain particular entity classes and attribute values. Thus, the ICode rewriter transforms the query expressed against the CM into a query expressed against the ER of one or more external databases.

The crucial idea behind the query splitter is to move selective filter operations in the query down into the appropriate chunks so they can be applied early and efficiently using local search facilities as registered with the mediator [KIG94].
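The rewriting step used by both the rule-based rewriter and the ICode rewriter (match a left-hand side, substitute variables, emit the right-hand side) can be sketched in Python as follows. This is a minimal sketch, not the mediator's Prolog implementation: the tuple encoding of expressions, the `?name` convention for rule variables, and the `db_I` tag are all assumptions made for illustration.

```python
def match(pattern, term, bindings):
    """Match a pattern against a term; strings starting with '?' are variables.

    Returns an extended bindings dict, or None if the match fails.
    """
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        return {**bindings, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        if len(pattern) != len(term):
            return None
        for sub_pattern, sub_term in zip(pattern, term):
            bindings = match(sub_pattern, sub_term, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None


def substitute(template, bindings):
    """Replace rule variables in a template with their bound values."""
    if isinstance(template, tuple):
        return tuple(substitute(part, bindings) for part in template)
    return bindings.get(template, template)


def rewrite(exprs, lhs, rhs):
    """Replace each expression matching lhs with the instantiated rhs list."""
    result = []
    for expr in exprs:
        bindings = match(lhs, expr, {})
        if bindings is None:
            result.append(expr)
        else:
            result.extend(substitute(part, bindings) for part in rhs)
    return result


# Hypothetical rule: tag protein_name with the data source that holds it.
lhs = ("restrict", "protein_name", "?obj", "?val")
rhs = [("restrict", ("db_I", "protein_name"), "?obj", "?val")]

icode = [
    ("generate", "enzyme", "V1"),
    ("restrict", "protein_name", "V1", "V6"),
]
enhanced = rewrite(icode, lhs, rhs)
```

Because a rule's right-hand side is a list, one expression against the integration schema can expand into several expressions against external schemas, which is the shape of the mapping function in Figure 9.11.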
The mediator identifies which external databases hold data referred to by parts of an integrated query by inspecting the meta-data, and adjacent query elements referring to the same database are grouped together into chunks. Query chunks are shuffled and variable dependencies are checked to produce alternative execution plans. A generic description of costs is used to select a good schedule/sequence of instructions for accessing the remote databases.

Each ICode chunk is sent to one of several code generators. These translate ICode into queries that are executable by the remote databases, transforming query fragments from ER to CR. New code generators can be linked into the mediator at runtime.

Wrappers deal with communication with the external data resources. They consist of two parts: code responsible for sending queries to remote resources and code that receives and parses the results returned from the remote resources. Wrappers for new resources can be linked into the mediator at runtime. Note that a wrapper can only make use of whatever querying facilities are provided by the federation's component databases. Thus, the mediator's conceptual model (CM) will only be able to map onto those data values that are identified in the remote resource's conceptual model (CR). For example, queries involving concepts like gene and chromosome in CM can only be transformed into queries that run against a remote resource if that resource exports these concepts.

The result fuser provides a synchronization layer, which combines results retrieved from external databases so the rest of the query can proceed smoothly. The result fuser interacts tightly with the wrappers.
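The chunking performed during query splitting, in which adjacent query elements that refer to the same database are grouped together, can be sketched with Python's `itertools.groupby`. The database tags (`db_II`, `db_I`, `ebi`) and the pairing of each expression with a tag are assumptions for illustration; the expressions loosely follow Figure 9.10. Plan generation and cost-based selection are not modelled.

```python
from itertools import groupby


def split_into_chunks(tagged_exprs):
    """Group adjacent (database, expression) pairs that share a database."""
    return [
        (database, [expr for _, expr in group])
        for database, group in groupby(tagged_exprs, key=lambda pair: pair[0])
    ]


# Hypothetical tagged ICode: each expression is paired with the database
# that the meta-data says holds the data it refers to.
tagged = [
    ("db_II", "V2 <- enzyme"),
    ("db_II", "V3 = ec_number(V2)"),
    ("db_II", "V1 = pdb_code(V2)"),
    ("db_I", "V4 <- protein"),
    ("db_I", "V5 = protein_code(V4)"),
    ("ebi", "V9 <- pdb_entry"),
    ("ebi", "V10 = id(V9)"),
]
chunks = split_into_chunks(tagged)
```

`groupby` only merges neighbours, so two runs of expressions for the same database that are separated by another database's expressions stay in separate chunks. This matches the description of grouping adjacent elements.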

9.1.5 Example


A prototype mediator has been used to combine access to databanks at the EBI via an SRS server [17] and (remote) P/FDM test servers. Remote access to a P/FDM database is provided through a CORBA server [21]. This example, using a small integration schema, illustrates the steps involved in processing multi-database queries.

In this example, three different databases are viewed through a unifying integration schema (CM), which is shown in Figure 9.7(a). There are three classes in this schema: protein, enzyme, and swissprot_entry. A function representing the enzyme classification number (ec_number) is defined on the class enzyme, and enzymes inherit those functions that are declared on the superclass protein. Each instance of the class protein can be related to a set of swissprot_entry instances.

Figure 9.7(b) shows the actual distribution of data across the three databases; each of these three databases has its own external schema, ER. Db I is a P/FDM database that contains the codes and names of proteins. Db II is also a P/FDM database and contains the protein code (here called pdb_code) and enzyme classification code of enzymes. To identify Swiss-Prot entries at the EBI that are related to a given protein instance, one must first identify the Protein Data Bank (PDB) entry whose ID matches the protein code and then follow further links to find related Swiss-Prot entries. Relationships between data in remote databases can be defined by conditions that must hold between the values of the related objects. Constraints on identifying values are represented by dashed arrows in Figure 9.7.

Figure 9.8 shows a Daplex query expressed against the integration schema, CM. This query prints information about enzymes that satisfy certain selection criteria and their related Swiss-Prot entries. Figure 9.9 shows a pretty-printed version of the ICode produced when this query is compiled.
FIGURE 9.7  Integration schema and distributed databases (P/FDM and SRS).
(a) Integration Schema: class protein (protein_code, protein_name, swissprot_entries) with subclass enzyme (ec_number); class swissprot_entry (id, acc, def).
(b) Distributed Databases: P/FDM Db I holds protein (protein_code, protein_name); P/FDM Db II holds enzyme (pdb_code, ec_number); SRS, Swiss-Prot at EBI holds pdb_entry (id, ..., link) and swissprot_entry (id, acc, def, ...).

    for each e in enzyme such that ec_number(e) = "1.1.1.1"
    for each s in swissprot_entries(e)
    print(protein_name(e), def(s), acc(s));

FIGURE 9.8  Daplex query expressed against an integration schema.

    [ V6, V4, V3 |
      V1 <- enzyme ;
      V2 <- swissprot_entries(V1) ;
      V3 = acc(V2) ;
      V4 = def(V2) ;
      V5 = ec_number(V1) ;
      V5 = "1.1.1.1" ;
      V6 = protein_name(V1) ]

FIGURE 9.9  ICode corresponding to the query in Figure 9.8.

This ICode is then processed by the query splitter, producing ICode (in terms of the resources' external schemas, ER) that will be turned into queries that will be sent to the three external data resources (Figure 9.10). Note that the variable V1 is common to all three query fragments. Values for this variable retrieved from P/FDM Db II are used in constructing queries to be sent to the other data resources.

    ICode for P/FDM Db II:
    [ V1 | V2 <- enzyme ; V3 = ec_number(V2) ; V3 = "1.1.1.1" ;
      V1 = pdb_code(V2) ]

    ICode for P/FDM Db I:
    [ V6 | V4 <- protein ; V5 = protein_code(V4) ; V1 = V5 ;
      V6 = protein_name(V4) ]

    ICode for SRS:
    [ V7, V8 | V9 <- pdb_entry ; V10 = id(V9) ; V1 = V10 ; V11 <- link(V9) ;
      V12 <- swissprot_entry ; V13 = id(V12) ; V8 = acc(V12) ;
      V7 = def(V12) ; V13 in V11 ]

FIGURE 9.10  ICode sub-queries against the actual data resources that need to be accessed to answer the query in Figure 9.8.

In the example, the class protein is related to the class swissprot_entry in the integration schema by a multi-valued relationship function called swissprot_entries. The mapping function given in Figure 9.11 is used in transforming queries that contain this relationship into enhanced ICode that refers to the external schemas, ER, of the actual data resources. Mapping functions such as this can be compiled from high-level declarative rewrite rules and do not have to be written by hand.

    foreign(swissprot_entries, [protein], srs_sprot, entity, ebi_db, KeyICode) :-
        KeyICode = (V1, V2, [V3, V4, V5, V6],
          [ restrict(ebi_db:id, [ebi_db:srs_sprot], [V2], V6),
            restrict_subquery(some(V5),
              [ generate(ebi_db:pdb_entry, V3),
                restrict(ebi_db:id, [ebi_db:pdb_entry], [V3], V4),
                restrict(protein_code, [protein], [V1], V4),
                restrict(ebi_db:link, [ebi_db:pdb_entry], [V3], V5),
                expression([], [V6, V5], expr(=, V6, V5)) ]) ]).

FIGURE 9.11  Mapping function used to expand the relationship swissprot_entries in the integration schema into ICode that refers to data held at the EBI.

    for each d in ig_domain such that name(d) = "VH"
    for each r1 in kabat_residue(d, "34")
    for each r2 in kabat_residue(d, "78") such that
        distance(atom(r1, "CB"), atom(r2, "CB")) < 5.0
    for each s in swissprot_entries(domain_structure(d))
    print(protein_code(d), name(r1), name(r2), def(s), acc(s));

FIGURE 9.12  Daplex query that combines computation and data retrieval.
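The way the shared variable V1 ties the three sub-queries of Section 9.1.5 together can be traced with a small Python sketch. Everything concrete in it is an assumption: the records are invented placeholders rather than real database contents, the three `query_*` functions stand in for wrapped remote resources, and the nested comprehension is only one possible execution order, not necessarily the plan the mediator would choose.

```python
# Invented placeholder records; none of these values come from real databases.
DB_II_ENZYMES = [
    {"pdb_code": "P001", "ec_number": "1.1.1.1"},
    {"pdb_code": "P002", "ec_number": "2.7.1.1"},
]
DB_I_PROTEINS = [
    {"protein_code": "P001", "protein_name": "example protein A"},
    {"protein_code": "P002", "protein_name": "example protein B"},
]
SRS_PDB_ENTRIES = [
    {"id": "P001", "link": ["S1", "S2"]},
    {"id": "P002", "link": ["S3"]},
]
SRS_SWISSPROT_ENTRIES = [
    {"id": "S1", "acc": "ACC1", "def": "definition 1"},
    {"id": "S2", "acc": "ACC2", "def": "definition 2"},
    {"id": "S3", "acc": "ACC3", "def": "definition 3"},
]


def query_db_ii(ec_number):
    """Sub-query for Db II: values of V1 (pdb_code) for matching enzymes."""
    return [e["pdb_code"] for e in DB_II_ENZYMES if e["ec_number"] == ec_number]


def query_db_i(v1):
    """Sub-query for Db I: protein names whose protein_code equals V1."""
    return [p["protein_name"] for p in DB_I_PROTEINS if p["protein_code"] == v1]


def query_srs(v1):
    """Sub-query for SRS: (def, acc) of entries linked from PDB entry V1."""
    return [
        (s["def"], s["acc"])
        for entry in SRS_PDB_ENTRIES if entry["id"] == v1
        for s in SRS_SWISSPROT_ENTRIES if s["id"] in entry["link"]
    ]


def run(ec_number):
    """Feed each V1 value from Db II into the other two sub-queries."""
    return [
        (name, definition, acc)
        for v1 in query_db_ii(ec_number)
        for name in query_db_i(v1)
        for definition, acc in query_srs(v1)
    ]


results = run("1.1.1.1")
```

Each result tuple has the shape of the print clause in Figure 9.8: protein name, Swiss-Prot definition, and accession.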

9.1.6 Query Capabilities


Daplex is the query language of the system. The examples in Figure 9.3 and Figure 9.12 show how function calls can be composed in queries. The compositional form makes it easy to write complex queries and computations over the database that can be optimized by a query optimizer. This is a point often overlooked by the OOP community; Java and C++ have the necessary expressiveness, but because they lack referential transparency and a data model it is hard to make general optimizers for database applications in them. Daplex has greater expressive power than SQL (e.g., recursive functions can be defined directly in Daplex). This is particularly useful in many areas of bioinformatics, such as following transitive relationships through a sequence of reactions in a biochemical pathway or finding related biological terms in a hierarchical vocabulary.

As mentioned in Section 9.1.4, Daplex queries are converted into ICode for subsequent processing. The great advantage is that many important optimizations just involve reordering selection predicates and generators in the set expression. These, in turn, are conveniently implemented as rewrite rules in Prolog [6]. It was also shown how to expand definitions of derived functions in the course of optimizing set expressions [22]. This makes good use of the referential transparency of expressions in functional programming. By contrast, where the computation is embedded in C++ or Visual Basic with arbitrary assignments, it is very hard to do significant optimization. This has led to the widespread adoption of comprehensions (as ZF-expressions are now called). Buneman et al. [23] have shown the importance of distinguishing list, bag, and set comprehensions, so, strictly speaking, ZF-expressions compute bags but represent them by lists.
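Both points in the preceding paragraph can be seen in a few lines of Python, whose comprehensions have the same generator-and-predicate shape as ICode. This is an illustrative sketch with invented data; `enzymes` and `entries` are hypothetical stand-ins for two entity classes and a relationship between them.

```python
from collections import Counter

# Invented data: (enzyme id, EC number) pairs and their related entry ids.
enzymes = [("e1", "1.1.1.1"), ("e2", "2.7.1.1"), ("e3", "1.1.1.1")]
entries = {"e1": ["s1", "s2"], "e2": ["s3"], "e3": ["s1"]}

# Predicate written last: related entries are enumerated for every enzyme
# and only then tested.
late = [s for e, ec in enzymes for s in entries[e] if ec == "1.1.1.1"]

# Predicate moved ahead of the second generator: entries are enumerated
# only for enzymes that pass the test.
early = [s for e, ec in enzymes if ec == "1.1.1.1" for s in entries[e]]

# The same result viewed as a list, a bag, and a set.
as_list = early            # order and duplicates both kept
as_bag = Counter(early)    # duplicates counted, order not significant
as_set = set(early)        # duplicates removed, order not significant
```

The two comprehensions return the same values because none of the expressions has a side effect, which is the referential transparency the text relies on. The entry `s1` is reached from two enzymes, so the list and the bag record it twice while the set records it once.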

The Daplex query language enables arbitrary calculations to be combined with data retrieval operations [7]. For example, Figure 9.12 shows a query, expressed against an integration schema, that performs a geometric calculation on data in an antibody database and relates objects satisfying the given criteria with data in a remote Swiss-Prot database. The function distance computes a value rather than retrieving a stored value. As explained in Section 9.1.2, functions whose values are stored or derived look the same in the query, and the user cannot tell from looking at a result value whether it was retrieved from disc or computed.

Daplex queries frequently include calls to functions written in procedural languages. For example, when working with data on 3D protein structures, one often calls out to geometric code from within queries, including C routines to measure bond angles and torsion angles, and code to superpose one structural fragment on another to compare 3D similarity. Following the same approach, it would be possible to treat the results of a computation on a remote machine, such as a BLAST search, that are generated dynamically at run-time just like data values that are stored persistently on disc. The system does not yet have a wrapper for BLAST, but, in principle, such a wrapper would be implemented just like any other derived function in P/FDM.

The mediator does not currently cache query results, so subsequent queries cannot refer to a result set. However, both user interfaces described in Section 9.2.2 enable follow-on queries to be constructed incrementally based on the previous query.

For more complex P/FDM applications that cannot be expressed in Daplex, Prolog can be used [24]. However, unlike Daplex queries, these Prolog programs are not optimized automatically (see Section 9.2.1). The P/FDM system provides a set of Prolog routines that perform primitive data access operations, such as retrieving the object identifier of an instance of an entity class, retrieving the scalar value of an attribute of an object with a given object identifier, or retrieving the object identifier of a related object. Queries that require access to several data sources in the federation can be written directly in Prolog.

9.1.7 Data Sources


P/FDM was previously used with various data sources including hash files, relational databases, flat files (including some accessed via SRS), POET, AMOS II, and AceDB. There is no control over changes being made to remote resources, as remote sites retain their autonomy. Depending on the nature of the changes to a remote resource, this may require changes in one or both parts of the wrapper for that resource. Changes to a remote resource need not require changes to be made to the mediator's conceptual model (CM), though they may require some changes to be made to the declarative mapping functions associated with data in the changed resource.

9.2 ANALYSIS

The use of mediators was originally proposed by Wiederhold [4] and became an important part of the knowledge sharing effort architecture [25]. Examples of such intelligent, information-seeking architectures are Infosleuth [26] and KRAFT [27]. In this architecture, the mediator can run on the client machine, or else be available as middleware on some shared machine, while the wrapper is on the remote machine containing the knowledge source. The idea behind this is that existing knowledge sources can evolve their schemas, yet present a consistent interface to the mediator via suitable changes to the wrapper. For this purpose the wrapper may be as simple as an SQL view, or it may be more complex, involving mapping of code. In any case, the site is able to preserve some local autonomy. Other mediators do not have to worry about how the site evolves internally. Also, new sites can join a growing network by registering themselves with a facilitator.
All the mediator needs to know is how to contact the facilitator and that any knowledge sources the facilitator recommends will conform to the integration schema.

This chapter describes an alternative architecture, where the wrappers reside with the mediator. This has the advantage that there is no need to get the knowledge source to install and maintain custom-provided wrapper software. In the architecture, shown in Figure 9.6, the code generators produce code in a different query language or constraint language. Thus, they are used in two directions. In one direction, they map queries or constraints into a language that can be used directly at the knowledge source. This can be crucial for efficiency because it allows one to move selection predicates closer to the knowledge source in a form that is capable of using local indexes.
This can have a significant effect with database queries because it saves bringing many penny packets of data back through the interface, only to be filtered and rejected on the far side [28]. In the other direction, wrappers are used to map data values (e.g., by using scaling factors to change units or by using a lookup table to replace values by their new identifiers).

Note that building a so-called global integration schema is not advocated. Such schemas have often been criticized on the grounds that attempting to map every single concept in one all-embracing schema is both laborious and never-ending. Instead, an incrementally growing integration schema is envisaged, driven by user needs. Ideally the schema would be built interactively using a GUI and rules that suggest various mappings, as proposed by Mitra et al. [29] in their ONION system for incremental development of ontology mappings. The crucial thing to realize is that the integration schema represents a virtual database, which allows it to evolve much more easily than a physical database.

Related work in the bioinformatics field includes the Kleisli system presented in Chapter 6 [30, 31]. The query language used in Kleisli is the Collection Programming Language (CPL), which is a comprehension-based language in which the generators are calls to library functions that request data from specific databases according to specific criteria. Thus, when writing queries, the user must be aware of how data are partitioned across external sites. This contrasts with the approach taken in the P/FDM Mediator, where references to particular resources do not feature in the integration schema or in user queries. Of course, an interface based on domain concepts and without references to particular resources could be built on top of Kleisli.

The TAMBIS system presented in Chapter 7 [32] writes query plans in CPL.
Plans in TAMBIS are based on a classification hierarchy, whereas P/FDM plans are oriented toward ad hoc SQL3-like queries. However, the overall approach is similar to using a high-level intermediate code translated through wrappers.

Another related project is DiscoveryLink, presented in Chapter 11 [33]. The architecture of the DiscoveryLink system is similar to that presented in this chapter. DiscoveryLink uses the relational data model instead of FDM, and all the databases accessed via DiscoveryLink must present an SQL interface.
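The two directions of translation described earlier in this section (pushing selection predicates down to the source, and mapping returned values back) can be sketched as follows. This is a minimal illustration, not the P/FDM code generators: the function names, the `sequence_entry` table, the kilobase-to-base scaling, and the identifier lookup table are all assumptions made for the example, and an in-memory SQLite database stands in for a remote resource.

```python
import sqlite3
from typing import Dict, List, Sequence, Tuple

ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">="}


def generate_sql(table: str, columns: Sequence[str],
                 predicates: Sequence[Tuple[str, str, object]]) -> Tuple[str, list]:
    """Direction 1: translate a selection into SQL the source can run itself.

    Table and column names are assumed to come from a trusted schema
    description, so they are interpolated directly; data values are
    always passed as bound parameters.
    """
    where, params = [], []
    for column, operator, value in predicates:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        where.append(f"{column} {operator} ?")
        params.append(value)
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


def map_values(rows: List[Tuple[str, int]], scale: int,
               id_lookup: Dict[str, str]) -> List[Tuple[str, int]]:
    """Direction 2: rescale units and replace local ids by federation ids."""
    return [(id_lookup.get(ident, ident), length * scale)
            for ident, length in rows]


# A toy remote source: lengths stored in kilobases under local identifiers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence_entry (id TEXT, length_kb INTEGER)")
conn.executemany("INSERT INTO sequence_entry VALUES (?, ?)",
                 [("a1", 12), ("a2", 45), ("a3", 7)])

# The filter runs at the source, so only matching rows cross the interface.
sql, params = generate_sql("sequence_entry", ["id", "length_kb"],
                           [("length_kb", "<", 20)])
rows = sorted(conn.execute(sql, params).fetchall())
result = map_values(rows, scale=1000, id_lookup={"a1": "P001", "a3": "P003"})
```

Because the predicate is part of the generated SQL, the source can use its own indexes and the row for `a2` is never transferred. Filtering in the mediator instead would fetch all three rows first.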

9.2.1 Optimization


Optimization benefits greatly from using an easily transformable high-level representation based on functional composition. Three kinds of optimization are done within the P/FDM Mediator.

First, the rewriter can apply rules that perform semantic query optimization. Additionally, rewrite rules [20] can be used to implement the logical rules given by Jarke and Koch [34]. They can implement many forms of rewrites based on data semantics, as discussed in King [35]. They can spot opportunities to replace iteration by indexed search [33]. In experiments using the AMOS II system [36] as a remote resource, rewrite rules were able to implement flattening and un-nesting transformations that prevent wasting time compiling subqueries in AMOSQL. A similar approach could be adapted to features of other DBMSs. Most importantly, rewrites that change the relative workload between two processors in a distributed query can be performed.
Finally, all these rewrites can be combined, as some of them will enable others to take place. Thus, one can deal with many combinations without having to foresee them and code them individually.

Second, the optimizer performs generic query optimizations. The philosophy of the optimizer is to use heuristics to improve queries. It examines alternative execution plans and, although it uses a simple cost model, it is successful in avoiding inefficient strategies, and it often selects the most effective approach [15]. The optimizer was subsequently rewritten using a simple heuristic to avoid the combinatorial problem of examining all possible execution plans for complex queries [22].

Third, the query splitter attempts to group together query elements into chunks that can be sent as single units to the external data resources, thus providing the remote system with as much information as possible to give it greater scope for optimizing the sub-query. Outside the mediator, the approach is able to take advantage of the optimization capabilities of the external resources.

There is scope for introducing adaptive query processing techniques to improve the execution plans as execution proceeds and as results are returned to the mediator [37], but this has not yet been done in our prototype system.
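The grouping performed by the query splitter can be pictured with a deliberately simplified sketch. It assumes that a query has already been reduced to an ordered list of elements, each tagged with the resource that can evaluate it, and it merges maximal runs of adjacent elements for the same resource into one chunk. The `Element` type, the resource names, and the element texts are hypothetical; the real splitter works on the mediator's internal query representation and is not reproduced here.

```python
from itertools import groupby
from typing import List, NamedTuple, Tuple


class Element(NamedTuple):
    resource: str  # name of the external resource able to evaluate the element
    text: str      # illustrative description of the query element


def split_query(elements: List[Element]) -> List[Tuple[str, List[str]]]:
    """Group adjacent elements for the same resource into single chunks.

    Each chunk can then be translated and shipped as one sub-query, which
    gives the remote system the widest scope for its own optimization.
    """
    return [(resource, [element.text for element in run])
            for resource, run in groupby(elements, key=lambda e: e.resource)]


plan = [
    Element("enzyme_db", "generate enzyme e"),
    Element("enzyme_db", "restrict ec_number(e) = '1.1.1.1'"),
    Element("protein_db", "generate entry s related to e"),
    Element("protein_db", "restrict organism(s) = 'human'"),
    Element("enzyme_db", "print name(e)"),
]
chunks = split_query(plan)
```

With this plan, `chunks` contains three sub-queries instead of five separate requests. The final `enzyme_db` element stays separate because it is not adjacent to the first run; a fuller splitter might reorder elements to merge such cases, which this sketch does not attempt.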

9.2.2 User Interfaces


While queries against a P/FDM schema can be formulated directly in either Prolog or Daplex, this requires some programming ability and care must be taken to use the correct syntax. Therefore, two interfaces were developed to formulate queries without the user having to learn to program in either Prolog or Daplex. Both of these interfaces have a representation of the schema at their heart, and they enable the user to construct well-formed Daplex queries by clicking and typing values for attributes to restrict the result set.

A Java-based visual interface for P/FDM [38] was developed with a graphical representation of the database schema at its center. Figure 9.13 shows this interface in use with the schema for an antibody database [7]. Users construct queries by clicking on entity classes and relationships in the schema diagram and constraining the values of attributes selected from menus.
As this is done, the Daplex text of the query under construction is built up in a sub-window (the query editor window). Queries are submitted to the database via a CORBA interface [21]. Results satisfying the selection criteria are displayed in a table in a separate result window. A particularly novel feature of the interface is copy-and-drop, which enables the user to select and copy data values in the result window and then drop these into the query editor window. When this is done, the selected values are merged into the original query automatically, in the appropriate place in the query text, to produce a more specialized query. The Java-based interface runs as a Java application, but it does not yet run within a Web browser.

FIGURE 9.13 Visual Navigator query interface [38].

In addition, a Web interface was developed that uses hypertext markup language (HTML) forms and accesses the mediator via a CGI program (Figure 9.14). Such interfaces can be generated automatically from a schema file. The interface's front page lists the entity classes in the schema, and the user selects one of these as the starting point for the query. As the query is built up, checkboxes are used to indicate those attributes whose values are to be printed. The user can constrain the value of an attribute by typing into its entry box (e.g., 1.1.1.1 or <2.5) and can navigate to related objects using the selection box labeled relationships at the bottom of each object's representation within the Web page. Figure 9.14 shows the Web interface at the point where the user has formulated the query used in the example in Section 9.1.5. Pressing the Submit button will cause the equivalent Daplex query to be generated.

When using a graphical user interface that supports ad hoc querying, it is easy for naive queries that involve little or no data filtering to be expressed. This can result in queries that request huge result sets from remote resources.
An alternative approach, as in TAMBIS, would be to provide only user interfaces that guide the user toward constructing queries with a particular structure and that have a suitable degree of filtering. However, such an interface would constrain the user to asking only parameterized variants of a set of canned queries anticipated by the interface designer. While such interfaces could be implemented easily, the P/FDM design favors giving users the freedom to express arbitrary queries against a schema, and an area for future work is identifying and dealing with queries that could place unreasonable loads on the component databases in the federation.

FIGURE 9.14 Web-based query interface.

Current interfaces do not provide personalization capabilities. It is, however, possible to provide users with their individual views of the federation (see EM schemas in Figure 9.5), but this would be done by the database federation's administrator on behalf of users, rather than by users themselves.
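How a form-based interface might turn checkbox and entry-box state into query text can be sketched as below. This is only an illustration under stated assumptions: the output is Daplex-style text that approximates the language's `for each ... such that ... print` shape without claiming to follow its exact grammar, and the `ObjectForm` structure, the function names, and the enzyme and protein attribute names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

OPERATORS = ("<=", ">=", "<", ">", "=")  # longer operators are tried first


def parse_entry(text: str) -> Tuple[str, str]:
    """Turn entry-box text into (operator, literal).

    '<2.5' becomes ('<', '2.5'); a bare value such as '1.1.1.1' is taken
    to mean equality and is quoted unless it parses as a number.
    """
    text = text.strip()
    for operator in OPERATORS:
        if text.startswith(operator):
            return operator, text[len(operator):].strip()
    try:
        float(text)
        return "=", text
    except ValueError:
        return "=", f'"{text}"'


@dataclass
class ObjectForm:
    variable: str   # loop variable, e.g. "e"
    source: str     # entity class, or a relationship applied to another variable
    entries: List[Tuple[str, str]] = field(default_factory=list)  # (attribute, box text)
    printed: List[str] = field(default_factory=list)              # ticked attributes


def build_query(forms: List[ObjectForm]) -> str:
    """Assemble Daplex-style text from the state of each object's form."""
    lines, items = [], []
    for form in forms:
        clause = f"for each {form.variable} in {form.source}"
        tests = []
        for attribute, box_text in form.entries:
            operator, literal = parse_entry(box_text)
            tests.append(f"{attribute}({form.variable}) {operator} {literal}")
        if tests:
            clause += " such that " + " and ".join(tests)
        lines.append(clause)
        items += [f"{attribute}({form.variable})" for attribute in form.printed]
    lines.append("print(" + ", ".join(items) + ");")
    return "\n".join(lines)


query = build_query([
    ObjectForm("e", "enzyme", entries=[("ec_number", "1.1.1.1")]),
    ObjectForm("s", "swissprot_entries(e)",
               printed=["protein_code", "protein_name"]),
])
```

Dropping a copied value into the query, as in the copy-and-drop feature, would under this model amount to appending one more `(attribute, box text)` pair to the relevant form and rebuilding the text.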

9.2.3 Scalability


When a new external resource is added to the federation, the contents of that resource must be described in terms of entities, attributes, and relationships, the basic concepts in the FDM. For example, entity classes and attributes are used to describe the tables and columns in a relational database, the classes and tags in an AceDB database, and the databanks and fields accessed by SRS. The integration schema has to be extended to include concepts in the new resource, and mapping functions to be used by the ICode rewriter must be generated. Because the mediator has a modular architecture in which query transformation is done in stages, the only new software components that might have to be written are code generators and wrappers (the components shown with dark borders in Figure 9.6). All other components within the mediator are generic.
However, the federation administrator might want to add declarative rewrite rules that can be used by the rewriter to improve the performance of queries involving the new resource. Code generators for new data sources can be written in one or two days when using existing code generators as a guide. In general, as the expressions being evaluated obey the principles of substitutability and referential integrity, expressions that match patterns in rewrite rules can be substituted with other expressions that have the same value. This means new mappings can be added without the risk of encountering special cases or some arbitrary limit on the complexity of expressions, as can happen with SQL.

A federated database system and a mediator system are similar architectures that differ in terms of how easily one can attach new database sources. In a federated architecture, the integration schema is relatively fixed and designed with particular database sources in mind.
Extra databases can be added, with some effort, by the database administrator. A mediator tries to integrate new databases available at their given Web addresses on the basis of descriptions provided by the end user. The whole process is more dynamic. When dealing with a new source, a good mediator will try to spot heuristic optimization rules it can re-use from similar databases it knows about. In general, it is more intelligent and less reliant on human intervention. A long-term goal is that, as a suite of code generators and wrappers is added to the P/FDM Mediator, it will become easy to add new resources by presenting the mediator with new remote schemas and specifying which code generators and wrappers should be used.
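The substitutability property that makes declarative rewrite rules safe to add can be illustrated with a small term-rewriting sketch. Expressions are modeled as nested tuples, each rule either returns an equivalent expression or `None`, and rules are applied repeatedly until nothing changes, so that one rule can enable another without the combination being coded by hand. The expression encoding and both rules are assumptions made for this example; they are not the mediator's ICode or its actual rule language.

```python
from typing import Callable, List, Optional

Expr = object  # a string leaf or a nested tuple such as ("eq", "ec_number", "1.1.1.1")
Rule = Callable[[Expr], Optional[Expr]]


def unwrap_single_and(expr: Expr) -> Optional[Expr]:
    """("and", p) has the same value as p alone."""
    if isinstance(expr, tuple) and len(expr) == 2 and expr[0] == "and":
        return expr[1]
    return None


def scan_filter_to_index(expr: Expr) -> Optional[Expr]:
    """Replace iteration plus an equality test by an indexed search."""
    if (isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "filter"
            and isinstance(expr[1], tuple) and len(expr[1]) == 2
            and expr[1][0] == "scan"
            and isinstance(expr[2], tuple) and len(expr[2]) == 3
            and expr[2][0] == "eq"):
        return ("index_lookup", expr[1][1], expr[2][1], expr[2][2])
    return None


def rewrite_once(expr: Expr, rules: List[Rule]) -> Expr:
    """Rewrite sub-expressions bottom-up, then try each rule at this node."""
    if isinstance(expr, tuple):
        expr = tuple(rewrite_once(sub, rules) for sub in expr)
    for rule in rules:
        replacement = rule(expr)
        if replacement is not None:
            return replacement
    return expr


def rewrite(expr: Expr, rules: List[Rule], max_passes: int = 100) -> Expr:
    """Apply the rules until a fixed point is reached."""
    for _ in range(max_passes):
        rewritten = rewrite_once(expr, rules)
        if rewritten == expr:
            return expr
        expr = rewritten
    raise RuntimeError("rewrite rules did not reach a fixed point")


query = ("filter", ("scan", "enzyme"),
         ("and", ("eq", "ec_number", "1.1.1.1")))
optimized = rewrite(query, [unwrap_single_and, scan_filter_to_index])
```

Neither rule mentions the other, yet removing the redundant `and` exposes the pattern that the index rule needs. Adding a rule for a new resource is therefore a matter of appending one more function to the list.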

9.3 CONCLUSIONS

The P/FDM Mediator is a computer program that supports transparent and integrated access to different data collections and resources. Ad hoc queries can be asked against an integration schema, which is a pre-defined collection of entity classes, attributes, and relationships. The integration schema can be extended at any time by adding declarative descriptions of new data resources to the mediator's set-up files. Rather than building a data warehouse, the developed system brings data from remote sites on demand. The P/FDM Mediator arranges for this to happen without further human intervention. The presented approach preserves the autonomy of the external data resources and makes use of existing search capabilities implemented in those systems.

Bioinformatics faces a "crisis of data integration" [1], which is best addressed through federations that allow their constituent databases to develop autonomously and independently.
The existence of schemas at different levels, as shown in Section 9.1.3, makes apparent the requirements for query transformation in a mediator in a database federation. The transformations in the system are all based on well-defined mathematical theory using function composition, as pioneered by Shipman [2] and Buneman [12]. This results in a modular design for the mediator that enables the federation to evolve incrementally.

ACKNOWLEDGMENT


The prototype P/FDM Mediator described in this chapter was implemented by Nicos Angelopoulos. This work was supported by a grant from the BBSRC/EPSRC Joint Programme in Bioinformatics (Grant Ref. 1/BIF06716).

REFERENCES

[1] R. J. Robbins. "Bioinformatics: Essential Infrastructure for Global Biology." Journal of Computational Biology 3, no. 3 (1996): 465-478.
[2] D. W. Shipman. "The Functional Data Model and the Data Language DAPLEX." ACM Transactions on Database Systems 6, no. 1 (1981): 140-173.
[3] B. Rieche and K. R. Dittrich. "A Federated DBMS-Based Integrated Environment for Molecular Biology." In Proceedings of the Seventh International Conference on Scientific and Statistical Database Management, 118-127. Los Alamitos, CA: IEEE Computer Society, 1994.
[4] G. Wiederhold. "Mediators in the Architecture of Future Information Systems." IEEE Computer 25, no. 3 (1992): 38-49.
[5] P. D. Karp. "A Vision of DB Interoperation." In Proceedings of the Second Meeting on the Interconnection of Molecular Biology Databases, Cambridge, UK. July 20-22, 1995.
[6] P. M. D. Gray, K. G. Kulkarni, and N. W. Paton. Object-Oriented Databases: A Semantic Data Model Approach. Hemel Hempstead, Hertfordshire: Prentice Hall International, 1992.
[7] G. J. L. Kemp, Z. Jiao, P. M. D. Gray, et al. "Combining Computation with Database Access in Biomolecular Computing." In Applications of Databases: Proceedings of the First International Conference, edited by W. Litwin and T. Risch, 317-335. Heidelberg, Germany: Springer-Verlag, 1994.
[8] L. Kerschberg and J. E. S. Pacheco. A Functional Data Model. Rio de Janeiro, Brazil: Department of Informatics, Universidade Catolica Rio de Janeiro, 1976.
[9] K. G. Kulkarni and M. P. Atkinson. "EFDM: Extended Functional Data Model." The Computer Journal 29, no. 1 (1986): 38-46.
[10] M. Atkinson, F. Bancilhon, D. DeWitt, et al. "Deductive and Object-Oriented Databases." In Proceedings of the 1st International Conference on Deductive and Object-Oriented Databases (DOOD '89), edited by W. Kim, J-M. Nicholas, and S. Nishio, 223-240. Amsterdam, The Netherlands: North-Holland, 1990.
[11] P. M. D. Gray, D. S. Moffat, and N. W. Paton. "A Prolog Interface to a Functional Data Model Database." In Advances in Database Technology-EDBT '88, edited by J. W. Smith, S. Ceri, and M. Missikoff, 34-48. Heidelberg, Germany: Springer-Verlag, 1988.
[12] P. Buneman and R. E. Frankel. "FQL: A Functional Query Language." In Proceedings of the ACM SIGMOD International Conference on Management of Data, edited by P. A. Bernstein, 52-58. Boston: ACM Press, 1979.
[13] D. A. Turner. "Miranda: A Non-Strict Functional Language with Polymorphic Types." In Functional Programming Languages and Computer Architecture, Lecture Notes in Computing Science, Vol. 201, edited by J-P. Jouannaud, 1-16. Heidelberg, Germany: Springer-Verlag, 1985.
[14] T. A. Landers and R. L. Rosenberg. "An Overview of MULTIBASE." In Distributed Data Bases, Proceedings of the 2nd International Symposium on Distributed Data Bases, edited by H-J. Schneider, 153-184. Amsterdam, The Netherlands: North-Holland, 1982.
[15] N. W. Paton and P. M. D. Gray. "Optimising and Executing Daplex Queries Using Prolog." The Computer Journal 33, no. 6 (1990): 547-555.
[16] ANSI Standards Planning and Requirements Committee. "Interim Report of the ANSI/X3/SPARC Study Group on Data Base Management Systems." FDT-Bulletin of ACM SIGMOD 7, no. 2 (1975): 1-140.
[17] T. Ezold and P. Argos. "SRS Indexing and Retrieval Tool for Flat File Data Libraries." Computer Applications in the Biosciences 9, no. 1 (1993): 49-57.
[18] S. Grufman, F. Samson, S. M. Embury, et al. "Distributing Semantic Constraints Between Heterogeneous Databases." In Proceedings of the 13th Annual Conference on Data Engineering, edited by A. Gray and P-A. Larson, 33-42. New York: IEEE Computer Society Press, 1997.
[19] A. P. Sheth and J. A. Larson. "Federated Database Systems for Managing Distributed, Heterogeneous and Autonomous Databases." ACM Computing Surveys 22, no. 3 (1990): 183-236.
[20] G. J. L. Kemp, P. M. D. Gray, and A. R. Sjöstedt. "Improving Federated Database Queries Using Declarative Rewrite Rules for Quantified Subqueries." Journal of Intelligent Information Systems 17, no. 2-3 (2001): 281-299.
[21] G. J. L. Kemp, C. J. Robertson, P. M. D. Gray, et al. "CORBA and XML: Design Choices for Database Federations." In Proceedings of the Seventeenth British National Conference on Databases, edited by B. Lings and K. Jeffery, 191-208. Heidelberg, Germany: Springer-Verlag, 2000.
[22] Z. Jiao and P. M. D. Gray. "Optimisation of Methods in a Navigational Query Language." In Proceedings of the Second International Conference on Deductive and Object-Oriented Databases, edited by C. Delobel, M. Kifer, and Y. Masunaga, 22-42. Heidelberg, Germany: Springer-Verlag, 1991.
[23] P. Buneman, L. Libkin, D. Suciu, et al. "Comprehension Syntax." SIGMOD Record 23, no. 1 (March 1994): 87-96.
[24] G. J. L. Kemp and P. M. D. Gray. "Finding Hydrophobic Microdomains Using an Object-Oriented Database." Computer Applications in the Biosciences 6, no. 4 (1990): 357-299.
[25] R. Neches, R. Fikes, T. Finin, et al. "Enabling Technology for Knowledge Sharing." Artificial Intelligence 12, no. 3 (1991): 36-56.
[26] R. J. Bayardo, B. Bohrer, R. S. Brice, et al. "InfoSleuth: Semantic Integration of Information in Open and Dynamic Environments." In SIGMOD 1997, Proceedings of the ACM SIGMOD International Conference on Very Large Data Bases, edited by J. Peckham, 195-206. New York: ACM Press, 1997.
[27] P. M. D. Gray, A. D. Preece, N. J. Fiddian, et al. "KRAFT: Knowledge Fusion from Distributed Databases and Knowledge Bases." In Proceedings of the 8th International Workshop on Database and Expert Systems Applications, edited by R. R. Wagner, 682-691. Los Alamitos, CA: IEEE Computer Society Press, 1997.
[28] G. J. L. Kemp, J. J. Iriarte, and P. M. D. Gray. "Efficient Access to FDM Objects Stored in a Relational Database." In Directions in Databases: Proceedings of the Twelfth British National Conference on Databases, edited by D. S. Bowers, 170-186. Heidelberg, Germany: Springer-Verlag, 1994.
[29] P. Mitra, G. Wiederhold, and M. Kersten. "A Graph-Oriented Model for Articulation of Ontology Interdependencies." In Advanced Database Technology-EDBT 2000, edited by C. Zaniolo, P. C. Lockerman, M. H. Scholl, et al., 86-100. Heidelberg, Germany: Springer-Verlag, 2000.
[30] P. Buneman, S. B. Davidson, K. Hart, et al. "A Data Transformation System for Biological Data Sources." In VLDB '95, Proceedings of the 21st International Conference on Very Large Data Bases, edited by U. Dayal, P. M. D. Gray, and S. Nishio, 158-169. San Francisco: Morgan Kaufmann, 1995.
[31] L. Wong. "Kleisli: Its Exchange Format, Supporting Tools and an Application in Protein Interaction Extraction." In Proceedings of the IEEE International Symposium on Bio-Informatics and Biomedical Engineering, 21-28. New York: IEEE Computer Society Press, 2000.
[32] N. W. Paton, R. Stevens, P. Baker, et al. "Query Processing in the TAMBIS Bioinformatics Source Integration System." In Proceedings of the 11th International Conference on Scientific and Statistical Database Management, 138-147. New York: IEEE Computer Society Press, 1999.
[33] L. M. Haas, P. Kodali, J. E. Rice, et al. "Integrating Life Sciences Data - With a Little Garlic." In IEEE International Symposium on Bio-Informatics and Biomedical Engineering, 5-12. New York: IEEE Computer Society Press, 2000.
[34] M. Jarke and J. Koch. "Range Nesting: A Fast Method to Evaluate Quantified Queries." In SIGMOD '83, Proceedings of the Annual Meeting, edited by D. J. DeWitt and G. Gandarin, 196-206. Boston: ACM Press, 1983.
[35] J. J. King. Query Optimisation by Semantic Reasoning. Ann Arbor, MI: University of Michigan Press, 1984.
[36] T. Risch, V. Josifovski, and T. Katchaounov. AMOS II Concepts, June 23, 2000, http://www.dis.uu.se/~udbl/amos/doc/amos_concepts.html.
[37] Z. G. Ives, D. Florescu, M. Friedman, et al. "An Adaptive Query Execution System for Data Integration." In SIGMOD 1999, Proceedings of the ACM SIGMOD International Conference on Management of Data, edited by A. Delis, C. Faloutsos, and S. Ghandeharizadeh, 299-310. Boston: ACM Press, 1999.
[38] I. Gil, P. M. D. Gray, and G. J. L. Kemp. "A Visual Interface and Navigator for the P/FDM Object Database." In 1999 User Interfaces to Data Intensive Systems, edited by N. W. Paton and T. Griffiths, 54-63. Los Alamitos, CA: IEEE Computer Society Press, 1999.

CHAPTER

10

Integration Challenges in Gene Expression Data Management
Victor M. Markowitz, John Campbell, I-Min A. Chen, Anthony Kosky, Krishna Palaniappan, and Thodoros Topaloglou

DNA microarrays have emerged as the leading technology for measuring gene expression, primarily because of their high throughput. A single microarray experiment provides measurements for the messenger RNA (mRNA) transcription level for tens of thousands of genes in parallel [1]. While this technology opens new opportunities for functional genomics and drug discovery applications, it also presents new bioinformatics and data management challenges arising from the need to capture, organize, interpret, and archive vast amounts of experimental data. Furthermore, to support meaningful biological reasoning, gene expression data need to be analyzed in the context of rich sample and gene annotations.

GeneExpress is a data management system that contains quantitative gene expression information for thousands of normal and diseased samples and for experimental animal model and cellular tissues generated under a variety of treatment conditions [2]. Initially the GeneExpress system was developed with the goal of supporting effective exploration, analysis, and management of gene expression data generated at Gene Logic using the Affymetrix GeneChip platform [3], integrated with comprehensive information on samples, clinical profiles, and rich gene annotations. Building such a system required resolving various data integration problems to associate gene expression data with sample data and gene annotations. A subsequent goal for the GeneExpress system was to provide support for incorporating gene expression data generated outside of Gene Logic. Addressing this additional goal required the resolution of various levels of syntactic and semantic heterogeneity of sample data, gene annotations, and gene expression data, which were often generated under different experimental conditions. These goals have been addressed using a data warehousing methodology adapted to the special requirements of the gene expression domain [4].


This chapter discusses the challenges associated with data integration in the context of a gene expression data management system and describes how the GeneExpress system addresses these challenges. Section 10.1 provides an overview of the area of gene expression data management. Section 10.2 provides a brief description of Gene Logic's GeneExpress system. Section 10.3 discusses the key semantic problems associated with integrating gene expression and related data and how they are addressed in the context of GeneExpress. Section 10.4 describes how third-party gene expression data can be integrated into GeneExpress. Section 10.5 concludes the chapter with a summary and observations.

10.1 GENE EXPRESSION DATA MANAGEMENT: BACKGROUND

This section briefly reviews the gene expression data application. It first discusses the data spaces that need to be modeled by a gene expression data management system, and then the initiatives to establish standards for gene expression and related data.

10.1.1 Gene Expression Data Spaces


Gene expression systems measure the mRNA transcription level of protein-coding genes in a cell. The mRNA mix used in gene expression experiments is derived from biomaterials (samples) such as tissues and cell lines. A microarray typically is designed to detect thousands of specific target sequences associated with these genes through hybridization. The reported measurements are meaningful only when something is known about the samples and the target sequences and their associated genes. The first goal of gene expression data management is to integrate expression data with sample and gene annotations and to allow users to use these annotations to explore, analyze, and interpret expression data [4, 5]. Typically, a gene expression data management system integrates data from three different data spaces: sample annotations, gene annotations, and gene expression measurements, each of which is described in the following sections.
Biological Sample Data Space

The main object in the sample data space is the sample representing the biological material that is the focus of an experiment. Samples originate from a variety of sources with different data standards and handling protocols. Annotations associated with each sample should address its physical features and quality, as well as the accuracy and extent of the information recorded. Ultimately, sample data are recorded in the sample data space of a gene expression system. A sample can be of tissue, cell, or processed RNA type, and it originates from a donor organism of a given species (e.g., human, mouse, rat). Attributes associated with samples describe their nature and condition (e.g., organ site, diagnosis, disease, stage of disease), as well as donor information (e.g., demographic and clinical record for human donors or strain, genetic modification, and treatment information for animal donors). Samples are commonly organized in groups that can be further grouped into studies or projects, such as time/dose studies. Information on how samples in such groups are related to one another is therefore a necessary annotation for the sample data space.
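The sample data space described above can be pictured as a small object model. The following Python sketch is illustrative only: the class and field names (Donor, Sample, Study, related_samples) are assumptions made for this example and are not taken from the GeneExpress schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Donor:
    species: str  # e.g., "human", "mouse", "rat"
    # Demographic/clinical record (human) or strain/treatment info (animal).
    attributes: dict = field(default_factory=dict)


@dataclass
class Sample:
    sample_id: str
    sample_type: str  # "tissue", "cell", or "processed RNA"
    donor: Donor
    organ_site: Optional[str] = None
    diagnosis: Optional[str] = None


@dataclass
class Study:
    name: str
    # Maps a group label (e.g., a time point or dose) to the samples in it.
    groups: dict = field(default_factory=dict)

    def add(self, group_label: str, sample: Sample) -> None:
        self.groups.setdefault(group_label, []).append(sample)

    def related_samples(self, sample_id: str) -> list:
        """Return the IDs of the other samples in the same group."""
        for members in self.groups.values():
            ids = [s.sample_id for s in members]
            if sample_id in ids:
                return [i for i in ids if i != sample_id]
        return []
```

Keeping the group membership on the study, rather than on each sample, is one possible way to record how samples in a time/dose study relate to one another.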

Gene Annotations Data Space


Gene annotations help to associate the expression data reported for sequence fragments on a microarray to biological entities such as genes and proteins. The main problem here is that sequence annotations, and annotations of the function of known genes, can change over time as the availability of more sequence, better computational tools, and new research lead to better gene prediction results. Furthermore, the sources for gene annotations are usually primary or consolidated databases that are heterogeneous and may contain inconsistent data. Consequently, the effort of keeping up-to-date gene annotation data for sequence fragments on microarrays combines the complexities of database integration with the ongoing research in the field of gene identification.

The main object in the gene annotation data space is the gene fragment, representing an entity for which the expression level is being determined. For microarray technologies, gene fragments are associated with a specific microarray type, such as a GeneChip human probe array (e.g., HG_U95A). The annotations associated with a gene fragment describe its biological context, including its associated primary expressed sequence tag (EST) sequence entry in GenBank; membership in a gene-oriented sequence cluster; association with a known gene (i.e., a gene that is recorded in an official nomenclature catalog, such as the Human Gene Nomenclature Database [HUGO] [6]); functional characterization, such as Gene Ontology (GO) annotations; and association to known metabolic and signaling pathways.
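Because annotations change over time, one way to model a gene fragment is to keep a history of annotation snapshots rather than a single record. The sketch below assumes this design; the names FragmentAnnotation, GeneFragment, release, and gene_changed are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class FragmentAnnotation:
    release: str  # label of the annotation release (hypothetical)
    genbank_accession: Optional[str] = None
    sequence_cluster: Optional[str] = None
    known_gene: Optional[str] = None  # symbol from a nomenclature catalog
    go_terms: tuple = ()
    pathways: tuple = ()


@dataclass
class GeneFragment:
    fragment_id: str
    array_type: str  # e.g., "HG_U95A"
    history: list = field(default_factory=list)  # oldest snapshot first

    def annotate(self, annotation: FragmentAnnotation) -> None:
        self.history.append(annotation)

    def current(self) -> Optional[FragmentAnnotation]:
        return self.history[-1] if self.history else None

    def gene_changed(self) -> bool:
        """True if the known-gene assignment differs in the last two releases."""
        if len(self.history) < 2:
            return False
        return self.history[-1].known_gene != self.history[-2].known_gene
```

A check such as gene_changed could flag fragments whose interpretation shifted between annotation releases, so that earlier analyses can be revisited.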
Gene Expression Measurement Data Space

Gene expression microarray systems are broadly classified into single channel and two channel systems. A single channel system takes a single sample of biological material and provides absolute measures of gene expression for that sample, while a two channel system takes a pair of samples and provides measurements of the difference in relative gene expression between them. Single channel systems are best represented by the Affymetrix GeneChip platform [7]. This chapter focuses on the management of gene expression data generated using the GeneChip platform. Note, however, that most data management and integration issues discussed in this chapter apply to gene expression data in general, regardless of the underlying technology platform.

Typically, data generated by a microarray system can be classified into three data types, each representing a different level of abstraction. This hierarchy of data types is common, with slight differences, to all microarray platforms and consists of:
1. Raw data consisting of binary image files generated by scanners

2. Grid or probe intensity data consisting of values associated with each probe or oligonucleotide sequence examined on a microarray

3. Gene expression estimates generated by combining data on related probes on a microarray

Each data type may have multiple data formats or representations associated with it, such as text or binary file-based formats or database representations. The transformation between data types is carried out by platform-specific algorithms. It is not uncommon to use more than one algorithm to transform data from one data type to the next [8, 9]. The following paragraphs briefly describe the hierarchy of data types in the context of the GeneChip platform.

Affymetrix's GeneChip microarrays (also called probe arrays) are tiled with oligonucleotide sequences, each 25 base-pairs in length, known as probes. Each probe is designed to hybridize to a known mRNA fragment representing a target gene or EST. Probes are grouped into probe pairs, each of which consists of a perfect-match (PM) probe and a mismatch (MM) probe, with the MM probe being created from the PM probe by changing the middle (13th) base to measure nonspecific binding. Each target gene or EST is represented by a probe set consisting of up to 20 probe pairs.
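The relationship between a PM probe and its MM partner can be sketched in a few lines of Python. The text states only that the middle (13th) base is changed; replacing it with its complementary base is an assumption made here for illustration, as are the function names.

```python
# Assumption: the middle base is replaced by its complement (A<->T, C<->G).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
PROBE_LENGTH = 25
MIDDLE_INDEX = 12  # the 13th base, counting from zero


def mismatch_probe(pm: str) -> str:
    """Derive a mismatch probe from a 25-base perfect-match probe."""
    if len(pm) != PROBE_LENGTH:
        raise ValueError("a probe must be exactly 25 bases long")
    middle = pm[MIDDLE_INDEX]
    if middle not in COMPLEMENT:
        raise ValueError("unexpected base: " + middle)
    return pm[:MIDDLE_INDEX] + COMPLEMENT[middle] + pm[MIDDLE_INDEX + 1:]


def probe_pair(pm: str) -> tuple:
    """Return the (PM, MM) pair for one perfect-match sequence."""
    return (pm, mismatch_probe(pm))
```

A probe set would then be a list of up to 20 such pairs for one target gene or EST.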
A GeneChip probe array experiment involves preparing the RNA sample, carrying out the probe array experiment (hybridization, washing, staining), and scanning the probe array [7]. The scanning process generates a file containing an image of the probe array, which constitutes the raw data.

The scanned images are interpreted using methods such as the GeneChip microarray suite (MAS) analysis algorithms. The MAS cell averaging algorithm averages pixel intensities and computes cell-level intensities in which each cell represents one probe on the probe array. The output from this process is a file containing the estimated intensities for each probe on the probe array, which constitutes the probe data. These intensities indicate the amount of hybridization that occurred for each oligonucleotide sequence on the array.
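The step from raw data to probe data can be illustrated with a simplified cell-averaging sketch. It assumes a scanned image given as a rectangular grid of pixel intensities in which every cell covers the same number of pixels, and it takes a plain mean over each cell. This is not the MAS algorithm itself, whose details are not given here; the function name and parameters are hypothetical.

```python
from statistics import mean


def cell_intensities(image, cell_height, cell_width):
    """Average pixel intensities over fixed-size cells of a scanned image.

    image is a list of equal-length rows of pixel intensities; each cell
    stands for one probe on the array.
    """
    rows, cols = len(image), len(image[0])
    if rows % cell_height or cols % cell_width:
        raise ValueError("image size must be a multiple of the cell size")
    grid = []
    for top in range(0, rows, cell_height):
        grid_row = []
        for left in range(0, cols, cell_width):
            pixels = [image[r][c]
                      for r in range(top, top + cell_height)
                      for c in range(left, left + cell_width)]
            grid_row.append(mean(pixels))
        grid.append(grid_row)
    return grid
```

The result has one value per cell, which corresponds to the per-probe intensities that make up the probe data.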


Probe intensity files can be further analyzed with methods such as the MAS chip analysis algorithms, which generate gene expression estimates by summarizing the intensities of each probe set that corresponds to a gene or EST fragment targeted by the probe array. Alternative gene expression estimates may be based on single or multiple (e.g., replicate) experiments.

The GeneChip Laboratory Information Management System (LIMS) provides support for transforming data between the different data types and for loading the gene expression estimates into a relational database based on the Affymetrix Analysis Data Model (AADM) [10].

The different data types and their associated formats result in files or data structures of different sizes. For example, for an experiment using an HG_U133 GeneChip probe array, the raw image file is around 45 megabytes in size, the probe intensity data file is around 12 megabytes, and the summarized gene expression data consists of roughly 22,000 values.
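
The hierarchy of data types described above (raw image, probe data, gene expression estimate) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the pixel values are invented, the function names are hypothetical, and the mean PM minus MM difference is used only as a simplified stand-in for the MAS chip analysis algorithms, whose actual computations are not described here.

```python
from statistics import mean


def cell_intensity(pixels):
    """Average the pixel intensities that belong to one cell (one probe)."""
    return mean(pixels)


def probe_set_estimate(probe_pairs):
    """Summarize a probe set as the mean PM - MM difference.

    Simplified stand-in for illustration; not the MAS algorithm itself.
    """
    return mean([pm - mm for pm, mm in probe_pairs])


# Hypothetical raw data: pixel intensities for the PM and MM cells of a
# probe set with three probe pairs (real probe sets have up to 20 pairs).
raw_pixels = [
    ([120, 130, 125], [40, 44, 42]),
    ([200, 210, 205], [60, 58, 62]),
    ([90, 94, 92], [50, 50, 50]),
]

# Raw data -> probe data: one intensity per cell.
probe_pairs = [(cell_intensity(pm), cell_intensity(mm)) for pm, mm in raw_pixels]

# Probe data -> gene expression estimate: one value per probe set.
estimate = probe_set_estimate(probe_pairs)
print(probe_pairs)
print(estimate)
```

Each step reduces the volume of data, which is consistent with the file sizes quoted above: many pixels collapse into one cell intensity, and up to 20 probe pairs collapse into one expression value.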

10.1.2 Standards: Benefits and Limitations


Effective exploration of microarray data has been hindered by the variety and heterogeneity of the data formats used. This problem has been recognized by several organizations, such as the European Bioinformatics Institute (EBI), the U.S. National Center for Biotechnology Information (NCBI), and the National Center for Genome Resources (NCGR), in their efforts to establish public data repositories for gene expression information. Microarray manufacturers have also proposed formats, such as the AADM used for the GeneChip LIMS relational database [10], to facilitate data exchange between different sources of gene expression data and the development of gene expression analysis packages. Different standardization efforts have been consolidated by the Microarray Gene Expression Database Group (MGED), a consortium of academic and commercial organizations with the shared goal of defining standard formats that will allow gene expression data repositories to share and exchange data.
MGED has recently published Minimum Information About a Microarray Experiment (MIAME), a recommendation for the minimum information required for a microarray experiment [5], and has developed a data exchange format (Microarray Gene Expression Markup Language [MAGE-ML]) and object model (Microarray Gene Expression Object Model [MAGE-OM]) for microarray experiment data.

Existing definitions and proposed standards for gene expression data provide useful guidelines for organizing expression data in systems such as GeneExpress. Adequate standards for the representation of sample and gene annotations, however, have not yet been established. MIAME's recommended standards for gene annotation for the fragments on a microarray are minimal to simplify compliance. For example, the suggested annotations for probes on a microarray consist of their identity, sequence, and the associated composite target sequence, along with gene symbol or reference to a model organism database. However, in-depth gene expression data analysis requires access to functional characteristics of these target gene fragments to interpret data analysis results.

Similarly, MIAME's minimum required sample annotations are not sufficient to establish the context needed for comprehensive gene expression data analysis. Clinical history, morphology, and pathology for samples are needed to interpret gene expression data. For example, it is necessary to know the precise stage of a tumor or medications taken during acquisition of a cancerous sample to interpret expression measurements for the sample.

For sample data, standardization involves establishing controlled vocabularies of terms for specific data domains, such as the Systematized Nomenclature of Medicine (SNOMED) [11] for anatomy or diseases.
These efforts are usually sponsored by professional organizations within a specific field (e.g., SNOMED is supported by the College of American Pathologists) and are not easily accessible to academic organizations because of their associated costs.

For gene annotations, the most notable standardization effort is the development of the Gene Ontology (GO) by the GO Consortium [12]. The goal of GO is to provide a dynamic controlled vocabulary to describe the role of genes and gene products in terms of molecular function, biological process, and cellular components.

Data exchange formats or standards emphasize the syntactic aspects of expression data and, to a lesser degree, the meaning of the data in cases where the representation is well documented. However, these formats do not address the semantic issues regarding the comparability (or compatibility) of gene expression data. Data comparability is a prerequisite for analyzing expression data from multiple experiments or multiple sites together and is discussed in Section 10.3.

10.2 THE GENEEXPRESS SYSTEM


Gene Logic's GeneExpress system provides support for managing expression data generated using the Affymetrix GeneChip platform in a high-throughput production environment. Sample, gene annotation, and gene expression data are collected from separate data sources: sample data are collected and managed using a sample data management system; gene annotations are acquired from a variety of public and private genome databases and integrated into a gene annotation database; and the main source for gene expression data is an Affymetrix GeneChip LIMS database. GeneExpress was built using data warehousing and online analytical processing (OLAP) concepts adapted to the gene expression data domain [4].


10.2.1 GeneExpress System Components


The GeneExpress data store consists of the GeneExpress Data Warehouse (GXDW). GXDW is made up of component databases containing sample, gene annotation, and gene expression data and process information specific to the generation and analysis of the expression data [13].

The gene expression data in GXDW is represented by a three-dimensional array with expression values indexed by gene fragments (identified by their target sequence and the microarray type), samples, and algorithm or measurement type. This data structure is implemented by the Gene Expression Array (GXA) as a collection of matrices, each associated with a particular GeneChip probe array type (e.g., HG_U95A) and measurement type (e.g., a version of the MAS algorithm). Each matrix has axes representing samples and gene fragments. The GXA provides a basis for the GeneExpress Analysis Engine, which implements various analysis methods in a highly efficient manner.
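
The GXA layout described above, one samples-by-fragments matrix per probe array type and measurement type, can be sketched as follows. This is an illustrative toy, not Gene Logic's implementation: the class name, method names, and identifiers are all hypothetical, and a production system would use a far more compact storage format.

```python
class GeneExpressionArray:
    """Toy GXA: one samples-by-fragments matrix per (array type, measurement type)."""

    def __init__(self):
        # Maps (array_type, measurement) -> matrix plus its axis indexes.
        self._matrices = {}

    def add_matrix(self, array_type, measurement, samples, fragments):
        """Create an empty matrix for one probe array type and measurement type."""
        self._matrices[(array_type, measurement)] = {
            "samples": {s: i for i, s in enumerate(samples)},
            "fragments": {f: j for j, f in enumerate(fragments)},
            "values": [[None] * len(fragments) for _ in samples],
        }

    def put(self, array_type, measurement, sample, fragment, value):
        """Store one expression value at (sample, fragment)."""
        m = self._matrices[(array_type, measurement)]
        m["values"][m["samples"][sample]][m["fragments"][fragment]] = value

    def get(self, array_type, measurement, sample, fragment):
        """Look up one expression value; None if nothing was stored."""
        m = self._matrices[(array_type, measurement)]
        return m["values"][m["samples"][sample]][m["fragments"][fragment]]


# Hypothetical sample and fragment identifiers.
gxa = GeneExpressionArray()
gxa.add_matrix("HG_U95A", "MAS 4.0",
               samples=["sample_1", "sample_2"],
               fragments=["fragment_a", "fragment_b"])
gxa.put("HG_U95A", "MAS 4.0", "sample_2", "fragment_a", 812.5)
print(gxa.get("HG_U95A", "MAS 4.0", "sample_2", "fragment_a"))
```

Keying the matrices by array type and measurement type keeps values that are not comparable with one another in physically separate structures, so supporting a new array version or algorithm amounts to adding a matrix.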

The GXDW, GXA, and Analysis Engine applications reside on a GeneExpress server. The server also hosts the Workspace File System, which allows users to store analysis results and share them throughout an organization.

Data in GXDW can be accessed using the GeneExpress Explorer application, which provides support for specifying gene and sample sets of interest and for analyzing gene expression data in the context of such gene and sample sets using a variety of analysis tools. GeneExpress Explorer is implemented as a client-side Java application, which runs on desktops and accesses GXDW through Java DataBase Connectivity (JDBC) and the analysis server through a CORBA layer. The main components and architecture of the GeneExpress system are illustrated in Figure 10.1. The results of gene expression analysis can be examined in the context of gene annotations, such as pathways, and can be exported to third-party tools, such as Spotfire, GeneSpring, or Partek, for visualization or further analysis. The gene expression and associated data can also be accessed directly through Application Programming Interfaces (APIs), which are available for a number of popular programming languages and platforms.

10.2.2 GeneExpress Deployment and Update Issues

In most cases, a GeneExpress system for a particular customer resides on a dedicated server. These machines are either deployed at the customer site and connected to the customer's internal network, or they are located at Gene Logic and accessed via a Virtual Private Network (VPN) mechanism. The data content of each GeneExpress system, involving both GXDW and the GXA matrices, is updated on a regular schedule (e.g., bi-monthly or quarterly).


[FIGURE 10.1 GeneExpress System Architecture. Client-side components: GeneExpress Explorer (GXX). Server-side components: Analysis Engine, GeneExpress Data Warehouse (GXDW), GeneExpress Array (GXA), Workspace File System.]

The sample, process, and gene expression data components of GXDW are built by extracting the data for the relevant samples from a master production version of GXDW, which is maintained at Gene Logic. The subset of samples provided to each GeneExpress customer is determined by the specific GeneExpress product license for the customer and will usually contain new samples that have been processed by Gene Logic since the last content update. The sample, process, and gene expression portions of the production GXDW are maintained in an incremental fashion, with new samples and experiments being added as they become available. Similarly, the set of GXA matrices for a particular customer is built by extracting the portions of the internal production GXA matrices that pertain to the samples being supplied to the customer.

The update mechanism for the gene annotation data component of GXDW is somewhat different. To keep abreast of current genomic data available in the public domain, it is necessary to refresh the gene annotation database periodically. The static portion of the data, such as gene fragments and array design, will not change unless new arrays are introduced. However, links to genes and all the public genomic objects may change to reflect new versions of their data sources. Because of the complex interdependencies of the various genomic data sources and the fact that many such data sources do not provide incremental updates, it is not feasible to update the gene annotation database in an incremental fashion. Instead, it must be completely reloaded each time it is refreshed. This process is usually performed on a quarterly basis because of the high overhead involved.
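
The contrast between the two update mechanisms, along with the per-customer extraction step, can be sketched as follows. This is a simplified illustration under assumed data shapes: the function names, record layout, and identifiers are hypothetical and are not taken from the GeneExpress system.

```python
def extract_customer_subset(production, licensed_samples):
    """Copy only the experiments whose sample is covered by the license."""
    return {exp_id: rec for exp_id, rec in production.items()
            if rec["sample"] in licensed_samples}


def incremental_update(store, new_records):
    """Add records that are not yet present; existing records are untouched."""
    added = [exp_id for exp_id in new_records if exp_id not in store]
    for exp_id in added:
        store[exp_id] = new_records[exp_id]
    return added


def full_reload(load_current_sources):
    """Discard the old annotation store and rebuild it from scratch."""
    return dict(load_current_sources())


# Sample and expression data: new experiments are appended as they arrive.
production = {"exp_1": {"sample": "s1"}, "exp_2": {"sample": "s2"}}
added = incremental_update(
    production, {"exp_2": {"sample": "s2"}, "exp_3": {"sample": "s3"}})
customer = extract_customer_subset(production, licensed_samples={"s1", "s3"})

# Gene annotations: links may change anywhere, so the whole store is rebuilt.
annotations = {"fragment_a": "gene_link_v1"}
annotations = full_reload(lambda: {"fragment_a": "gene_link_v2"})
print(added, sorted(customer), annotations)
```

The incremental path only ever inserts new keys, which is safe because existing experiment records do not change. The annotation path cannot rely on that property, which is why it replaces the store wholesale.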


10.3 MANAGING GENE EXPRESSION DATA: INTEGRATION CHALLENGES

This section presents some of the key challenges that arise from the management of gene expression and related data and briefly describes how each of these challenges is addressed in the GeneExpress system. Many of these challenges involve resolving semantic conflicts in gene expression, sample, and gene annotation data to integrate these data in a gene expression data management system. The data management problems caused by differences in microarray versions, differences in algorithms and normalizations, and non-biological variability in expression data are discussed first, followed by challenges regarding sample data and gene annotation data.

10.3.1 Gene Expression Data: Array Versions


Microarray platforms keep evolving with new probe array versions benefiting from technological improvements (e.g., higher density arrays and better probe selection) and advances in deciphering the genome. For example, Affymetrix recently released the HG_U133 series of the human probe arrays, which replaced the previous HG_U95 series of arrays. Running the same or similar samples on two series of probe arrays doubles the amount of data generated. However, in many cases, this is necessary because the newer arrays may produce expression data for target transcript sequences that are not available on the previous versions. In addition, there may be multiple versions of a probe array within a particular array series if problems are discovered with a particular array (e.g., HG_U95A versions 1 and 2 within the HG_U95 series). Comparing data generated using different series of probe arrays entails addressing a complex semantic data integration problem, with gene annotation data providing only partial support for resolving it.

In general, data generated using different probe array series or versions are not comparable, nor can they be transformed to make them comparable. This is in part due to the selection of target genes and ESTs for new probe arrays, which are often based on newly published biological information. Furthermore, representative probes for the target genes on the new probe arrays may be different due to availability of better representative sequences or improved techniques for choosing oligos within a representative sequence. New probe arrays may also be associated with improved analysis algorithms for determining summary intensity values, which will not be directly comparable with older algorithms. Consequently, in order to allow comparison of gene expression data generated for new samples using new probe arrays with data for existing samples, it is necessary to re-run the existing samples using the new probe array versions and algorithms.

On the other hand, it is often still valuable to maintain data generated using older probe arrays because they may provide the basis for existing analyses or prediction models, which users do not wish to re-create, because sample material may no longer be available for re-running the experiments using new arrays, or because samples may no longer be considered important enough to warrant re-running them using new arrays. Further, older probe arrays may include gene fragments of interest that have been omitted or do not have a good representation on the newer arrays.

GeneExpress supports multiple probe array sets for each species and allows users to choose a probe array set in addition to a species when performing analyses. Annotations associating homologous or related gene fragments on different versions of a probe array are provided in the gene annotation database of GXDW and can be used to map fragments on a given probe array to fragments on another version of the probe array. Direct comparisons of gene expression data based on different probe arrays are not supported.

The amount of data generated with multiple probe array versions is kept manageable, in part, because GXDW and GXA contain only the estimates for expression measures and gene-level summary data. Images and probe intensity files are archived on an enterprise network-accessed storage system and are not incorporated into standard GeneExpress systems. When a new algorithm or a new probe array version needs to be supported within GeneExpress, the information describing the probe array design must be entered in the gene expression data space and a new matrix included in the GXA.
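
The fragment mapping described above can be sketched as a lookup over annotation records that link related fragments across probe arrays. This is a minimal illustration: the table layout, the function name, and every fragment identifier are invented for the example and do not reflect the actual GXDW schema or real probe set identifiers.

```python
# Hypothetical annotation records: each links a fragment on one probe array
# to a homologous or related fragment on another. All identifiers are invented.
FRAGMENT_LINKS = [
    ("HG_U95A", "frag_95_001", "HG_U133A", "frag_133_017"),
    ("HG_U95A", "frag_95_002", "HG_U133A", "frag_133_040"),
    ("HG_U95A", "frag_95_002", "HG_U133A", "frag_133_041"),
]


def map_fragments(fragments, source_array, target_array, links=FRAGMENT_LINKS):
    """Return, for each source fragment, the related fragments on the target array.

    A fragment with no counterpart maps to an empty list.
    """
    mapping = {fragment: [] for fragment in fragments}
    for src_array, src_frag, dst_array, dst_frag in links:
        if (src_array, dst_array) == (source_array, target_array) and src_frag in mapping:
            mapping[src_frag].append(dst_frag)
    return mapping


result = map_fragments(
    ["frag_95_001", "frag_95_002", "frag_95_003"], "HG_U95A", "HG_U133A")
print(result)
```

The mapping is not one-to-one: a fragment may have several counterparts or none. It identifies which fragments correspond, but it does not make their expression values comparable, which is why direct cross-array comparisons remain unsupported.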

10.3.2 Gene Expression Data: Algorithms and Normalization

Different algorithms can be applied to generate gene expression data at different levels including image, probe level, and gene expression estimate data. For example, several alternative methods have recently been developed to estimate expression measures from probe data [8, 9] in addition to Affymetrix's GeneChip MAS algorithms. For GeneChip, the MAS 5.0 algorithm has recently replaced the MAS 4.0 algorithm and is required for analyzing the data generated with the newer versions of probe arrays. To take advantage of a new or alternative algorithm, it is necessary to re-analyze raw or probe data and generate new estimates of gene expression. It is important to note that expression estimates generated by different algorithms are not directly comparable. Furthermore, some algorithms depend on certain parameters that may also affect the generated expression estimates. In GeneExpress, a number of factors are recorded that may determine the comparability of expression data, including the following.
1. The algorithms employed to generate expression estimates, namely MAS 4.0 (employed for all probe arrays through the end of 2001) or MAS 5.0 (required for the new HG-U133 probe arrays and optional for other probe arrays) are recorded. Data generated using different algorithms are not comparable.

2. Scaling factors used to reduce discrepancies caused by sample preparation or probe array lot variability are also recorded. Data generated using different scaling factors are transformed to a common factor using straightforward multiplication.

3. Normalizations that may be applied to the values generated by the MAS or other algorithms are recorded: GeneExpress provides support for several normalization methods including Standard Curve Normalization, based on using spike-ins of known concentrations for certain (bacterial) genes when preparing samples for experiments [14]. Data must be generated using the same normalization to be comparable.
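The transformation to a common scaling factor mentioned in item 2 can be illustrated with a minimal sketch. It assumes that expression values scale linearly with the scaling factor, so that multiplying by the ratio of the common factor to the original factor is sufficient; the function name and the sample values are hypothetical and are not taken from GeneExpress.

```python
def rescale_to_common_factor(values, original_factor, common_factor):
    """Re-express values scaled with original_factor as if scaled with common_factor.

    Assumption: values are proportional to the scaling factor, so a single
    multiplication by common_factor / original_factor is enough.
    """
    if original_factor <= 0 or common_factor <= 0:
        raise ValueError("scaling factors must be positive")
    ratio = common_factor / original_factor
    return [v * ratio for v in values]


# Hypothetical expression estimates from two experiments with different factors.
exp_a = rescale_to_common_factor([120.0, 45.0, 300.0], original_factor=1.5, common_factor=3.0)
exp_b = rescale_to_common_factor([200.0, 80.0, 510.0], original_factor=3.0, common_factor=3.0)
```

After rescaling, both experiments are expressed relative to the same factor; this addresses only the scaling difference and does not make data from different algorithms or normalizations comparable.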

The Gene Expression analysis software, GeneExpress Explorer, ensures that data analyzed together have been generated using the same algorithms and normalization methods.
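A check of this kind can be sketched as follows. This is an illustration only, not the GeneExpress Explorer implementation: the record fields, class name, and function names are hypothetical, and comparability is reduced here to equality of the recorded algorithm and normalization method.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentRecord:
    # Hypothetical metadata fields; the actual GeneExpress schema is not shown in the text.
    experiment_id: str
    algorithm: str       # e.g. "MAS 4.0" or "MAS 5.0"
    normalization: str   # e.g. "Standard Curve Normalization" or "none"


def group_comparable(records):
    """Group experiment ids by (algorithm, normalization)."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec.algorithm, rec.normalization)].append(rec.experiment_id)
    return dict(groups)


def assert_comparable(records):
    """Raise ValueError if the records do not all share one algorithm and normalization."""
    groups = group_comparable(records)
    if len(groups) > 1:
        raise ValueError(f"experiments are not comparable: {sorted(groups)}")


records = [
    ExperimentRecord("E1", "MAS 4.0", "none"),
    ExperimentRecord("E2", "MAS 5.0", "none"),
    ExperimentRecord("E3", "MAS 5.0", "none"),
]
groups = group_comparable(records)
```

Here `assert_comparable(records)` would reject the full set because E1 was generated with a different algorithm, while `records[1:]` would pass.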

10.3.3 Gene Expression Data: Variability


Determining if gene expression data from two or more sources, such as different organizations or different sites within an organization, are comparable involves assessing non-biological differences that may affect analysis results. While gene-to-gene differences and sample-to-sample differences will be present in any set of experimental data, it is important to determine if there are other significant sources of variability. Many factors may contribute to such variability, including differences in the processes for obtaining and storing samples; differences in experimental practices and techniques; differences in adjustment of equipment, such as scanners; and so on.

Statistical methods are used to identify the magnitude and qualitative nature of non-biological variability. Initial exploration ideally involves samples collected from the same type of tissue (i.e., from the same type of organ and a similar location in the organ) and with the same pathology. In this case, data comparability can be assessed using the entire set of genes involved in the experiments. If samples are from the same type of tissue but with different pathologies, data comparability can be assessed using only genes that are not likely to be involved in the biological difference between the two groups of samples.

Exploratory statistical techniques employed for assessing the comparability of such samples include univariate (single experiment) and bivariate (pairs of experiments) analyses. One simple way to compare numerous univariate distributions is by displaying boxplots of the distributions side by side [15]. Such boxplots would indicate whether there are significant effects due to, for example, scaling or saturation, which would result in a shift in the distribution of expression values. Further exploration would involve assessing the reproducibility of expression values between experiments and the variability of expression values within each group of experiments and between groups of experiments.

Gene Logic limits non-biological sources of variability in the gene expression data it generates by following strictly controlled procedures and monitoring the quality control measures, both for running experiments and for the collection and preparation of samples. Once data are generated from experiments, quality control procedures based on statistical methods are used to ensure that data included in GeneExpress are not unduly affected by non-biological factors.
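The side-by-side boxplot comparison described in this section rests on the five-number summary of each experiment's distribution. The sketch below computes those summaries with the Python standard library and reports how far the median of one experiment is shifted relative to another. Expressing the shift in units of the pooled interquartile range is an illustrative choice made here, not a method stated in the text, and the function names and data are hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def median_shift(values_a, values_b):
    """Difference of medians (b minus a) in units of the pooled interquartile range."""
    _, q1_a, med_a, q3_a, _ = five_number_summary(values_a)
    _, q1_b, med_b, q3_b, _ = five_number_summary(values_b)
    pooled_iqr = ((q3_a - q1_a) + (q3_b - q1_b)) / 2
    if pooled_iqr == 0:
        raise ValueError("interquartile range is zero; shift is undefined")
    return (med_b - med_a) / pooled_iqr


# Hypothetical expression values for the same genes measured at two sites.
site_a = [1.0, 2.0, 3.0, 4.0, 5.0]
site_b = [3.0, 4.0, 5.0, 6.0, 7.0]
summary_a = five_number_summary(site_a)
shift = median_shift(site_a, site_b)
```

A shift near zero suggests similar distributions, whereas a large shift would prompt a closer look at scaling or saturation effects before the data are analyzed together.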

10.3.4 Sample Data


Accurate and consistent characterization of samples is essential in dealing with gene expression data because errors can have a substantial effect on expression analysis. It is not sufficient to base sample classification solely on annotations provided by the supplier because (1) samples may be mis-labeled (e.g., a diseased tissue being labeled as normal) and (2) there may be inconsistencies of classification due to the perspective of the pathologist or scientist who did the initial labeling. In the GeneExpress system, sample classification validation involves a careful review of the micro-section images by a pathologist and a thorough review of the clinical information accompanying each sample. Using SNOMED [11], the sample can be further characterized by topography, morphology, disease, and disease stage. The use of SNOMED and other controlled vocabularies in the GeneExpress system leads to a more robust classification of samples and provides a consistent representation of the data to users. However, even with an established controlled vocabulary such as SNOMED, the choice of terms to characterize a tissue type or disease may be ambiguous, so Gene Logic's pathologists use a consistent system of rules to determine which SNOMED terms to use.

10.3.5 Gene Annotations


Associating gene fragments with annotations from various public and private data sources provides the genomic context for interpreting gene expression data. Integrating such annotations into a data warehouse, as opposed to accessing the remote data sources through a federated database approach (see, for example, Eckman et al.'s article in Bioinformatics [16]), allows better representation of the semantics, powerful query expression, improved query performance, and also allows the quality of the data to be checked during the integration process (a similar conclusion is reached in an IBM Systems Journal article by Davidson et al. [17]).

Acquiring gene annotations from various data sources involves identifying important and reliable data sources, regularly querying these sources, parsing and interpreting the results, and establishing associations between related entities, such as the correlation of gene fragments and known genes.
Gene annotation or gene index databases are generally based on data collected from well-established and reliable public data sources. For example, gene fragments can be organized in non-redundant classes based on UniGene, and associated with known genes recorded in LocusLink. However, such data sources may not contain genomic information for all species: Some may provide good human and mouse gene annotations but not cover other species such as yeast or rat. In such cases, it is necessary either to find alternative data sources or to derive gene annotations for these species by finding homologous genes on better annotated species, such as human or mouse. The choice of which approach to use may change from time to time depending on the availability of annotations.

Gene fragments are further associated with gene products (e.g., protein data from Swiss-Prot), GO ontology terms, enzymes, metabolic and signaling pathways, chromosome maps, genomic contigs, and cross-species gene homologies. For genomic information such as pathways, there is no unique data source that satisfies all needs. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) provides good metabolic pathways, but it is not complete, while other public or private pathway data sources provide valuable additional data. Integration of similar or potentially overlapping data from two or more data sources requires the potential problems of redundant and inconsistent data to be addressed.

Genomic data sources are usually updated on different schedules, and the size of such data sources usually prohibits all versions of a data source from being loaded into a data warehouse. The gene annotation component of the GeneExpress data warehouse contains more than 5 gigabytes of data with only the most current version of data collected from various data sources.
However, storing data from only one version of a data source may lead to inconsistencies; one source may reference entities in a different version of another data source, which may have been updated or may no longer exist. Further, data sources may change their data structure or schema between versions (e.g., adding, removing, or modifying attributes or fields). In addition, keywords can be changed, and data files can be reorganized. Such changes necessitate revisions of data collection tools and reconciliation of data mappings.

The gene annotation component of GXDW provides an integrated view of the genomic data space, based on a unified schema that spans the various object spaces relevant to each of the public or private data sources used. One key feature of the schema is that it models the primary objects from the genomic data space in a generic way, though such objects originate from a wide variety of data sources. This minimizes the frequency of schema changes needed, even as the structures of the primary data sources evolve.

To keep up-to-date with the evolving gene annotation data sources, the gene annotation component of GXDW is refreshed periodically.
Each refresh involves extracting data from the latest versions of more than a dozen relevant public and private data sources, including UniGene, LocusLink, Swiss-Prot, Online Mendelian Inheritance in Man (OMIM), Enzyme, GO, KEGG, proprietary pathway databases, and model organism genome databases for organisms such as E. coli and yeast. During the integration and the assembly process, various data transformations and data cleansing operations are performed to resolve conflicts and correct data errors. Due to the rapidly evolving nature of these data sources, their content may change, both syntactically and semantically, between refreshes. Consequently, establishing cross-database links often requires manual curation to deal with orphans and links to retired entries. For example, LocusLink may refer to an Enzyme Commission (EC) number that is obsolete in the Enzyme catalog database, in which case it will be necessary to identify the correct, current EC number and update the data sources.

The data-warehousing strategy employed for constructing and maintaining GXDW supports various derived annotations such as cross-species homology relations between genes of different organisms and other objects. This is particularly valuable for comparative expression analysis between model organisms. The integration of genomic data sources helps uncover non-obvious relationships between genes, such as co-clustered gene fragments, and covers large parts of the genome -> transcriptome -> proteome -> metabolome information needed for gene expression analysis.

Due to the rapidly changing nature of the gene annotation data and data sources, it is important to search continually for new sources of gene annotation data and to re-evaluate existing data sources.
When a new data source is considered for GeneExpress, decisions must be made regarding whether the new data source can or will replace any existing data source, whether existing curation methods must be modified, whether the data model or schema needs to be revised, and how existing data should be associated with data from the new source.
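The curation of cross-database links discussed in this section (orphans and links to retired entries) can be supported by an automated first pass that separates links needing attention from those that are still valid. The following is a minimal sketch under stated assumptions: the function name, the table of known replacements, and all identifiers are hypothetical placeholders, not real LocusLink or EC entries, and the text does not describe how GXDW actually performs this step.

```python
def find_orphan_links(links, valid_targets, replacements=None):
    """Split cross-database links into valid, remapped, and orphaned links.

    links          -- iterable of (source_id, target_id) pairs
    valid_targets  -- set of target ids present in the current target release
    replacements   -- optional dict mapping retired target ids to current ones
    """
    replacements = replacements or {}
    valid, remapped, orphans = [], [], []
    for source_id, target_id in links:
        if target_id in valid_targets:
            valid.append((source_id, target_id))
        elif replacements.get(target_id) in valid_targets:
            # Retired entry with a known current replacement.
            remapped.append((source_id, replacements[target_id]))
        else:
            # No current entry found; left for manual curation.
            orphans.append((source_id, target_id))
    return valid, remapped, orphans


# Placeholder identifiers, for illustration only.
links = [("gene:A", "ec:X1"), ("gene:B", "ec:X2-old"), ("gene:C", "ec:X9")]
current_entries = {"ec:X1", "ec:X2"}
known_replacements = {"ec:X2-old": "ec:X2"}
valid, remapped, orphans = find_orphan_links(links, current_entries, known_replacements)
```

Only the links returned in `orphans` would then need to be reviewed by a curator; the remapped links still depend on the replacement table being correct.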

10.4 INTEGRATING THIRD-PARTY GENE EXPRESSION DATA IN GENEEXPRESS

The GeneExpress system was originally developed for the purpose of managing, exploring, and analyzing gene expression data generated at Gene Logic, primarily using the Affymetrix GeneChip platform. However, as the system has been adopted by various customers, some of which have their own internal efforts to generate gene expression data, the need to integrate customer data into the GXDW, so as to enable analysis of Gene Logic and customer gene expression data together, has become apparent.

To support the integration of customer sample and gene expression data into GeneExpress, the GX Connect tool has been developed at Gene Logic. GX Connect supports integration of gene expression data residing in an AADM-based GeneChip LIMS database and sample data conforming to the GeneExpress Sample Data Exchange Format into GXDW. When there is a need to integrate gene annotation data,¹ gene expression data represented using alternative formats, or data that, for other reasons, cannot be integrated using GX Connect, custom data integration tools are developed.

The following section discusses some of the challenges involved in integrating customer gene expression data with Gene Logic data and how these challenges have been addressed in the context of GeneExpress. First, data exchange formats that simplify the tasks of developing and maintaining mappings of customer data to GXDW are described. Next, some of the structural and semantic data transformation issues involved in developing such mappings are described. Finally, some of the data management issues associated with data loading and updating the Gene Logic content of a system containing both Gene Logic and customer data conclude the discussion.

¹ Data exchange formats and integration tools for gene annotation data are planned for future versions of GX Connect.

10.4.1 Data Exchange Formats

To avoid developing and maintaining multiple data migration and loading tools for each external data source considered for integration, data exchange formats serve as intermediate representations for data being transferred from various data sources to the GeneExpress data warehouse. The process of integrating external data is then divided into two phases: (1) structural transformations and semantic data mappings need to be applied to the external data to convert them into the data exchange formats; (2) the data in the data exchange formats need to be loaded into the warehouse. Note that developing and maintaining tools that convert data from sources into a well-defined data format, such as one based on extensible markup language (XML) or a similar notation, is generally easier than developing tools to transform data and populate a target data warehouse.

A number of formats have been proposed for gene expression data, as mentioned in Section 10.1.2. Because the focus was so far on integrating Affymetrix GeneChip expression data into GeneExpress, Affymetrix model AADM [10] was used as the data exchange format for gene expression data. In this format, expression data are associated with samples, gene fragments, analysis methods, and various experimental parameters.

For sample and clinical data, standard formats such as AADM have not yet been established. Consequently, data exchange formats that satisfy GX requirements were defined.
The central object class of the sample data exchange format is sample, representing the biological materials (e.g., tissue or cell-line) investigated using probe arrays (see Figure 10.2). Attributes associated with samples may describe their structural and morphological characteristics (e.g., organ site, diagnosis, disease, stage of disease). A sample is associated with a donor (e.g., a human or an animal model), which may in turn be qualified by various treatments and has additional attributes (e.g., clinical records and demographics for human donors or strain and genetic modification for animal donors). Each sample may be associated with several experiments (e.g., using different chip types). Samples may be grouped into studies, which may be further subdivided into study groups based on time or treatment parameters.

Various classes in the sample data exchange format include catch-all attributes that can accommodate any data, represented as tagged-value pairs, that do not otherwise fit the format.
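The classes described above can be pictured with a small object model. The sketch below is an interpretation of the prose, not the actual GeneExpress Sample Data Exchange Format: all field names are hypothetical, the `extra` dictionaries stand in for the catch-all tagged-value attributes, and nesting samples under study groups is one possible reading of the grouping described.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Treatment:
    description: str
    extra: Dict[str, str] = field(default_factory=dict)  # catch-all tagged values


@dataclass
class Donor:
    donor_id: str
    species: str
    treatments: List[Treatment] = field(default_factory=list)
    extra: Dict[str, str] = field(default_factory=dict)  # e.g. demographics or strain


@dataclass
class Experiment:
    experiment_id: str
    chip_type: str


@dataclass
class Sample:
    sample_id: str
    donor: Donor
    organ_site: Optional[str] = None
    diagnosis: Optional[str] = None
    experiments: List[Experiment] = field(default_factory=list)
    extra: Dict[str, str] = field(default_factory=dict)


@dataclass
class StudyGroup:
    name: str  # e.g. a time point or treatment parameter
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Study:
    name: str
    groups: List[StudyGroup] = field(default_factory=list)


# Hypothetical instance data.
donor = Donor(donor_id="D1", species="mouse", extra={"strain": "hypothetical-strain"})
sample = Sample(
    sample_id="S1",
    donor=donor,
    organ_site="liver",
    experiments=[Experiment("X1", "chip-type-A")],
    extra={"collection_note": "free text"},
)
study = Study(name="Study-1", groups=[StudyGroup(name="day-1", samples=[sample])])
```

Keeping unanticipated attributes in a tagged-value dictionary lets a source supply data the format did not foresee without requiring a change to the class definitions.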
FIGURE 10.2  Sample data exchange format (classes: Study, Study Group, Sample, Experiment, Donor, Treatment).

For data represented in the data exchange formats described previously, the GX Connect tool can be used to control and automate the process of data transfer into the GeneExpress warehouse [2]. This tool can be deployed at customer sites and be used to perform incremental (e.g., nightly) updates to GXDW. Consequently, the main task associated with integrating customer data becomes defining and implementing the semantic and structural transformations necessary to convert customer data into the data exchange formats, to prepare them for loading into GXDW.

10.4.2 Structural Data Transformation Issues


Data from individual data sources may be supplied in a flattened or un-normalized form, such as Microsoft Excel spreadsheets, so determining their structure and how to map them to the various data exchange formats is often a complex and involved task. First, it is necessary to determine the dependencies and correlations between individual data objects, which may be provided during the data export process or may need to be determined by searching for patterns in the data. In either case, it is necessary to confirm that the correlations found are consistent with the intended semantics of the data.

Data dependencies and correlations can be used to form an object model for the source data and to define a mapping from this model to the data exchange formats. Defining such a mapping requires structural conflicts between the models to be resolved, and in some cases, it may be necessary to choose between several possible solutions.
For example, the GeneExpress sample data exchange format classifies samples in a two-level hierarchy, with the levels represented by the classes Study and Study-Group. Sample data exported from an external data source might employ a three-level hierarchy, such as Project, Study, and Treatment. There are two possible ways to resolve such a difference in structure: either combine the exported Study and Treatment classes into the sample data exchange format Study-Group class and map the exported Project class to the sample data exchange format Study class, or map the exported Project and Study classes to the sample data exchange format Study class and the Treatment class to the Study-Group class.

In addition, it is necessary to deal with the evolution of databases and formats over time. Both the external data sources and the GeneExpress data warehouse may change either their structure or their controlled vocabularies or data formats to reflect changes in requirements. These changes require updates to the mappings.
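The two resolutions of the three-level hierarchy described in this section can be written as two small mapping functions. This is a sketch under stated assumptions: the function names, the dictionary keys, and the " / " separator used to combine two levels are all hypothetical and chosen only to make the difference between the options visible.

```python
def map_combining_lower_levels(project, study, treatment):
    """Option 1: Project -> Study; Study and Treatment together -> Study-Group."""
    return {"study": project, "study_group": f"{study} / {treatment}"}


def map_combining_upper_levels(project, study, treatment):
    """Option 2: Project and Study together -> Study; Treatment -> Study-Group."""
    return {"study": f"{project} / {study}", "study_group": treatment}


# Hypothetical exported record: (Project, Study, Treatment)
record = ("Tox screen", "Liver panel", "compound A, 24h")
option_1 = map_combining_lower_levels(*record)
option_2 = map_combining_upper_levels(*record)
```

Both functions preserve all three source values; they differ only in which target level absorbs the middle one, which is why the choice has to be made with the intended use of the data in mind. Keeping each mapping in one place also limits the work needed when a source or target structure changes.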

10.4.3  Semantic Data Mapping Issues


For gene expression data, the semantic challenges of integrating data from multiple sources are similar to those described in Section 10.3. Experimental data from different platforms are generally not comparable. Even if experiments are from


the same platform, expression values may have to be adjusted (e.g., to compensate for different scanner settings) before they can be compared. Moreover, expression data will not be comparable unless they are analyzed using the same version of a probe array and the same algorithm.

The mappings for sample data are usually the most difficult because there is no widely accepted standard for representing clinical data [18]. In the following sections some of the problems of mapping sample and gene annotation data are discussed.
Sample Data Mapping: Studies

Expression data are often organized into studies. For Gene Logic data, studies are used to group data that address specific questions about the effects of certain variables (such as treatment conditions, disease stage, time, and so on) on gene expression levels. Studies may be further divided into study groups, which represent samples grouped according to certain attributes, such as specific treatment conditions, time points, or disease stages.

The structure and nature of a study performed outside of Gene Logic may be conceptually different from studies defined in the context of GeneExpress data. To group customer samples into studies or study groups, it is necessary to identify an equivalent structure in the source sample data model, which may use different terminology or organize data along different principles.
If there is no appropriate concept in the source data model, rules can be incorporated into the mapping from the source data model into the sample data exchange format, allowing studies and study groups to be created based on other source data attributes, such as tissue type or treatment. Alternatively, customer data can be organized into studies and study groups manually by editing the data once they have been converted to the sample data exchange format.
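A rule of the kind described above, which derives studies and study groups from other source attributes, might look like the following sketch. The choice of tissue type as the study key and treatment as the study-group key, as well as the record layout, are assumptions made for illustration.

```python
from collections import defaultdict


def group_samples(samples, study_key, group_key):
    """Build {study: {study_group: [sample ids]}} from two source attributes."""
    studies = defaultdict(lambda: defaultdict(list))
    for s in samples:
        studies[s[study_key]][s[group_key]].append(s["id"])
    # Convert to plain dictionaries so the result is easy to inspect and serialize.
    return {study: dict(groups) for study, groups in studies.items()}


# Hypothetical source records with no explicit study concept.
samples = [
    {"id": "S1", "tissue": "LIVER", "treatment": "vehicle"},
    {"id": "S2", "tissue": "LIVER", "treatment": "compound A"},
    {"id": "S3", "tissue": "KIDNEY", "treatment": "vehicle"},
]
grouped = group_samples(samples, study_key="tissue", group_key="treatment")
```

Groupings produced by such a rule can still be adjusted by hand after conversion, as the text notes.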
Sample Data Mapping: Nomenclature

To map individual sample data values to the sample data exchange format, differences of nomenclature, units, and formatting must be resolved. Differences in nomenclature are the most difficult to deal with, and often there is no single, optimal resolution for such differences. Various attributes in the data exchange formats are represented using controlled vocabularies. In particular, in the sample data exchange format, sample organ types, pathologies, and disease diagnoses are represented using subsets of the SNOMED vocabulary [11].

External sample data repositories often use their own vocabularies for such concepts, and even within a given standard such as SNOMED, different pathologists or other experts may not agree on which term should be used for a certain disease or organ type. For example, in a recent integration project, a customer


included samples with the diagnosis labeled DIABETES. The SNOMED vocabulary includes several varieties of diabetes and related complications, so it was necessary to consult with the customer to determine the best choice of mapping. After some discussion it was determined that, given the differences of interpretations, the best choice was to map this to the term OBESITY in GeneExpress. Similarly, the customer data might include abbreviations, such as DRG, which was mapped to DORSAL ROOT GANGLION, or common terms, such as FAT, which was mapped to ADIPOSE TISSUE. Moreover, a SNOMED term code is usually associated with one primary term and one or more synonyms. Some customers may prefer a different synonym than the one chosen by Gene Logic.

Sample data may also differ in the choice of units: For example, drug treatments can use units such as µMol or ng/ml, while age can be provided in days, weeks, or years.
A conversion table required to to map map any any units units to to comparable comparable units units in in the the sample sample data data exchange exchange format. format. Formatting of items also also needs resolved. For example, the Formatting of individual individual items needs to to be be resolved. For example, the sample data exchange uses the the terms sample data exchange format format uses terms Male and and Female to to represent represent the the sex sex of while a customer database of a a donor, donor, while a customer database may may use use male and and female or or just just M M and and F F.. Further, s misspelling fa Further, data data may may contain contain typographic typographic errors, errors, such such a as misspelling the the name name o of a supplier. supplier. When When vocabularies vocabularies are are small, small, or or for for controlled controlled vocabularies, vocabularies, it it may may be be possible possible to to spot spot and and correct correct such such errors errors manually, manually, but but in in general, general, these these errors errors can can go go undetected. undetected. All these these conflicts conflicts need need to to be be resolved resolved as as part part of of the the mapping mapping from from the the source source All data sample data cases, it data format format to to the the sample data exchange exchange format. format. In In some some cases, it is is not not possible possible to to implement such conflicts conflicts automatically, implement rules rules to to resolve resolve such automatically, so so manual manual inspection inspection and and curation performed before curation of of the the data data must must be be performed before mapping mapping it. it. 
In general, if the source data are consistent in their use of controlled vocabularies, formatting, and units, it is possible to hardwire the correct mappings into the mapping implementation. However, whenever a new conflict arises, it is necessary to find a resolution and adapt the mapping implementation.

Sample mapping provides consistency between Gene Logic and customer sample classifications rather than finding an optimal classification. Sample classification in GeneExpress is based on sound clinical and pathology principles in the strict framework of the SNOMED nomenclature. However, not all medical concepts map straightforwardly to SNOMED terms, and therefore, there may not be a best classification for a concept but rather several reasonable ones.
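A hardwired mapping of the kind described above can be sketched as a lookup table plus a list of terms that still need a human decision. The two table entries come from the examples in the text; the function name and the tiny `controlled` set are hypothetical stand-ins and do not represent the actual SNOMED subsets.

```python
# Hardwired resolutions agreed for one source (entries taken from the text).
TERM_MAP = {
    "DRG": "DORSAL ROOT GANGLION",
    "FAT": "ADIPOSE TISSUE",
}


def map_terms(values, term_map, controlled):
    """Map source terms to controlled terms; collect those with no resolution yet."""
    mapped, unresolved = {}, []
    for value in values:
        key = value.strip().upper()
        target = term_map.get(key, key)  # fall back to the normalized term itself
        if target in controlled:
            mapped[value] = target
        else:
            unresolved.append(value)     # a new conflict: needs curation
    return mapped, unresolved


# Toy controlled vocabulary, for illustration only.
controlled = {"DORSAL ROOT GANGLION", "ADIPOSE TISSUE", "LIVER"}
mapped, unresolved = map_terms(["drg", "Fat", "LIVER", "DIABETES"], TERM_MAP, controlled)
```

Each unresolved term corresponds to a new conflict in the sense of the text: once a resolution is agreed with the customer, it is added to the table and the mapping is rerun.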
Gene Annotation Data Mapping

In general, gene annotations are not involved in the integration of expression data from multiple sources. In certain cases, however, it is necessary to integrate gene annotations associated with non-Gene Logic expression data (e.g., to extend


the system to include custom probe arrays with proprietary gene fragments or to support a customer's proprietary gene annotation data). Gene annotations generally have well understood semantics, although there are ambiguities with regard to the classification of some of these annotations (see Pearson's article in Nature [19] for a discussion of problems associated with gene nomenclature and identification).

Because gene annotations are often stored in proprietary databases, a possible approach is to provide links to these annotations, instead of importing them into GXDW. This approach supports neither querying the contents of these databases directly nor superimposing expression data on these annotations (e.g., superimposing expression levels associated with genes displayed on a pathway or chromosome map), but it can make the information readily accessible from GeneExpress.
In such cases, individual genes within GeneExpress are linked to network-accessible reports or interactive services. When query access is required, custom gene annotations can be integrated into GeneExpress using a mechanism similar to that used for sample data. Defining a mapping remains non-trivial, but as gene annotation data are often more rigorously structured than clinical information, the problem is usually less severe.

Another problem specific to gene annotations is the fact that related but different annotations are likely to reside in multiple sources. This introduces a key challenge: reconciling differences between different gene annotation sources. When different versions of a single source (e.g., UniGene) conflict, it is usually acceptable to defer to the newer version. When different sources conflict, there may not be an ideal way of resolving the differences.
In addition, a customer may prefer alternative sources for gene annotation data (e.g., protein data sources other than Swiss-Prot or sequence clusters other than those provided in UniGene) rather than those used in GeneExpress. Even when the same data sources are used, different refresh policies may lead to the use of different versions or different builds of the same data source. Furthermore, there may be multiple ways to associate two related biological objects (e.g., links from gene fragments to known gene clusters may be based on data supplied by the probe array manufacturer or on homology searches using the fragment's target sequence). Consequently, integrating customer gene annotations with Gene Logic gene annotations requires resolving potentially complex data discrepancies.
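The reconciliation policy stated in this section (defer to the newer version within one source, but treat disagreement between sources as unresolved) can be sketched as follows. The record layout, the integer version numbers, and the source names `source_x` and `source_y` are assumptions made for illustration.

```python
def reconcile(records):
    """Reconcile annotation records shaped as {gene, source, version, value}.

    Returns (resolved, conflicts): `resolved` maps a gene to its agreed value,
    `conflicts` maps a gene to the differing values reported by each source.
    """
    # Step 1: within each source, keep only the newest version per gene.
    newest = {}
    for r in records:
        key = (r["gene"], r["source"])
        if key not in newest or r["version"] > newest[key]["version"]:
            newest[key] = r

    # Step 2: compare the surviving values across sources.
    by_gene = {}
    for (gene, source), r in newest.items():
        by_gene.setdefault(gene, {})[source] = r["value"]

    resolved, conflicts = {}, {}
    for gene, per_source in by_gene.items():
        values = set(per_source.values())
        if len(values) == 1:
            resolved[gene] = values.pop()
        else:
            conflicts[gene] = per_source  # no automatic winner between sources
    return resolved, conflicts


records = [
    {"gene": "g1", "source": "UniGene", "version": 1, "value": "cluster A"},
    {"gene": "g1", "source": "UniGene", "version": 2, "value": "cluster B"},
    {"gene": "g2", "source": "source_x", "version": 1, "value": "kinase"},
    {"gene": "g2", "source": "source_y", "version": 1, "value": "phosphatase"},
]
resolved, conflicts = reconcile(records)
```

The sketch deliberately stops at reporting cross-source conflicts; as the text says, there may be no ideal way to resolve them automatically.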

10.4.4  Data Loading Issues


Once data from external data sources have been mapped to the data exchange formats, additional processing and curation may be required before integrating and loading them into the warehouse. First, it is necessary to detect invalid data,


such as missing clinical data associated with samples or inconsistent associations of sample and gene expression data. In general, data migration tools, such as GX Connect, handle such cases by skipping the data affected by errors and issuing warning messages in a log file. Data editing can be used to correct problems not resolved during the mapping process.

Next, differences between identifiers of external objects and objects already in the warehouse must be resolved to maintain database consistency. Transformations of this type are carried out using staging databases before loading data into the warehouse itself. In addition, it is necessary to keep track of any identifiers created for customer data so that if customer data objects are dropped and reloaded (e.g., to allow the data to be edited), they do not reappear with different identifiers. Finally, derived data, such as quality control data (e.g., measures of saturation for the scanners), are also computed during the final loading stage.
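Two of the loading behaviors described above, skipping invalid records with a logged warning and reissuing the same identifier when an object is reloaded, can be sketched together. This is not how GX Connect is implemented; the class `IdRegistry`, the function `load`, and the required fields are hypothetical, and a real registry would be persisted (for example in a staging database) rather than held in memory.

```python
import logging

log = logging.getLogger("loader")


class IdRegistry:
    """Remembers the warehouse identifier issued for each external key."""

    def __init__(self, start=1):
        self._ids = {}
        self._next = start

    def id_for(self, external_key):
        # Reuse the identifier if this object was loaded before.
        if external_key not in self._ids:
            self._ids[external_key] = self._next
            self._next += 1
        return self._ids[external_key]


def load(records, registry, required=("sample_id", "diagnosis")):
    """Return loadable records with stable ids; skip and log invalid ones."""
    loaded = []
    for rec in records:
        missing = [name for name in required if not rec.get(name)]
        if missing:
            log.warning("skipping %r: missing %s", rec.get("sample_id"), missing)
            continue
        loaded.append({**rec, "warehouse_id": registry.id_for(rec["sample_id"])})
    return loaded


registry = IdRegistry()
first = load(
    [{"sample_id": "S1", "diagnosis": "NORMAL"}, {"sample_id": "S2", "diagnosis": ""}],
    registry,
)
# Drop and reload S1 after an edit: it keeps its identifier.
second = load([{"sample_id": "S1", "diagnosis": "NORMAL (edited)"}], registry)
```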

10.4.5  Update Issues


Section 10.2.2 describes the process of updating a GeneExpress system containing only Gene Logic data. The content update becomes more complex if the system contains both Gene Logic data and customer data, loaded either with the GX Connect tool or with custom tools. Both Gene Logic data and customer data change over time; therefore, content update procedures must ensure that new data from both sources are maintained correctly in the GeneExpress data warehouse.

Data in the GXDW can be classified into: (1) data shared by Gene Logic and customer data stores, such as controlled vocabularies; and (2) data that are not shared, that is, data generated by either Gene Logic or the customer only. Examples of shared data include SNOMED terms and species information in the sample database and probe array types and algorithm types in the expression database. When performing a content update, shared data occurring in both the Gene Logic and the customer data contents must be consolidated.
Examples of data that are not shared include data pertaining to an individual sample in the sample database and experiment expression values in the expression database. It is not necessary to merge these data because customer sample and experiment objects are always distinct from Gene Logic sample and experiment objects. Instead, separate spaces of object identities are maintained for customer and Gene Logic data.

Depending on the nature of the data, a variety of techniques can be used for handling updates. For example, because it is not necessary to merge data for individual experiments or samples, such as expression values, from different data sources, these data can reside in different database partitions. In this case, content


update is as simple as replacing a database partition. On the other hand, controlled vocabularies and other shared data must be consolidated; therefore, special tools are required to reconcile terms in customer and Gene Logic data and to make sure they are consistent in the integrated warehouse (e.g., having the same ID values). The consolidation process involves resolving the identification of objects and terms, as well as object references.
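The consolidation of shared vocabularies can be illustrated with a sketch that merges a customer vocabulary into a base vocabulary and then rewrites object references to use the consolidated ID values. Everything here is an assumption made for illustration: the term-to-ID dictionaries, the policy of giving customer-only terms IDs from a separate range, and the `organ_id` reference field.

```python
def consolidate(base_vocab, customer_vocab, first_new_id):
    """Merge two {term: id} vocabularies.

    Shared terms keep the base ID; customer-only terms receive new IDs starting
    at `first_new_id`. Returns (merged vocabulary, {old customer id: new id}).
    """
    merged = dict(base_vocab)
    remap = {}
    next_id = first_new_id
    for term, old_id in customer_vocab.items():
        if term not in merged:
            merged[term] = next_id
            next_id += 1
        remap[old_id] = merged[term]
    return merged, remap


def rewrite_references(objects, remap, ref_field="organ_id"):
    """Return copies of the objects whose vocabulary reference uses consolidated IDs."""
    return [{**obj, ref_field: remap[obj[ref_field]]} for obj in objects]


base = {"LIVER": 10, "KIDNEY": 11}
customer = {"LIVER": 1, "SPLEEN": 2}
merged, remap = consolidate(base, customer, first_new_id=1000)
customer_samples = rewrite_references([{"id": "S1", "organ_id": 1}], remap)
```

After this step a term such as LIVER has one ID value throughout the integrated warehouse, while the unshared sample objects themselves are left untouched and can still be replaced partition by partition.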

10.5  SUMMARY

This chapter provided discussion of the data integration challenges involved in building a system for managing gene expression data and how these challenges have been addressed in the GeneExpress system and in the context of several GeneExpress integration projects. A data warehouse approach and tools were used in developing GeneExpress and were found to provide an effective environment for developing a system to support the integration and management of data from diverse sources, in which data may be imprecise and may evolve over time. Other non-warehouse (i.e., non-materialized view) approaches were also briefly considered, based on previous experience with developing genomic data management systems using the Object Protocol Model (OPM) tools [20], but they were not adopted for reasons similar to those described by Davidson et al. [17]. The data warehouse approach has proven well suited for systems such as GeneExpress that need to integrate data from multiple data sources, with data requiring validation and cleansing, and in cases where system performance and robustness are critical. However, the general data warehouse approach cannot be applied as is to the gene expression domain and needs to be adapted [14]. Also, coping with issues of data semantics in the area of genomic applications remains complex and difficult and often requires manual solutions.

Because good performance is a critical requirement for GeneExpress, a comprehensive set of benchmarks has been devised to assess system performance continuously as its data content grows. The benchmarks involve running typical queries and expression analysis operations on a series of data sets, using various configurations of Sun SparcUltra II- and III-based servers and Pentium-based clients.
These These benchmarks benchmarks first first measure measure the the single-user single-user performance performance of of query query and and analysis operations, then measure multi-user performance with up to 300 simu analysis operations, then measure multi-user performance with up to 300 simulated analysis steps all available lated concurrent concurrent users, users, each each running running analysis steps across across all available array array types. types. It It was was found found that, that, given given sufficient sufficient server server system system memory, memory, performance performance for for multi multiple ple users users scaled scaled linearly linearly with with the the number number of of processors processors and and number number of of concurrent concurrent users. USCI'S.


Though this chapter has focused on the GeneExpress system and the Affymetrix GeneChip platform, the challenges addressed by the GeneExpress system are shared by other systems for managing and analyzing gene expression data. In particular, for all gene expression platforms, the problems associated with relating the data to gene and sample annotations and issues such as compatibility of array versions and analysis algorithms are similar.

The first version of GeneExpress was released in early 2000. Through the end of 2002, the GeneExpress system has evolved through several versions and has been deployed at more than 25 biotech and pharmaceutical companies worldwide, and at several academic institutions. Based on the experience gained in developing tools for incorporating customer data into GeneExpress, the GX Connect tool has been developed to provide support for interactive extraction, transformation, and loading of gene expression data generated using the Affymetrix GeneChip platform and related clinical data into GeneExpress. GeneExpress and GX Connect are deployed together as part of the Genesis Enterprise System [2].

Five data integration systems that provide support for integrating gene expression data from both Gene Logic and customer sources have been deployed through the end of 2002. All these systems provide support for integrating sample (clinical) data based on proprietary data formats and allow regular incremental updates of customer data; two of these systems provide support for custom Affymetrix GeneChip probe arrays; and one system also provides support for proprietary gene annotations.


ACKNOWLEDGMENTS

We want to thank our past and present colleagues at Gene Logic who have been involved in the development of the GeneExpress and Genesis systems for their outstanding work. Doug Dolginow initiated the development of GeneExpress at Gene Logic and had the key role in defining user requirements for GeneExpress Explorer. Kevin McLoughlin has led the development of GeneExpress Explorer, probably the best known part of GeneExpress. Special thanks to Mike Cariaso, François Collin, William Craven, Michael Elashoff, Aaron Hechmer, and Dmitry Krylov for their feedback and contributions to this paper.

TRADEMARKS


GeneExpress®, GX™, and Genesis Enterprise System™ are trademarks owned by Gene Logic Inc. Affymetrix® and GeneChip® are trademarks owned by Affymetrix, Inc.


REFERENCES
[1] D. J. Lockhart and A. E. Winzeler. "Genomics, Gene Expression, and DNA Arrays." Nature 405 (2000): 827-836.
[2] Gene Logic Products. http://www.genelogic.com/products.htm. See GeneExpress product line and Genesis Enterprise software.
[3] D. J. Lockhart, H. Dong, M. C. Byrne, et al. "Expression Monitoring by Hybridization to High-Density Oligonucleotide Arrays." Nature Biotechnology 14 (1996): 1675-1680.
[4] V. M. Markowitz and T. Topaloglou. "Applying Data Warehousing Concepts to Gene Expression Data Management." In Proceedings of the 2nd IEEE International Symposium on Bioinformatics and Bioengineering, 65-72. Bethesda, MD: IEEE Computer Society, 2001.
[5] A. Brazma, P. Hingamp, and J. Quackenbush. "Minimum Information About a Microarray Experiment (MIAME): Towards Standards for Microarray Data." Nature Genetics 29, no. 4 (2001): 365-371.
[6] Human Gene Nomenclature Database. http://www.gene.ucl.ac.uk/nomenclature/.
[7] Affymetrix GeneChip Analysis Suite User Guide. Affymetrix, 2000.
[8] R. A. Irizarry, B. Bolstad, F. Collin, et al. "Summaries of Affymetrix GeneChip Probe Level Data." Nucleic Acids Research 31, no. 4 (2003).
[9] C. Li and W. Wong. "Model-Based Analysis of Oligonucleotide Arrays: Expression Index Computation and Outlier Detection." Proceedings of the National Academy of Science 98 (1998): 31-36.
[10] Affymetrix. Affymetrix Analysis Data Model. http://www.affymetrix.com/support/.
[11] SNOMED. Systematized Nomenclature for Medicine. http://www.snomed.org/.
[12] The Gene Ontology Consortium. "Gene Ontology: Tool for the Unification of Biology." Nature Genetics 25 (2000): 25-29. http://www.geneontology.org.
[13] V. M. Markowitz, I. A. Chen, and A. Kosky. "Gene Expression Data Management: A Case Study." In Proceedings of the 8th International Conference on Extending Database Technology (EDBT), from the series Lecture Notes in Computer Science, edited by C. S. Jensen, K. G. Jeffery, J. Pokorny, et al., 722-731. Heidelberg, Germany: Springer-Verlag, 2002.
[14] A. A. Hill, E. L. Brown, M. Z. Whitley, et al. "Evaluation of Normalization Procedures for Oligonucleotide Array Data Based on Spiked cRNA Controls." Genome Biology 2, no. 12 (2001): 0055.1-0055.13. http://www.genomebiology.com/2001/2/12/research/0055/.
[15] D. C. Hoaglin, F. Mosteller, and J. W. Tukey. Understanding Robust and Exploratory Data Analysis. New York: John Wiley, 1983.

[16] B. A. Eckman, A. S. Kosky, and A. L. Laroco. "Extending Traditional Query-Based Integration Approaches for Functional Characterization of Post-Genomic Data." Bioinformatics 17, no. 7 (2001): 587-601.
[17] S. B. Davidson, J. Crabtree, B. Brunk, et al. "K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources." IBM Systems Journal 40, no. 2 (2001): 512-531.
[18] Presentations. Third International Meeting on Microarray Data Standards, Annotations, Ontologies, and Databases, March 25-31, 2001. Palo Alto, CA: Stanford University, 2001. http://www.mged.sourceforge.net/ontologies/index.php.
[19] H. Pearson. "Biology's Name Game." Nature 417 (2001): 631-632.
[20] V. M. Markowitz, I. A. Chen, A. Kosky, et al. "OPM: Object-Protocol Model Data Management Tools." In Bioinformatics: Databases and Systems, edited by S. I. Letovsky, 187-199. Boston: Kluwer Academic, 1999.


CHAPTER 11

DiscoveryLink


Laura M. Haas, Barbara A. Eckman, Prasad Kodali, Eileen T. Lin, Julia E. Rice, and Peter M. Schwarz
DiscoveryLink enables the integration of diverse data from diverse sources into a single, virtual database, with the goal of making it easier for scientists to find the information they need to prevent and cure diseases. To progress in this quest, scientists need to answer questions that relate data about genomics, proteomics, chemical compounds, and assay results, which are found in relational databases, flat files, extensible markup language (XML), Web sites, document management systems, applications, and special-purpose systems. They need to search through large volumes of data and correlate information in complex ways.

In bioinformatics research in the post-genomic era, the sheer volume of data and number of techniques available for use in the identification and characterization of regions of functional interest in the genomic sequence are increasing too quickly to be managed by traditional methods. Investigators must deal with the enormous influx of genomic sequence data from human and other organisms. The results of analysis applications such as the Basic Local Alignment Search Tool (BLAST) [1], PROSITE [2], and GeneWise [3] must be integrated with a large variety of sequence annotations found in data sources such as GenBank [4], Swiss-Prot [5], and PubMed [6]. Public and private repositories of experimental results, such as the Jackson Laboratory's Gene Expression Database (GXD) [7], must also be integrated. Deriving the greatest advantage from this data requires full, query-based access to the most up-to-date information available, irrespective of where it is stored or its format, with the flexibility to customize queries easily to meet the needs of a variety of individual investigators and protein families.

In an industrial setting, mergers and acquisitions increase the need for data integration in the life science industry in general and the pharmaceutical industry in particular. Even without mergers, in a typical pharmaceutical company, the research groups are geographically dispersed and divided into groups based on therapeutic areas. Scientists in each of these therapeutic areas might be involved in various stages of the drug discovery process such as target identification, target validation, lead identification, lead validation, and lead optimization. During each of these stages, they need to access diverse data sources, some specific to the therapeutic area of interest and the particular stage of the process and others that are of value to many therapeutic areas and at many stages of the process. Providing the data integration infrastructure to support this research environment (geographically dispersed research groups accessing different sets of diverse data sources depending on their area of research and the stage of the drug discovery process) is a daunting task for any information technology (IT) group.

As pharmaceutical companies try to shorten the drug-discovery cycle, they must identify new drug candidates more quickly by increasing the efficiency of the research processes and eliminating the false positives earlier in the discovery process. Providing scientists with easy access to the relevant information is essential. Researchers working in the gene expression domain may gain valuable insights if they have access to data from comparative genomics, biological pathways, or cheminformatics. This is also true for a scientist working in the lead identification or optimization areas.

This case can be illustrated by an example. A research group in a pharmaceutical company working in a particular therapeutic area might be interested in looking at all the compounds active in biological assays that have been generated and tested for a given receptor. In addition, the researchers might be interested in looking at similar compounds and their activities against similar receptors. This will help them understand the specificity and selectivity of the compounds identified. The knowledge that a particular set of compounds was considered for a different therapeutic area by another team could help them develop new leads or eliminate compounds that are not specific in their activities. To answer these queries, the research group must correlate information from multiple databases, some relational (e.g., the assay data may be stored relationally), some not (the chemical structure data might be stored by a special-purpose system), and use specialized functions of the data sources (e.g., similarity searches involving compound structures or DNA sequences).

There are many different approaches to integrating diverse data sources. Often, integration is provided by applications that can talk to one of several data sources, depending on the user's request. In these systems, access to the data sources is typically "hard-wired." Replacing one data source with another means rewriting a portion of the application. In addition, data from different sources cannot be compared in response to a single request unless the comparison is likewise wired into the application. Moving all relevant data to a warehouse allows greater flexibility in retrieving and comparing data, but at the cost of re-implementing or losing the specialized functions of the original source, as well as the cost of maintenance. A third approach is to create a homogeneous object layer to encapsulate diverse sources. This encapsulation makes applications easier to write and more extensible, but it does not solve the problem of comparing data from multiple sources.


To return to the example, today this problem would be addressed by writing an application that accesses chemical structure databases (with specific functionality such as similarity or substructure searches), assay databases (maybe in relational format), and sequence databases (flat file, relational, or XML format). Answering the previous question requires multiple queries against these data sources:

1. "Show me all the active compounds for each of the assays for a particular receptor."
2. "Show me all the compounds that are similar to the top five compounds from the previous query" (may require multiple requests, one per compound, depending on the sophistication of the data store and application).
3. "Do a BLAST [1] run to find similar receptors."
4. "Show me the results of the compounds from Query 2 from all the assays against the receptors of Query 3" (may require multiple requests, one per compound or one per compound-receptor pair, depending on the sophistication of the data store and application).
5. "Sort the result set by order of the specificity or selectivity information" (if multiple queries were needed in Step 4).
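The five steps above can be sketched as a hand-written application. The following is only an illustration under loud assumptions: the three "sources" are hypothetical in-memory stand-ins (a list of assay rows and two lookup tables), the similarity and BLAST steps are replaced by dictionary lookups, and "top five" is simplified to the first five compounds in alphabetical order. None of the names come from the chapter.

```python
# Hypothetical in-memory stand-ins for three separate data sources.
ASSAY_RESULTS = [  # assay store: (compound, receptor, active)
    ("C1", "R1", True), ("C2", "R1", True), ("C3", "R1", False),
    ("C4", "R2", True), ("C1", "R2", False),
]
SIMILAR_COMPOUNDS = {"C1": ["C4"], "C2": ["C3"]}  # structure store stand-in
SIMILAR_RECEPTORS = {"R1": ["R2"]}                # stand-in for a BLAST run


def active_compounds(receptor):
    """Query 1: compounds active in any assay against the receptor."""
    return sorted({c for c, r, active in ASSAY_RESULTS if r == receptor and active})


def similar_compounds(compound):
    """Query 2: one request per compound against the structure store."""
    return SIMILAR_COMPOUNDS.get(compound, [])


def similar_receptors(receptor):
    """Query 3: lookup standing in for a sequence similarity search."""
    return SIMILAR_RECEPTORS.get(receptor, [])


def assay_results(compound, receptor):
    """Query 4: one request per compound-receptor pair."""
    return [row for row in ASSAY_RESULTS if row[0] == compound and row[1] == receptor]


def workflow(receptor, top_n=5):
    """Chain the requests by hand, merging partial results in the application."""
    actives = active_compounds(receptor)[:top_n]
    candidates = sorted({s for c in actives for s in similar_compounds(c)})
    receptors = similar_receptors(receptor)
    rows = [row for c in candidates for r in receptors for row in assay_results(c, r)]
    # Query 5: the application itself sorts the merged rows (active results first).
    return sorted(rows, key=lambda row: (not row[2], row[0]))
```

Even in this toy form, the application owns the control flow, the per-compound request loops, and the final merge and sort, which is the burden the text attributes to the hand-written approach.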

Depending on the activities of the set of compounds, various scenarios emerge that tell the researchers how best to continue their research. However, the application is hard to write and may need to be extended if additional sources are needed (e.g., if a new source of compound or assay information is acquired).

A virtual database, on the other hand, offers users the ability to combine data from multiple sources in a single query without creating a physical warehouse. DiscoveryLink [8] uses federated database technology to provide integrated access to data sources used in the life sciences industry. The federated middleware wraps the actual data sources, providing an extensible framework and encapsulating the details of the sources and how they are accessed. In this way, DiscoveryLink provides users with a virtual database to which they can pose arbitrarily complex queries in the high-level, non-procedural query language SQL. DiscoveryLink focuses on efficiently answering these queries, even though the necessary data may be scattered across several different sources, and those sources may not themselves possess all the functionality needed to answer such a query. In other words, DiscoveryLink is able to optimize queries and compensate for SQL functions that may be lacking in a data source. Additionally, queries can exploit the specialized functions of a data source, so no functionality is lost in accessing the source through DiscoveryLink. Using DiscoveryLink in the example, a single query could retrieve the structures of compounds that are active in multiple assays against different receptors.
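To make the idea of a single query over several sources concrete, the sketch below uses Python's standard `sqlite3` module, with SQLite's `ATTACH DATABASE` standing in for federation. This is not DiscoveryLink or DB2 syntax; the table names, column names, and sample rows are hypothetical, and two in-memory SQLite databases merely play the roles of an assay source and a chemical structure source.

```python
import sqlite3


def build_sources():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")                  # plays the assay source
    conn.execute("ATTACH DATABASE ':memory:' AS chem")  # plays the structure source
    conn.execute(
        "CREATE TABLE assay_result (compound_id TEXT, receptor TEXT, active INTEGER)"
    )
    conn.execute("CREATE TABLE chem.compound (compound_id TEXT, structure TEXT)")
    conn.executemany(
        "INSERT INTO assay_result VALUES (?, ?, ?)",
        [("C1", "R1", 1), ("C1", "R2", 1), ("C2", "R1", 1),
         ("C2", "R2", 0), ("C3", "R2", 1)],
    )
    conn.executemany(
        "INSERT INTO chem.compound VALUES (?, ?)",
        [("C1", "CCO"), ("C2", "CCN"), ("C3", "c1ccccc1")],
    )
    return conn


def multi_receptor_actives(conn):
    """One SQL statement: structures of compounds active against more than one receptor."""
    query = """
        SELECT c.compound_id, c.structure, COUNT(DISTINCT a.receptor) AS n_receptors
        FROM chem.compound AS c
        JOIN assay_result AS a ON a.compound_id = c.compound_id
        WHERE a.active = 1
        GROUP BY c.compound_id, c.structure
        HAVING COUNT(DISTINCT a.receptor) > 1
        ORDER BY c.compound_id
    """
    return conn.execute(query).fetchall()
```

Calling `multi_receptor_actives(build_sources())` joins the two databases in one declarative statement, so the join, grouping, and ordering are left to the query engine rather than coded in the application. A real federated system would additionally have to handle non-relational sources and source-specific functions, which this sketch does not attempt.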


Views could be defined to create a canonical representation of the data. Furthermore, the query would be optimized for efficient execution. DiscoveryLink's goal is to give the end user the perspective of a single data source, saving effort and frustration. In a real scenario, before researchers propose the synthesis and testing of an interesting compound they have found, they would like to know the toxicity profile of the compound and related compounds and also the pathways in which the compound or related compounds might be involved. This would require gathering information from a (proprietary) toxicity database, as well as one with information on metabolic pathways, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), and using the structures and names of the compounds to look up the data, another series of potentially tricky queries without an engine such as DiscoveryLink.

This chapter presents an overview of DiscoveryLink and shows how it can be used to integrate life sciences data from heterogeneous data sources. The next section provides an overview of the DiscoveryLink approach, discussing the data representation, query capability, architecture, and the integration of data sources, as well as providing a brief comparison to other systems for data integration. Section 11.2 focuses on query processing and optimization. Section 11.3 addresses performance, scalability, and ease of use. The final section concludes with some thoughts on the current status and success of the system, as well as some directions for future enhancements.

11.1 APPROACH
DiscoveryLink is based on federated database technology, which offers powerful facilities for combining information from multiple data sources. Built on technology from an earlier product, DB2 DataJoiner [9], and enhanced with additional features for extensibility and performance from the Garlic research project [10, 11], DiscoveryLink's federated database capabilities provide a single, virtual database to users. DB2 DataJoiner first introduced the concept of a virtual database, which is created by federating together multiple heterogeneous, relational data sources. Users of DB2 DataJoiner could pose arbitrary queries over data stored anywhere in the federated system without worrying about the data's location, the SQL dialect of the actual data store(s), or the capabilities of those stores. Instead, users had the full capabilities of DB2 against any data in the federation. The Garlic project demonstrated the feasibility of extending this idea to build a federated database system that effectively exploits the query capabilities of diverse, often non-relational data sources. In both of these systems, as in DiscoveryLink,
a middleware query processor develops optimized execution plans and compensates for any functionality the data sources may lack.

There are many advantages of a federated database approach to integrating life science data. In particular, this approach is characterized by transparency (the degree to which it hides all details of data location and management), heterogeneity (the extent to which it tolerates data source diversity), a high degree of function providing the benefits of both SQL and the underlying data source capabilities, autonomy for the underlying federated sources, easy extensibility, openness, and optimized performance. All other approaches fall short in one or another of these categories. These other approaches are numerous, including domain-specific solutions, language-based frameworks, dictionary-based solutions, frameworks based on an object model, and data warehousing approaches. For example, companies like Informax provide data retrieval and data integration for biological databases.
Their system, and many like it, benefits from being created specifically for bioinformatics data, but as a result, it cannot readily exploit advances in query processing (e.g., in the relational database industry). Kleisli's [12] Collection Programming Language (CPL), presented in Chapter 6, allows the expression of complicated transformations across heterogeneous data sources, but it provides no global schema, making query formulation and optimization difficult. The Sequence Retrieval System (SRS) [13, 14], presented in Chapter 5, provides fast access to a vast number of text files, and LION provides a rich biology workbench of integrated tools built on SRS. SRS has its own proprietary query language, which offers excellent support for navigational access but less power for cross-source querying than SQL. In fact, LION's DiscoveryCenter uses DiscoveryLink to extend its database integration capabilities.
Biomax provides similar functionality in its Biological Databanks Retrieval System (BioRS) tool, with cleanly structured interfaces based on the Common Object Request Broker Architecture (CORBA) for scalability on multi-processors and workstations alike. BioRS also offers a curated and annotated database of the human genome and a number of powerful analysis tools. Again, while the domain-specific tooling makes this a great package for biologists, the language used for queries is more limited than SQL. Accelrys provides a relational data management and analysis package, SeqStore, and a rich set of bioinformatics tools for sequence analysis in the Genetics Computer Group (GCG) Wisconsin Package. SeqStore includes a relational data warehouse for sequence data, coupled with tools to receive automated updates, to analyze sequences with the wide range of analyses available in the GCG Wisconsin Package, and to create automated sequence analysis pipelines.
The warehousing approach requires that data be moved (or copied), interfering with source autonomy and limiting the extensibility of the system, or at least
making it harder to extend. The object frameworks, such as that provided by Tripos, provide only limited transparency. Similar arguments apply to most other bioinformatics integration engines.

The two biology-focused integration engines that come closest to DiscoveryLink's vision are Gene Logic's Object Protocol Model (OPM) [15] and the Transparent Access to Multiple Biological Information Systems (TAMBIS) [16], presented in Chapter 7. OPM provides a virtual, object-oriented database, with queries in the proprietary OPM-MQL query language over diverse query sources. OPM's query optimization is rule-based and somewhat limited, because of the difficulties of optimizing over its more complex data model.
While an object-oriented model is a natural choice for modeling life sciences data, and OPM's class methods have been demonstrated to add significant scientific value [17], DiscoveryLink follows an industry standard (relational), in the belief that the virtues of openness and the benefits of riding on technology that is constantly evolving and growing in power (due to the large number of users and uses) outweigh the annoyances of modeling data as relations. In fact, the database industry is now rapidly adding support for XML and XQuery to its once purely relational products; DiscoveryLink will exploit these capabilities as they become available, alleviating any modeling issues substantially. For example, the DiscoveryLink engine already supports SQL/XML functions that allow it to return XML documents instead of tuples.

TAMBIS is unique in its use of an ontology to guide query formulation, query processing, and data integration.
It also offers users a virtual database and deals with a great deal of heterogeneity. Originally based on CPL wrappers for accessing data sources, TAMBIS now uses a more general Java wrapper mechanism. TAMBIS focuses on supporting direct user interactions, unlike DiscoveryLink, which is meant to be a general infrastructure against which many different query tools can be used. Again, DiscoveryLink benefits from its open, industry-standard interfaces for both queries and wrappers. However, the use of an ontology to generate and refine queries is a powerful mechanism, and the marriage of such techniques to DiscoveryLink middleware could be explored to provide a more biology-centric experience for users.

Because DiscoveryLink is a general platform for data integration, it also can be compared to other database integration offerings. Most of the major database vendors offer some sort of cross-database query product, often called a gateway.
vendors often called For example, example, Oracle Oracle offers offers both both dblinks db/inks (for (for cross-Oracle cross-Oracle queries) queries) and and Oracle Oracle For Gateway (for (for more more heterogeneous heterogeneous data data sources). sources). DiscoveryLink DiscoveryLink difdif Transparent Transparent Gateway fers from from these these and and other other products products in in three three fundamental fundamental ways: ways: (1) ( 1 ) It It offers offers an an open open fers application programming programming interface interface (API) (API) for for wrapper wrapper construction; construction; (2) (2) it it allows allows application the use use of of data data source source functions functions in in queries queries that that span span multiple multiple data data sources; sources; and and (3) (3) the it has has the the most most powerful powerful optimization optimization capabilities capabilities available available (it (it is is the the only only system system it

Approach 1 1 . 1 Approach 11.1

== == == == == == == == == == == == == == == == == == == == == == 309 309

(JDBC/ODBC)

SQL API

(Optional)
1 1.1 11.1 F IGURE FIGURE

DiscoveryLink DiscoveryLink architecture. architecture.

that wrappers during that takes takes query-specific query-specific input input from from wrappers during query query planning). planning). Few Few sys systems tems offer offer the the same same degree degree of of transparency transparency and and the the same same query query processing processing power power against against heterogeneous heterogeneous sources. sources.

11.1.1 Architecture


The overall architecture of DiscoveryLink, shown in Figure 11.1, is common to many heterogeneous database systems, including the Stanford-IBM Manager of Multiple Information Sources (TSIMMIS) [18], Distributed Information Search Component (DISCO) [19], Pegasus [20], Distributed Interoperable Object Model (DIOM) [21], Heterogeneous Reasoning and Mediator System (HERMES) [22], and Garlic [10, 11]. Applications connect to the DiscoveryLink server using any of a variety of standard database client interfaces, such as Call Level Interface (CLI) [23], Open Database Connectivity (ODBC), or Java Database Connectivity (JDBC), and submit queries to DiscoveryLink in standard SQL (specifically SQL3 [24]). The information required to answer the query comes from the local database and/or from one or more data sources, which have been identified to DiscoveryLink through a process called registration. The data from the sources is modeled as relational tables in DiscoveryLink.
The user sees a single, virtual relational database, with the original locations and formats of the sources hidden. The full power of SQL is supported against all the data in this virtual database, regardless of where the data is actually stored and whether the data source actually supports the SQL operations. When an application submits a query to the DiscoveryLink server, the server identifies the relevant data sources and develops a query execution plan for
obtaining the requested data. The plan typically breaks the original query into fragments that represent work to be delegated to individual data sources and additional processing to be performed by the DiscoveryLink server to filter, aggregate, or merge the data. The ability of the DiscoveryLink server to further process data received from sources allows applications to take advantage of the full power of the SQL language, even if some of the information they request comes from data sources with little or no native query processing capability, such as simple text files. The local data store allows query results to be stored for further processing and refinement, if desired, and also provides temporary storage for partial results during query processing.

The DiscoveryLink server communicates with a data source by means of a wrapper [11], a software module tailored to a particular family of data sources. The wrapper for a data source is responsible for four tasks:
• Mapping the information stored by the data source into DiscoveryLink's relational data model
• Informing DiscoveryLink about the data source's query processing capabilities by analyzing plan fragments during query optimization
• Mapping the query fragments submitted to the wrapper into requests that can be processed using the native query language or programming interface of the data source
• Executing such requests and returning results
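
The four tasks above can be sketched in code. The following Python outline is a conceptual illustration only: actual DiscoveryLink wrappers are C++ shared libraries, and every name used here (`Wrapper`, `FlatFileWrapper`, `describe_tables`, `analyze_fragment`, `translate`, `execute`, and the dictionary layout of a plan fragment) is hypothetical and not part of the DiscoveryLink wrapper API. The sketch assumes a data source that is a comma-separated text file with no query capability of its own; the sample rows are made up.

```python
import csv
import io
from abc import ABC, abstractmethod


class Wrapper(ABC):
    """Hypothetical interface mirroring the four wrapper tasks."""

    @abstractmethod
    def describe_tables(self):
        """Task 1: map the source's data into relational tables."""

    @abstractmethod
    def analyze_fragment(self, fragment):
        """Task 2: report which part of a plan fragment the source can run."""

    @abstractmethod
    def translate(self, fragment):
        """Task 3: turn an accepted fragment into a native request."""

    @abstractmethod
    def execute(self, request):
        """Task 4: run the native request and return rows."""


class FlatFileWrapper(Wrapper):
    """Wrapper for a text file: it can scan the file but cannot filter."""

    def __init__(self, table_name, text):
        self.table_name = table_name
        self.text = text

    def describe_tables(self):
        # The header line supplies the column names of the single table.
        header = next(csv.reader(io.StringIO(self.text)))
        return {self.table_name: header}

    def analyze_fragment(self, fragment):
        # Accept the table scan, decline every predicate.
        return {"table": fragment["table"], "predicates": []}

    def translate(self, fragment):
        # The "native request" for a plain file is simply: read all rows.
        return ("scan", fragment["table"])

    def execute(self, request):
        return list(csv.DictReader(io.StringIO(self.text)))


data = "gene,organism\nBRCA1,human\nTrp53,mouse\n"
wrapper = FlatFileWrapper("genes", data)
fragment = {"table": "genes", "predicates": [("organism", "=", "human")]}

accepted = wrapper.analyze_fragment(fragment)
rows = wrapper.execute(wrapper.translate(accepted))
```

Because this wrapper declines the predicate, `rows` holds every row of the file, and any filtering would be left to the middleware.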

The interface between the DiscoveryLink server and the wrapper supports the International Organization for Standardization (ISO) SQL/Management of External Data (SQL/MED) standard [25].

Wrappers are the key to extensibility in DiscoveryLink, so one of the primary goals for the wrapper architecture was to allow wrappers for the widest possible variety of data sources to be produced with a minimum of effort. Past experience has shown that this is feasible. To make the range of data sources that can be integrated using DiscoveryLink as broad as possible, a data (or application) source only needs to have some form of programmatic interface that can respond to queries and, at a minimum, return unfiltered data that can be modeled (by the wrapper) as rows of (one or more) tables. The author of a wrapper need not implement a standard query interface that may be too high-level or low-level for the underlying data source.
Instead, a wrapper provides information about a data source's query processing capabilities and specialized search facilities to the DiscoveryLink server, which dynamically determines how much of a given query
the data source is capable of handling. This approach allows wrappers for simple data sources to be built quickly, while retaining the ability to exploit the unique query processing capabilities of non-traditional data sources such as search engines for chemical structures or images. For DiscoveryLink, this design was validated by wrapping a diverse set of data sources including flat files, relational databases, Web sites, a specialized search engine for text, and the BLAST search engine.

To make wrapper authoring as simple as possible, only a small set of key services is required from a wrapper, and the approach ensures that a wrapper can be written with very little knowledge of DiscoveryLink's internal structure. As a result, the cost of writing a basic wrapper is small.
In past experience, a wrapper that just makes the data at a new source available to DiscoveryLink, without attempting to exploit much of the data source's native query processing capability, can be prototyped in a matter of days by someone familiar with the data source interfaces and the wrapper concepts. Because the DiscoveryLink server can compensate for missing functionality at the data sources, even such a simple wrapper allows applications to apply the full power of SQL to retrieve the new data and integrate it with information from other sources, albeit with perhaps less-than-optimal performance. Once a basic wrapper is written, it can be incrementally improved to exploit more of the data source's query processing capability, leading to better performance and increased functionality as specialized search algorithms or other novel query processing facilities of the data source are exposed.
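
The effect of incrementally improving a wrapper can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not DiscoveryLink's planner: the function names (`split_predicates`, `apply_predicates`, `run_query`), the representation of a predicate as a `(column, operator, value)` tuple, and the sample rows are all hypothetical. The source is simulated in-process; `supported_ops` stands in for whatever capabilities a wrapper reports.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def split_predicates(predicates, supported_ops):
    """Divide predicates into those the source runs and those the server runs."""
    pushed = [p for p in predicates if p[1] in supported_ops]
    residual = [p for p in predicates if p[1] not in supported_ops]
    return pushed, residual


def apply_predicates(rows, predicates):
    """Keep only the rows that satisfy every (column, op, value) predicate."""
    return [
        row for row in rows
        if all(OPS[op](row[col], value) for col, op, value in predicates)
    ]


def run_query(source_rows, predicates, supported_ops):
    """Simulate one source access followed by server-side compensation."""
    pushed, residual = split_predicates(predicates, supported_ops)
    # Work delegated to the data source (simulated here in-process).
    shipped = apply_predicates(source_rows, pushed)
    # Work the middleware performs on whatever the source returned.
    result = apply_predicates(shipped, residual)
    return result, len(shipped)


rows = [
    {"id": 1, "organism": "human", "length": 1863},
    {"id": 2, "organism": "mouse", "length": 390},
    {"id": 3, "organism": "human", "length": 393},
]
query = [("organism", "=", "human"), ("length", "<", 1000)]

# A basic wrapper reports no capabilities; an improved one handles equality.
basic_result, basic_shipped = run_query(rows, query, supported_ops=set())
better_result, better_shipped = run_query(rows, query, supported_ops={"="})
```

Both wrappers yield the same answer; the improved one simply ships fewer rows to the middleware, which is the performance gain the text describes.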
A DiscoveryLink wrapper is a C++ program, packaged as a shared library that can be loaded dynamically by the DiscoveryLink server when needed. Often a single wrapper is capable of accessing several data sources, as long as they share a common or similar API. For one thing, wrappers do not need to encode information on the schema of data in the source. For example, the Oracle wrapper provided with DiscoveryLink can be used to access any number of Oracle databases, each having a different schema. In fact, the same wrapper supports several Oracle release levels as well. This has a side benefit, namely that schemas can evolve without requiring any change in the wrapper as long as the source's API remains unchanged. In addition, wrappers can get connection information for individual servers from SQL Data Definition Language (DDL) statements, even if the schemas are identical.
On the other hand, there is a tradeoff between flexibility and ease of configuration (the more flexible the wrapper, the more it needs to be told during registration). For that reason, it is sometimes more practical to encode (parts of) the schema in the wrapper. For example, the BLAST wrapper defines many fixed columns, but allows the user to specify others that are appropriate for their instantiation of BLAST.
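
The tradeoff between encoding the schema in the wrapper and supplying it at registration can be sketched as follows. This is an illustration only: the class `ConfigurableWrapper`, the `register` function, the in-memory `catalog`, the column names, and the host names are all hypothetical, and real registration is done through DDL statements rather than Python calls.

```python
class ConfigurableWrapper:
    """One wrapper class serving several sources with different schemas."""

    # Columns built into the wrapper itself (hypothetical names).
    FIXED_COLUMNS = ("hit_id", "score")

    def __init__(self, host, user_columns=()):
        # Connection information and extra columns come from registration.
        self.host = host
        self.columns = list(self.FIXED_COLUMNS) + list(user_columns)


# Stand-in for the system catalogs that hold registration metadata.
catalog = {}


def register(nickname, host, user_columns=()):
    """Record one source's configuration under a nickname."""
    wrapper = ConfigurableWrapper(host, user_columns)
    catalog[nickname] = wrapper
    return wrapper


register("search_a", "host-a.example", ["organism"])
register("search_b", "host-b.example", ["organism", "description"])
```

The fixed columns need no configuration, while each user-specified column is something the wrapper has to be told during registration.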

This architecture has many benefits, as described previously. However, there are some controversial aspects. First and foremost, much biology data is semi-structured, and the current implementation forces data to be modeled relationally. While this does complicate wrapper writing somewhat, there are several examples of wrappers today that deal with nested and semi-structured data, including an XML wrapper. These wrappers expose their data as multiple relations, which can be joined to get back the full structure (note that the data is still stored in its nested form, and the joins are often translated into simple retrievals as a result). The future direction is to support XML and XQuery natively in DiscoveryLink's engine and to allow wrapper writers their choice of a relational or an XML model. That will make the modeling issues less painful.
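
Exposing nested data as multiple relations can be illustrated with a short sketch. It assumes a made-up nested record layout (entries that each contain a list of features); the function names `flatten` and `join` and all field names are hypothetical and do not describe the actual XML wrapper.

```python
def flatten(entries):
    """Expose nested entries as two flat relations linked by entry_id."""
    entry_rel, feature_rel = [], []
    for entry in entries:
        entry_rel.append({"entry_id": entry["id"], "name": entry["name"]})
        for feature in entry["features"]:
            feature_rel.append({
                "entry_id": entry["id"],
                "kind": feature["kind"],
                "start": feature["start"],
            })
    return entry_rel, feature_rel


def join(entry_rel, feature_rel):
    """Equi-join on entry_id, yielding one row per (entry, feature) pair."""
    by_id = {e["entry_id"]: e for e in entry_rel}
    return [{**by_id[f["entry_id"]], **f} for f in feature_rel]


nested = [
    {"id": "E1", "name": "alpha",
     "features": [{"kind": "exon", "start": 10},
                  {"kind": "exon", "start": 90}]},
    {"id": "E2", "name": "beta",
     "features": [{"kind": "signal", "start": 1}]},
]

entries, features = flatten(nested)
joined = join(entries, features)
```

Joining the two relations on the shared key recovers each feature together with its parent entry, which is why such a join can often be answered by a simple retrieval from the nested store.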
A second issue is the use of C++, a general-purpose and somewhat arcane programming language, for writing wrappers, as opposed to a simpler scripting language or a specialized wrapper construction mechanism. A general-purpose language was chosen for several reasons. First, DiscoveryLink is meant to handle large-scale queries over many data sources and large volumes of data. C++ is an efficient language, suitable for such applications. Second, DiscoveryLink wrappers are required to do more than ordinary connectors or adaptors, and the general-purpose programming language allows the wrapper writer complete flexibility in accomplishing the wrapper tasks. A toolkit and tools for wrapper development can ease the pain of programming by providing template functions, automatic generation of parts of the code, error checking, and so on. Last but not least, the DiscoveryLink engine happens to be written in C++, so this was by far the easiest to interface with the engine initially.
A Java version of the toolkit is currently being produced, as well as a set of generic wrappers for particular styles of data source access (e.g., a Web services wrapper, ODBC and JDBC wrappers, maybe even a Perl script wrapper). These facilities should increase the ease of adding new wrappers.

Related to the ease of wrapper writing is the ease of changing wrappers when (if) the interface to the data source changes. For most data sources, such changes are uncommon (note that this is not in reference to schema changes but to changes in the API or language used by the data source). When changes do occur, they are often additions to the existing interface, and the wrapper can continue to function as-is, only needing modification if exploiting the new feature(s) is desired. Most commercial data sources, for example, try to maintain upward compatibility in interface between one release and the next. But for some classes of sources (especially Web data sources), change is much more common.
To To deal deal with with these these sources, sources, it it is is particularly particularly desirable desirable to to have have some some non-programmatic non-programmatic or or scripted scripted way of of creating wrappers. Our Our explorations explorations into into generic wrappers that that can can be be way creating wrappers. generic wrappers easily tailored tailored will will address this concern. easily address this concern.

11.1.2 Registration


The process of using a wrapper to access a data source begins with registration, the means by which a wrapper is defined to DiscoveryLink and configured to provide access to selected collections of data managed by a particular data source. Registration consists of several steps, each taking the form of a DDL statement. Each registration statement stores configuration meta-data in system catalogs maintained by the DiscoveryLink server.
The first step in registration is to define the wrapper itself and identify the shared library that must be loaded before the wrapper can be used. The CREATE WRAPPER statement serves this purpose. BLAST [1] is a search engine for finding nucleotide or peptide sequences similar to a given pattern sequence. A wrapper for BLAST might be created as follows:
CREATE WRAPPER BlastWrapper LIBRARY 'libblast.a'

Note that a particular data source has not yet been identified, only the software required to access any data source of this kind. The next step of the registration process is to define specific data sources using the CREATE SERVER statement. If several sources of the same type are to be used, only one CREATE WRAPPER statement is needed, but a separate CREATE SERVER would be needed for each source. For a particular BLAST service, the statement might be as follows:

CREATE SERVER TBlastNServ TYPE 'tBLASTn' VERSION 2.1.2 WRAPPER BlastWrapper
OPTIONS (NODE 'myblast.bigpharma.com', PORT '2003')

This statement registers a data source that will be known to DiscoveryLink as TBlastNServ and indicates that it is to be accessed using the previously registered wrapper, BlastWrapper. It further identifies that this BLAST server is doing a tBLASTn search (i.e., comparing an amino acid sequence, the input, to a database of nucleotide sequences) and that it is using version 2.1.2 of the BLAST software. The additional information specified in the OPTIONS clause is a set of pairs (option name, option value) that are stored in the DiscoveryLink catalogs but are meaningful only to the relevant wrapper. In this case, they indicate to the wrapper that the TBlastNServ data source can be contacted via a particular Internet Protocol (IP) address and port number. In general, the set of valid option names and option values will vary from wrapper to wrapper because different data sources require different configuration information. Options can be specified on each of the registration DDL statements and provide a simple but powerful form of extensible meta-data. Because a wrapper understands the options it defines, only that wrapper can validate that the option names and values specified on a registration statement are meaningful and mutually compatible. As a result, wrappers participate in each step of the registration process and may reject, alter, or augment the option information provided in the registration DDL statement.
The third registration step is to identify, for each data source, particular collections of data that will be exposed to DiscoveryLink applications as tables. This is done using the CREATE NICKNAME statement. Collectively, these statements define the schema of each data source and form the basis of the integrated schema seen by applications.
For example, suppose there are three data sources. One is a relational database system providing data on protein targets. The second is a Web site storing information about technical publications. The third is a BLAST server that has the ability to compare an input sequence to a file of stored sequences as described previously.
For this example, three sets of CREATE NICKNAME statements are needed, one set for each of the three data sources. Figure 11.2 shows representative CREATE NICKNAME statements that define partial schemas for each source.

Protein Sequence Source Schema (Relational Database)

CREATE NICKNAME Proteins
  (protein_id varchar(30) not null,
   name       varchar(60),
   sequence   varchar(32000),
   function   varchar(100),
   diseases   varchar(256))
FOR proteindb.bio.swpdata

CREATE NICKNAME Prot_Pubs
  (prot_id varchar(30) not null,
   pub_ref varchar(10) not null)
FOR proteindb.bio.swppubs

Publications Source Schema (Web Site)

CREATE NICKNAME Pubs
  (pub_id    varchar(10) not null,
   pub_title varchar(30) not null,
   pub_date  date,
   keywords  varchar(256))
FOR SERVER pubdb
OPTIONS (URL 'http://www.pubsite.org')

CREATE FUNCTION MAPPING FOR
  contains(varchar(10), varchar(30), varchar(256))
RETURNS char(1)
FOR SERVER pubdb

BLAST Source Schema (Search Engine)

CREATE NICKNAME Protein_blast
  (query_seq  varchar(32000),
   accession  varchar(10)  options (index '1', delimit ' '),
   definition varchar(100) options (index '2', delimit '\n'),
   hsp_info   varchar(100))
FOR SERVER TBlastNServ
OPTIONS (datasource 'gbest')

FIGURE 11.2 Representative configuration statements (syntax simplified for illustration).

The protein sequence source exports two relations. The first is Proteins, with columns representing the unique identifier for a protein, the common (print) name, the amino acid sequence associated with the protein, the function of the protein, and a list of diseases with which the protein has been associated. In real life, a Database Administrator (DBA) would likely declare a fuller set of columns, representing more of the information contained in the source; the schema is simplified in the interest of space only. Also, because the data source, a relational Database Management System (DBMS), has a self-describing schema, the DBA would not actually need to put the column information in the CREATE NICKNAME statement. That information could be read from the data source catalogs automatically. The second relation exported from this source is a mapping table that maps proteins to publications that reference them. The FOR clause identifies, via a three-part name, the server, schema, and remote table referenced by the nickname. This syntax may be used with relational data sources.
Similarly, the DBA makes visible a single table, Pubs, from the publication source, for which only four columns are shown: the publication identifier, the title of the article, the date the article was published, and a list of keywords for the publication. Note that the nickname definitions give the types of attributes in terms of standard SQL data types. This represents a commitment on the part of the wrapper to translate types used by the data source to these types as necessary.
Finally, the BLAST search engine is modeled as a virtual table, indexed on the input sequence and with columns representing both input parameters and the results of the BLAST search. Again, only a subset of the schema is shown. Here are shown the input column, query_seq, and output columns for the accession number, definition, and hsp_info (the information string computed for a given high-scoring segment pair containing information about the number of nucleotides or amino acids that matched between the query and the hit sequences). Note the use of options clauses on both the CREATE NICKNAME statement and on the definition of individual columns. These give the DBA the ability to specify information needed by the wrapper. For the BLAST wrapper, the options on the individual columns tell the wrapper how to parse the BLAST defline into these columns. In this case, the defline is assumed to contain the accession number, followed by the definition, delimited by white space. (Columns whose values do not come from the defline have no parsing options specified.) The option on the overall CREATE NICKNAME tells the wrapper which data source to blast against (in this case GenBank's gbest).
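The column-level parsing options described above can be illustrated with a minimal Python sketch. It is not the BLAST wrapper's actual code: the function name `parse_defline`, the `(index, delimiter)` tuples, the convention that a delimiter of `None` means "rest of the line", and the sample defline are all assumptions made for illustration.

```python
def parse_defline(defline, column_options):
    """Split a defline into named columns.

    column_options maps a column name to (index, delimiter). Fields are
    consumed in index order; each field ends at the first occurrence of
    its delimiter. A delimiter of None (an assumption of this sketch)
    means "take everything that is left".
    """
    values = {}
    rest = defline.strip()
    ordered = sorted(column_options.items(), key=lambda kv: kv[1][0])
    for name, (_, delimiter) in ordered:
        if delimiter is None:
            field, rest = rest, ""
        else:
            field, _, rest = rest.partition(delimiter)
        values[name] = field.strip()
        rest = rest.lstrip()
    return values


# Hypothetical options mirroring the text: accession first, then definition.
options = {"accession": (1, " "), "definition": (2, None)}
# The defline below is made up; it is not a real GenBank entry.
row = parse_defline("AB012345 Homo sapiens mRNA for example protein", options)
print(row)
```

Keeping the parsing rules in per-column options, rather than in wrapper code, is what lets a DBA adapt the nickname to a different defline layout without touching the wrapper.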
Actually, the BLAST wrapper supports so many different input and output columns that part of the schema is hard-wired so a DBA does not have to re-type all the columns in the CREATE NICKNAME statement. Further details on this wrapper can be found in the IBM DB2 Life Sciences Data Connect Planning, Installation and Configuration Guide [26].
Specialized search or data manipulation capabilities of a data source also can be modeled as user-defined functions, and identifying these functions by means of CREATE FUNCTION MAPPING statements is the fourth step in registration. Thus, the definition of the publications data source in Figure 11.2 also includes a CREATE FUNCTION MAPPING statement, registering that source's function contains(A, B, C). This function returns 'Y' if the publication identified by A contains the string C in column B, for example, contains('ML546', 'keywords', 'ovarian cyst'). The mapping identifies this function to the query processor and declares its signature and return type in terms of standard SQL data types. As with nicknames, the wrapper must convert values of these types to and from the corresponding types used by the data source. This function models the underlying data source's Boolean search capability.
Finally, user mappings are defined. A user mapping tells DiscoveryLink how to connect a particular local user to a data source.
For example, if a DiscoveryLink user identified by LAURA connects to the protein database as ITNerd, using the password DLRocks, the following DDL statement might be issued:

CREATE USER MAPPING FOR LAURA SERVER proteindb
OPTIONS (REMOTE_AUTHID 'ITNerd', REMOTE_PASSWORD 'DLRocks')

With these five steps, registration is complete. The new data source is ready to use. Queries can combine data from all the registered sources and use the specialized capabilities of these sources; in the example, two techniques for modeling these special capabilities were shown: using a virtual table, as done for the BLAST source, and using a function mapping, as done for the contains function of the publications source. Note that additional sources can be added at any time without affecting the ongoing operations of the federated system. The system need not be quiesced, and existing applications and queries need not be altered. However, new queries that combine information from the preexisting sources and the new source can now be asked.
If data source schemas or functions change, they must be re-registered. DiscoveryLink currently has no mechanism to detect changes in the sources, though an application that periodically compares the DiscoveryLink and source schemas could be written.
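As a rough illustration of such a comparison application, the Python sketch below diffs two schema descriptions. How the schemas would be read from the DiscoveryLink catalogs and from the remote source is not shown; the dictionary shape, the function name `diff_schemas`, and the sample columns are assumptions, not part of DiscoveryLink.

```python
def diff_schemas(registered, remote):
    """Compare two schemas given as {table: {column: sql_type}} dicts.

    Returns a list of human-readable differences; an empty list means
    the registered nicknames still match the source.
    """
    changes = []
    for table in sorted(set(registered) | set(remote)):
        if table not in remote:
            changes.append(f"{table}: missing at source")
            continue
        if table not in registered:
            changes.append(f"{table}: not registered")
            continue
        reg_cols, rem_cols = registered[table], remote[table]
        for col in sorted(set(reg_cols) | set(rem_cols)):
            if col not in rem_cols:
                changes.append(f"{table}.{col}: dropped at source")
            elif col not in reg_cols:
                changes.append(f"{table}.{col}: new at source")
            elif reg_cols[col] != rem_cols[col]:
                changes.append(
                    f"{table}.{col}: type {reg_cols[col]} -> {rem_cols[col]}"
                )
    return changes


# Hypothetical snapshots: what was registered vs. what the source now reports.
registered = {"Pubs": {"pub_id": "varchar(10)", "pub_title": "varchar(30)"}}
remote = {"Pubs": {"pub_id": "varchar(10)", "pub_title": "varchar(200)",
                   "abstract": "clob"}}
drift = diff_schemas(registered, remote)
print(drift)
```

A non-empty result would be the cue for the DBA to re-issue the affected registration statements.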

11.2 QUERY PROCESSING OVERVIEW


Once registration is completed, the newly defined nicknames and functions can be used in queries. When an application issues a query, the DiscoveryLink server uses the meta-data in the catalogs to determine which data sources hold the requested information. Then it optimizes the query, looking for an efficient execution plan. It explores the space of possible query plans, using dynamic programming to enumerate plans for joins. The optimizer first generates plans for single table accesses, then for two-way joins, and so on. With each round of planning, the optimizer considers various join orders and join methods, and if all the tables are located at a common data source, it tries to generate plans for performing the join either at the data source or at the federated server.
Once the optimizer has chosen a plan for a query, query fragments are distributed to the data sources for execution. Each wrapper maps the query fragment it receives into a sequence of operations that make use of its data source's native programming interface and/or query language. Once the plan has been translated, it can be executed immediately or saved for later execution.
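The bottom-up enumeration described above (single-table plans first, then two-way joins, and so on) can be sketched in Python as follows. This is a toy, not the DiscoveryLink optimizer: it uses one uniform join selectivity, measures cost as the total size of intermediate results, and ignores join methods and the choice between executing a join at the source or at the federated server. The table names and cardinalities are hypothetical.

```python
from itertools import combinations


def enumerate_plans(cardinalities, selectivity):
    """Dynamic programming over sets of tables.

    cardinalities: {table: row_count}
    selectivity: fraction of the cross product kept by each join (toy model)
    Returns {frozenset_of_tables: (cost, rows, plan_string)}, holding the
    cheapest plan found for every subset of tables.
    """
    # Round 1: single-table access plans.
    best = {frozenset([t]): (0.0, float(n), t) for t, n in cardinalities.items()}
    tables = sorted(cardinalities)
    # Round k: build k-way joins from the best smaller plans.
    for size in range(2, len(tables) + 1):
        for subset in combinations(tables, size):
            target = frozenset(subset)
            for k in range(1, size):
                for left in combinations(subset, k):
                    lkey = frozenset(left)
                    rkey = target - lkey
                    lcost, lrows, lplan = best[lkey]
                    rcost, rrows, rplan = best[rkey]
                    rows = lrows * rrows * selectivity
                    cost = lcost + rcost + rows  # toy cost: intermediate sizes
                    if target not in best or cost < best[target][0]:
                        best[target] = (cost, rows, f"({lplan} JOIN {rplan})")
    return best


plans = enumerate_plans({"small": 10, "medium": 1000, "large": 100000}, 0.01)
full_plan = plans[frozenset(["small", "medium", "large"])][2]
print(full_plan)
```

Because each round reuses the best plan already found for every smaller set of tables, the enumeration avoids re-costing the same sub-join for each larger plan that contains it.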
The DiscoveryLink server's execution engine is pipelined and employs a fixed set of functions (open, fetch, close) that each wrapper must implement to control the execution of a query fragment. When accepting parameters from the server or returning results, the wrapper is responsible for converting values from the data source type system to DiscoveryLink's SQL-based type system.
DiscoveryLink includes a full database engine that can execute arbitrary (DB2) SQL queries. Features useful for life sciences applications include support for long data types (e.g., Binary Large Object [BLOB], Character Large Object [CLOB]) and user-defined functions. Applications also benefit from the ability to update information at relational data sources via SQL statements submitted to DiscoveryLink (and in the future, full transaction management for data sources that comply with the X/Open XA-interface specification), the ability to invoke stored procedures that reference nicknames, and the ability to use DiscoveryLink DDL statements to create new data collections at relational data sources. Another feature allows certain queries to be answered using pre-materialized automatic summary tables stored by DiscoveryLink, with little or no access to the data sources themselves. Joins, subqueries, table expressions, aggregation, statistical functions, and many other SQL constructs are supported against data, whether the data is locally stored or retrieved from remote data sources.
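The open, fetch, close protocol mentioned earlier in this section can be sketched with a toy wrapper over an in-memory list. The real wrapper interface is a C++ API whose signatures are not given in this chapter, so the class `ListWrapper`, the helper `run_fragment`, the predicate argument, and the use of `None` as an end-of-data marker are all assumptions for illustration.

```python
class ListWrapper:
    """Toy wrapper over an in-memory list, exposing open/fetch/close."""

    def __init__(self, rows):
        self._rows = rows
        self._cursor = None

    def open(self, predicate=None):
        # Start the fragment; rows are filtered lazily, one per fetch.
        self._cursor = (
            r for r in self._rows if predicate is None or predicate(r)
        )

    def fetch(self):
        # Return the next row, or None when the fragment is exhausted.
        return next(self._cursor, None)

    def close(self):
        # Release whatever the fragment was holding.
        self._cursor = None


def run_fragment(wrapper, predicate=None):
    """Drive a wrapper the way a pipelined engine might: one row per fetch."""
    wrapper.open(predicate)
    try:
        while (row := wrapper.fetch()) is not None:
            yield row
    finally:
        wrapper.close()


# Hypothetical publication rows.
pubs = ListWrapper([{"pub_id": "P1", "year": 1994},
                    {"pub_id": "P2", "year": 1997}])
recent = list(run_fragment(pubs, lambda r: r["year"] > 1995))
print(recent)
```

Pulling one row at a time means the engine can start joining or filtering results before the source has produced its full answer, which is the point of a pipelined design.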

11.2.1 Query Optimization


During the planning process, the DiscoveryLink server takes into account the query processing power of each data source. As it identifies query fragments to be performed at a data source, it must ensure that the fragments are executable by that source. If a fragment cannot be performed by the source, the optimizer builds a plan to compensate for the missing function by doing that piece of work in the DiscoveryLink server. For example, if the data source does not do joins, but it is necessary to join together data from two nicknames at that source, the data will be retrieved from both nicknames (typically after restricting it with any predicates the source can apply), and then joined by DiscoveryLink.
The DiscoveryLink server has two ways of obtaining information about query processing power. Wrappers provided by IBM for relational data sources (and for other sources that are similar to a relational source in function) provide a server attributes table (SAT). The SAT contains a long list of parameters that are set to appropriate values by the wrapper. For example, if the parameter PUSHDOWN is set to "N", DiscoveryLink will not request that the data source perform query fragments more complex than:
SELECT <column_list> FROM <nickname>

Note: In this chapter, SQL is used as a concise way of expressing the work to be done by a remote data source. This work is actually represented internally by various data structures for efficient data processing.
If PUSHDOWN is set to 'Y', more complex requests may be generated, depending on the nature of the query and the values of other SAT parameters. For example, if the value of the BASIC_PRED parameter in the SAT is 'Y', requests may include predicates such as:

... WHERE pub_date > '12/31/1995'

The parameter MAX_TABS is used to indicate a data source's ability to perform joins. If it is set to 1, no joins are supported. Otherwise MAX_TABS indicates the maximum number of nicknames that can appear in a single FROM clause of the query fragment to be sent to the data source.
Information about the cost of query processing by a data source is supplied to the DiscoveryLink optimizer in a similar way, using a fixed set of parameters such as CPU_RATIO, which is the speed of the data source's processor relative to the one hosting the DiscoveryLink server. Additional parameters, such as average number of instructions per invocation and average number of Input/Output (I/O) operations per invocation, can be provided for data source functions defined to DiscoveryLink with function mappings, as can statistics about tables defined as nicknames. Once defined, these parameters and statistics can be easily updated whenever necessary.
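To show how a few of these parameters might gate what is sent to a source, here is a minimal Python sketch. It covers only the three parameters named in the text (PUSHDOWN, BASIC_PRED, MAX_TABS) and reads them exactly as described there; the function name `can_push_down` and the dictionary shape used for a query fragment are assumptions, and the actual SAT is far larger.

```python
def can_push_down(fragment, sat):
    """Decide whether a query fragment may be sent to the data source.

    fragment: {"tables": [...], "predicates": [...]} (hypothetical shape)
    sat: server attribute values, e.g.
         {"PUSHDOWN": "Y", "BASIC_PRED": "Y", "MAX_TABS": 1}
    """
    tables = fragment["tables"]
    predicates = fragment.get("predicates", [])
    if sat.get("PUSHDOWN", "N") != "Y":
        # Only SELECT <column_list> FROM <nickname> may be requested.
        return len(tables) == 1 and not predicates
    if predicates and sat.get("BASIC_PRED", "N") != "Y":
        return False
    # MAX_TABS bounds the nicknames in one FROM clause; 1 means no joins.
    return len(tables) <= sat.get("MAX_TABS", 1)


sat = {"PUSHDOWN": "Y", "BASIC_PRED": "Y", "MAX_TABS": 1}
scan = {"tables": ["Pubs"], "predicates": ["pub_date > '12/31/1995'"]}
join = {"tables": ["Proteins", "Prot_Pubs"], "predicates": []}
print(can_push_down(scan, sat), can_push_down(join, sat))
```

When the check fails, as it does for the join here, the optimizer would fetch the rows from each nickname separately and perform the join in the DiscoveryLink server.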
This approach has proven satisfactory for describing the query processing capabilities and costs of the relational database engines supported by DiscoveryLink, although even for these superficially similar sources, a large set (hundreds) of parameters is needed. However, it is difficult to extend this approach to more idiosyncratic data sources. Web servers, for example, may be able to supply many pieces of information about some entity, but frequently they will only allow certain attributes to be used as search criteria. This sort of restriction is difficult to express using a fixed set of parameters. Similarly, the cost of executing a query fragment at a data source may not be easily expressed in terms of fixed parameters if, for example, the cost depends on the value of an argument to a function. For instance, a BLAST function asked to do a BLASTp comparison against a moderate amount of data will return in seconds, whereas if it is asked to do a tBLASTn comparison against a large dataset, it may need hours.

The solution, validated in the Garlic prototype, is to involve the wrappers directly in the planning of individual queries. Instead of attempting to model the behavior of a data source using a fixed set of parameters with statically determined values, the DiscoveryLink server will generate requests for the wrapper to process specific query fragments. In return, the wrapper will produce one or more wrapper plans, each describing a specific portion of the fragment that can be processed, along with an estimate for the cost of computing the result and its estimated size.
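The exchange just described can be pictured with a small Python sketch. It assumes a hypothetical source that can search on only a few attributes, one at a time; the names WrapperPlan, SEARCHABLE, and plan_fragment, and all the numbers, are invented for illustration and do not reflect the actual wrapper interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WrapperPlan:
    """What a wrapper reports back for one way of handling a query fragment."""
    predicates_handled: tuple  # the portion of the fragment the source will evaluate
    est_cost_seconds: float    # estimated cost of computing the result
    est_rows: int              # estimated result size


# Hypothetical source description: searchable attribute -> (cost in seconds, rows).
SEARCHABLE = {"pub_date": (2.0, 400), "keyword": (10.0, 900)}


def plan_fragment(requested_predicates):
    """Return one plan per requested predicate that the source can apply on its own."""
    plans = []
    for pred in requested_predicates:
        if pred in SEARCHABLE:
            cost, rows = SEARCHABLE[pred]
            plans.append(WrapperPlan((pred,), cost, rows))
    # An empty list means the source cannot help with this fragment at all.
    return plans
```

In this sketch the restriction "only certain attributes may be used as search criteria" lives in the wrapper's own code rather than in a server-side parameter table.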

11.2.2 An Example


Voltage-sensitive calcium channel proteins mediate the entry of calcium ions into cells and are involved in such processes as neurotransmitter release. They respond to electric changes, which are a prominent feature of the neural system. The discovery of a novel gene that codes for a calcium channel protein would potentially be of great interest to pharmaceutical researchers seeking new drug targets for a neuropsychological disease. A popular method of novel gene discovery is to search Expressed Sequence Tag (EST) databases for (expressed) sequences similar to known genes or proteins. For example, a scientist with access to the data sources just described might like to see the results of the following query:

"Return accession numbers and definitions of EST sequences that are similar (60% identical over 50 amino acids) to calcium channel sequences in the protein data source that reference papers published since 1995 mentioning 'brain'."
The hsp_info column holds a condensed form of the equivalent data in the XML specification for BLAST provided by the National Center for Biotechnology Information. But to answer the above query, one needs direct access to the percentage of identities within the hsp alignment and the length of that alignment. Assume that two user-defined functions are defined to extract this information from the hsp_info string:

    CREATE FUNCTION percent_identity (varchar(100))
    RETURNS float EXTERNAL NAME 'hsp_info.a'

and

    CREATE FUNCTION align_length (varchar(100))
    RETURNS integer EXTERNAL NAME 'hsp_info.a'


Then, this request can be expressed as a single SQL statement that combines data from all three data sources:

    SELECT b.name, c.accession, c.definition, a.pub_id
    FROM Pubs a, Proteins b, Protein_blast c, Prot_Pubs d
    WHERE a.pub_id = d.pub_ref
    AND d.prot_id = b.protein_id
    AND b.sequence = c.Query_seq
    AND b.function = 'calcium channel'
    AND a.pub_date > '12/31/1995'
    AND contains(a.pub_id, 'Keyword', 'brain') = 'Y'
    AND percent_identity(c.hsp_info) > 0.6
    AND align_length(c.hsp_info) > 50

Many possible evaluation plans exist for this query. One plan is shown in Figure 11.3. In this figure, each box represents an operator. The leaves represent actions at a data source. Because DiscoveryLink does not model the details of those actions, each action is modeled as a single operator, even if it might involve a series of operators at the source. (For relational sources, DiscoveryLink does in fact model the individual operators, but to simplify the figures, details are omitted. Thus the join of Proteins and Prot_Pubs is modeled in the figures as a single operator.) For non-relational sources, DiscoveryLink would not know whether a logical join action was an actual join or whether, in fact, the data was stored in a nested, prejoined format. Nor does DiscoveryLink know whether the data are scanned and then predicates applied, or there is an indexed access, and so on. Instead, DiscoveryLink keeps track of the work that has been done by recording the properties of each operator. The properties include the set of columns available (C), the set of tables accessed (T), and the set of predicates applied (P), as shown in Figure 11.3. Non-leaf nodes represent individual operations at the DiscoveryLink server. The optimizer models these local operations separately.

FIGURE 11.3 One evaluation plan for the query.

This plan first accesses the protein data source, retrieving protein names and sequences and corresponding publication identifiers, for all proteins that serve as calcium channels. This information is returned to DiscoveryLink, where the bindjoin operator sends the publication references to the publications source one at a time. At the publications source, these publication identifiers are used to find relevant publications, and those publications are further checked for compliance with the query restrictions on keyword and pub_date. For those publications that pass all the tests, the identifier is returned to DiscoveryLink. There, the second bindjoin operator sends the sequence for any surviving proteins to BLAST, where they are compared against gbest, and the results are returned to DiscoveryLink, where each hsp_info is analyzed to see if the sequence is sufficiently similar.

A second, superficially similar plan is shown in Figure 11.4. In this plan, the publications with appropriate dates and keywords are sent to DiscoveryLink, where a hash table is built. The data from the protein data source are also sent to DiscoveryLink and used to probe the hash table. Matches are passed to the bindjoin operator, which BLASTs the sequences against gbest, then returns them to DiscoveryLink to check the quality of the match.

It is not obvious which plan is best. The first plan results in one query of the protein database, but many queries (one for each qualified protein) of the publications database. The second plan only queries each of these sources once, but potentially returns many publication entries for proteins that will not qualify.

Either of these plans is likely to be better than the one shown in Figure 11.5. In this plan, the protein data is extracted first and all calcium channel proteins are BLASTed against gbest, regardless of what publications they reference.
DiscoveryLink then filters the sequences using the similarity criterion, and the remaining proteins are passed to the nested loop join operator. This join compares each protein's referenced publications with a temporary table created by storing in DiscoveryLink those recent publications that discuss the brain. This plan could only win if there were very few recent publications with brain as a keyword (so the cost of the query to make the temporary table is small), and yet virtually every calcium channel protein in the protein database referenced at least one of them (so there is no benefit to doing the join of proteins with publications early). While that is unlikely for this example, if there were a more restrictive set of predicates (e.g., a recently discovered protein of interest and papers within the last two months), this plan could, in fact, be a sensible one.

FIGURE 11.4 A second query evaluation plan.
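The difference between the bindjoin used in the first plan and the hash join used in the second can be sketched in a few lines of Python. This is a minimal in-memory illustration, not DiscoveryLink's implementation: rows are plain dictionaries, and probe_source is a hypothetical stand-in for a request sent to a remote data source.

```python
def bind_join(outer_rows, key, probe_source):
    """Pair each outer row with the rows the remote source returns for its join value."""
    for row in outer_rows:
        # One request to the remote source per outer row.
        for match in probe_source(row[key]):
            yield {**row, **match}


def hash_join(build_rows, build_key, probe_rows, probe_key):
    """Build a hash table on one input, then probe it with each row of the other."""
    table = {}
    for row in build_rows:
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**row, **match}
```

Both functions produce the same joined rows for the same inputs. What differs is the traffic: the bindjoin issues one remote request per outer row, while the hash join fetches each input once and does the matching locally, which is the trade-off weighed between the first two plans.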

11.2.3 Determining Costs


Accurately determining the cost of the various possible plans for this or any query is difficult for several reasons. One challenge is estimating the cost of evaluating the wrapper actions. For example, the DiscoveryLink engine has no notion of what must actually be done to find similar sequences or how the costs will vary depending on the input parameters (the bound columns). For BLAST, the actual algorithm used can change the costs dramatically, as can the data set being searched. As a second challenge, the query processor has no way to estimate the number of results that may be returned by the data sources. While the wrapper could, perhaps, provide some statistics to DiscoveryLink, purely relational statistics may not be sufficient. For example, cardinalities, as well as costs, for search engines like BLAST may vary depending on the inputs. A third challenge is to estimate the cost of the functions in the query. The costing parameters maintained by relational wrappers in DiscoveryLink for a function implemented by a data source include a cost for the initial invocation and a per-row cost for each additional invocation. However, the only way to take the value of a function argument into account is through a cost adjustment based on the size of the argument value in bytes. While this may be acceptable for simple functions like percent_identity and align_length, it is unlikely, in general, to give accurate results. For example, if contains actually has to search in different ways depending on the type of the column passed as an argument (e.g., a simple scan for keyword but an index lookup for the paper itself), the cost parameters must be set to reflect some amalgamation of all the search techniques. A simple case statement, easily written by the wrapper provider, could model the differences and allow more sensible choices of plans. While the costs of powerful functions in some cases can be hard to predict, many vendors do, in fact, know quite a bit about the costs of their functions. They often model costs themselves to improve their systems' performance.

FIGURE 11.5 A third plan for the query.

The challenges of accurately estimating costs are met by letting the wrapper examine possible plan fragments to provide information about what the data source can do and how much it will cost. Consider our example query once again. During the first phase of optimization, when single-table access plans are being considered, the publications database will receive the following fragment for consideration (again, query fragments are represented in SQL; the actual wrapper interface uses an equivalent data structure that does not require parsing by the wrapper).
    SELECT a.pub_id, a.pub_date
    FROM Pubs a
    WHERE a.pub_date > '12/31/1995'
    AND contains(a.pub_id, 'keyword', 'brain') = 'Y'

Assume that, in a single operation, the publications database can apply either the predicate on publication date or the contains predicate, but not both. Many Web sites can handle only a single predicate at a time or only restricted combinations. (Note: In the previous illustrative plans, it was assumed that the publications database could do both. Either assumption might be true. This one is adopted here to illustrate the point.) Further assume that it is possible to invoke the contains function separately later (this is like asking a new, very restrictive query of the Web site). Many Web sites do allow such follow-on queries to retrieve additional information about an object or do some more complex computation.

The wrapper might return two wrapper plans for this fragment. The first would indicate that the data source could perform the following portion of the fragment:

    SELECT a.pub_id, a.pub_date
    FROM Pubs a
    WHERE a.pub_date > '12/31/1995'

with an estimated execution cost of 3.2 seconds and an estimated result size of 500 publications (in reality, of course, the result size would be much bigger). To estimate the total cost of the query fragment using this wrapper plan, the DiscoveryLink optimizer would add to the cost for the wrapper plan the cost of invoking the contains function on each of the 500 publications returned. If each invocation costs a second (because of the high overhead of going out to the World Wide Web), the total cost of this portion of the query, using this plan, would be 503.2 seconds.

The second wrapper plan would indicate that the data source could perform the following portion of the fragment:

    SELECT a.pub_id, a.pub_date
    FROM Pubs a
    WHERE contains(a.pub_id, 'keyword', 'brain') = 'Y'

with an estimated execution cost of 18 seconds and an estimated result size of 1000 publications (i.e., entries for all the publications in the database with the keyword brain). To compute the total cost in this case, the optimizer would augment the cost for the wrapper plan with the cost of using the DiscoveryLink engine to apply the predicate on publication date to each of the 1000 publications. If filtering one publication takes 1/100 of a second, the total cost for this portion of the query, using this plan, would be 28 seconds, a clear winner.

Wrappers participate in query planning in the same way during the join enumeration portion of optimization. In the example, the wrapper might be asked to consider the following query fragment:

    SELECT a.pub_id, a.pub_date
    FROM Pubs a
    WHERE a.pub_date > '12/31/1995'
    AND contains(a.pub_id, 'keyword', 'brain') = 'Y'
    AND a.pub_id = :H0

This is essentially a single-table access, but the third predicate would not be considered during single-table access planning because the value being compared to pub_id comes from a different table. For each pub_id produced by the rest of the query (represented above by the host variable :H0), the publications database is asked to find the important properties of the corresponding publication, if it matches the other criteria. As before, the wrapper would return one or more plans and indicate in each one which of the predicates would be evaluated.

Only a few of the plans that DiscoveryLink would consider in optimizing this query were shown. The goal was not to give an exhaustive list of alternatives, but rather to illustrate the process. As well, the chapter has demonstrated the critical role an optimizer plays for complex queries. It is neither obvious nor intuitive which plan will ultimately be the best; the answer depends on many factors including data volumes, data distributions, the speeds of different processors and network connections, and so on. Simple heuristics generally cannot arrive at the right answer. Only a cost-based process with input on specific data source characteristics can hope to choose the right plans for the vast array of possible queries.

As a wrapper may be asked to consider many query fragments during the planning of a single query, it is important that communication with the wrapper be efficient. This is achieved easily in DiscoveryLink because the shared library that contains a wrapper's query planning code is loaded on demand into the address space of the DiscoveryLink server process handling the query. The overhead for communicating with a wrapper is, therefore, merely the cost of a local procedure call.

This approach to query planning has many benefits. It is both simple and extremely flexible.
Instead Instead of of using using an an ever-expanding ever-expanding set set of of parameters parameters to to invest invest the the DiscoveryLink DiscoveryLink server server with with detailed detailed knowledge knowledge of of each each data data source's source's capabil capabilities, ities, this this knowledge knowledge resides resides where where it it falls falls more more naturally, naturally, in in the the wrapper wrapper for for the the source question. This source in in question. This allows allows to to exploit exploit the the special special functionality functionality of of the the underly underlying the BLAST by modeling ing source, source, as as was was done done for for the BLAST server server ((by modeling the the search search algorithm algorithm as table) and (using a as a a virtual virtual table) and the the publications publications source source (using a template template function). function). The The wrapper only responds requests in wrapper only responds to to specific specific requests in the the context context of of a a specific specific query. query. As As the previous examples the previous examples have have shown, shown, sources sources that that only only support support searches searches on on the the values fields or values of of certain certain fields or on on combinations combinations of of fields fields are are easily easily accommodated. accommodated. In a a similar similar way, one can can accommodate sources that only sort sort results results under under In way, one accommodate sources that can can only certain or can can only only perform perform certain certain circumstances circumstances or certain computations computations in in combination combination with others. others. 
Because Because a a wrapper needs only to respond respond to to a a request request with with a a single single with wrapper needs only to plan, or plan, or in in some some cases cases no no plan plan at at all, all, it it is is possible possible to to start start with with a a simple simple wrapwrap per that that evolves evolves to to reflect reflect more more of of the the underlying underlying data data source's source's query query processing processing per power. power. This approach approach to to query query planning planning need need not not place place too too much much of of a a burden burden on on This the wrapper wrapper writer, writer, either. either. In In a a paper paper presented presented at at the the annual annual conference conference on on very very the large databases databases [27], [27], Roth Roth et et al. al. showed showed that that it it is is possible possible to to provide provide a a simple simple large default cost cost model model and and costing costing functions functions along along with with a a utility utility to to gather gather and and update update default all necessary necessary cost cost parameters. parameters. The The default default model model did did an an excellent excellent job job of of modeling modeling all simple data data sources sources and and did did a a good good job job predicting predicting costs, costs, even even for for sources sources that that simple could apply apply quite quite complex complex predicates. predicates. This This same same paper paper further further showed showed that that even even could an approximate approximate cost cost model model dramatically dramatically improved improved the the choice choice of of plans plans over over no no an information or or fixed fixed default default values values [27]. [27]. 
Therefore, Therefore, it it is is believed believed that that this this method method information of query query planning planning is is not not only only viable, viable, but but necessary. necessary. With With this this advanced advanced system system of for optimization, DiscoveryLink DiscoveryLink has has the the extensibility, extensibility, flexibility, flexibility, and and performance performance for optimization, required required to to meet meet the the needs needs of of life life sciences sciences applications. applications.
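
The planning protocol described above, in which a wrapper answers each request with one plan or none and the server picks the cheapest overall alternative, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the cost constants, and the two-predicate example are all hypothetical and are not taken from DiscoveryLink itself.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Iterable, Optional, Tuple


@dataclass(frozen=True)
class WrapperPlan:
    """A wrapper's offer for one query fragment (hypothetical structure)."""
    pushed: FrozenSet[str]   # predicates the source agrees to evaluate
    source_cost: float       # wrapper's estimate of the work done at the source
    rows_returned: int       # wrapper's estimate of the rows shipped back


class ToyWrapper:
    """Hypothetical wrapper for a source that can search on one field only."""
    TOTAL_ROWS = 100_000
    SCAN_COST = 110.0                                  # assumed cost units
    SELECTIVITY_PCT = {"organism": 10, "length": 50}   # assumed % of rows kept
    SEARCHABLE = frozenset({"organism"})

    def plan(self, predicates: FrozenSet[str]) -> Optional[WrapperPlan]:
        # Decline (return no plan) if the fragment exceeds the source's abilities.
        if not predicates <= self.SEARCHABLE:
            return None
        rows = self.TOTAL_ROWS
        for name in predicates:
            rows = rows * self.SELECTIVITY_PCT[name] // 100
        return WrapperPlan(predicates, self.SCAN_COST, rows)


def choose_plan(
    wrapper: ToyWrapper,
    predicates: Iterable[str],
    transfer_cost_per_row: float = 0.01,   # assumed network cost
    filter_cost_per_row: float = 0.001,    # assumed cost of filtering at the server
) -> Optional[Tuple[float, WrapperPlan]]:
    """Ask the wrapper about every subset of predicates and keep the cheapest offer."""
    wanted = frozenset(predicates)
    best = None
    for size in range(len(wanted) + 1):
        for subset in combinations(sorted(wanted), size):
            offer = wrapper.plan(frozenset(subset))
            if offer is None:
                continue
            # Predicates the source did not take must be applied by the server.
            leftover = len(wanted - offer.pushed)
            total = (offer.source_cost
                     + offer.rows_returned * transfer_cost_per_row
                     + offer.rows_returned * filter_cost_per_row * leftover)
            if best is None or total < best[0]:
                best = (total, offer)
    return best
```

Calling `choose_plan(ToyWrapper(), ["organism", "length"])` selects the plan that pushes only the `organism` predicate to the source. The server never needs to know why the wrapper declined the other fragments, which is the point of keeping capability knowledge inside the wrapper.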

11.3 EASE OF USE, SCALABILITY, AND PERFORMANCE

DiscoveryLink provides a flexible platform for building life sciences applications. It is not intended for the scientist, but rather for the application programmer, an IT worker, or a vendor who creates the tools that the actual scientists will use. While it provides only a simple user interface, it supports multiple programming interfaces, including such de facto industry standards as ODBC and JDBC. It, therefore, can be used with any commercially available tool that supports these interfaces, including popular query builders, such as those by Brio or Cognos, application-building frameworks such as VisualAge from IBM, or industry-specific applications including LabBook, Spotfire, and so on. Alternatively, in-house applications can be developed that meet the needs of specific organizations. IBM has a number of business partners who are including DiscoveryLink in their offerings to create a more complete scientific workbench for their customers.

Database Administrators will also be users of DiscoveryLink. For these, DiscoveryLink has a Graphical User Interface (GUI) to help with the registration process.

Yet life sciences applications require more support than this. Complete integration also requires the development of tools to bridge between different models of data. The life sciences research community is not a homogeneous one. Different groups use different terms for the same concept or describe different concepts similarly. Semantic mappings must be created, and applications for particular communities must be developed. DiscoveryLink provides features that help with these tasks, but it does not solve either. For example, views can help with the problems of semantic integration by hiding mappings from one data representation to another, but the views still must be created manually by the DBAs.
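
As a small illustration of how a view can hide a mapping between two data representations, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a federated database. The table names, column names, accession values, and units are all invented for the example; in DiscoveryLink the two tables would be nicknames for remote sources rather than local tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two hypothetical sources that describe the same concept differently:
    -- different column names, and mass in daltons versus kilodaltons.
    CREATE TABLE lab_a_proteins (accession TEXT, mass_da REAL);
    CREATE TABLE lab_b_proteins (prot_id TEXT, mass_kda REAL);
    INSERT INTO lab_a_proteins VALUES ('P12345', 52000.0);
    INSERT INTO lab_b_proteins VALUES ('Q67890', 48.5);

    -- The view hides the mapping: one name and one unit (kDa) for both sources.
    CREATE VIEW proteins AS
        SELECT accession AS protein_id, mass_da / 1000.0 AS mass_kda
        FROM lab_a_proteins
        UNION ALL
        SELECT prot_id, mass_kda
        FROM lab_b_proteins;
""")


def proteins_heavier_than(kda):
    """Query the integrated view without knowing how each source stores mass."""
    cursor = conn.execute(
        "SELECT protein_id, mass_kda FROM proteins "
        "WHERE mass_kda > ? ORDER BY protein_id",
        (kda,),
    )
    return cursor.fetchall()
```

An application that calls `proteins_heavier_than(50.0)` sees a single vocabulary, but someone still had to write the `CREATE VIEW` statement by hand, which is the manual DBA effort the text refers to.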

Another characteristic of life sciences data and research environments is frequent change, both in the amounts of data and in the schemas in which data is stored (causing more work for DBAs). Further, new sources of information are always appearing as new technologies and informatics companies evolve. In such an environment, flexibility is essential. DiscoveryLink's powerful query processor and non-procedural SQL interface protect applications (to the extent possible) from changes in the underlying data source via the principle of logical data independence. New sources of information require a new server definition, however, and perhaps a new wrapper, and may also require adjusting view definitions to reference their data. Changes in a data source's interfaces often can be hidden from the application by modifying the translation portion of the wrapper or installing a new wrapper with the new version of the source. The query processing technology is built to handle complex queries and to scale to terabytes of data. Thus, the database middleware concept itself allows DiscoveryLink to deal with the changes in this environment, but it puts a burden on the DBA to administer these changes.

Wrapper writers are a third group of users. The wrapper architecture has been designed for extensibility. Only a small number of functions need to be written to create a working wrapper. Simple sources can be wrapped quickly, in a week or two; more complex sources may require from a few weeks to a few months to completely model, but even for these a working wrapper, perhaps with limited functionality, can be completed quickly. Template code for each part of the wrapper and default cost modeling code are provided for wrapper writers. Wrappers are built to enable as much sharing of code as possible, so that one wrapper can be written to handle multiple versions of a data source, and so that wrappers for similar sources can build on existing wrappers.
The ability to separate schema information from wrapper code means that changes in the schema of a data source require no code changes in the wrappers. The addition of a new data source requires no change to any existing wrappers. Thus, the wrappers also help the system adapt to the many changes possible in the environment, and the wrapper architecture eases the wrapper writer's task.

Scalability is a fundamental goal of DiscoveryLink. There is no a priori limit to the number of different sources it can handle, because sources are independent and consume little in the way of system resources when not in use. (Wrapper code is loaded dynamically; when not in use, the only trace of the source is a set of catalog entries.) There may be limitations in practice if many sources of different types are used at the same time, depending on how much memory is available.
This is akin to the limits on query complexity in relational database management systems today, which are not typically hit until several hundred tables are used in the same query. Because DiscoveryLink is built on robust and scalable relational database technology, there should also be no a priori limit on the amount of data the system can handle. Because the data are left in the native stores until needed, they can still be updated by directly modifying those stores. (That is, updates do not need to go through DiscoveryLink, though it may be convenient to do so for relational data sources.) In this case, the update rate is only limited by the update rate of the data sources; DiscoveryLink is not a bottleneck.

As with all database management systems, DiscoveryLink needs to be able to handle complex queries over large volumes of data swiftly and efficiently.
For DiscoveryLink, this task is further complicated by the fact that much, if not all, of the data resides in other data sources, which may be distributed over a wide geographic area. Query optimization, which is described in this chapter, is the main tool DiscoveryLink uses to ensure good performance. There are other aspects of the system that also help. For example, before optimization, the query is passed to a rewrite engine. This engine applies a variety of transformations that can greatly improve the ultimate performance. Transforms can, for example, eliminate unnecessary operations such as sorts or even joins. Others can derive new predicates that restrict operations or allow the use of a different access path, again enhancing performance. In addition to query rewrite, wrappers are carefully tuned to use the most efficient programming interfaces provided by the source (e.g., taking advantage of bulk reads and writes to efficiently transport data between sources). Additional constructs such as automatic summary tables (materialized views over local and/or remote data that can be automatically substituted into a query to save remote data access and re-computation) provide a simple form of caching.

How well does the system perform?
There are no benchmarks yet for this style of federated data access, and IBM experience to date is limited to a few customers and some experiments in IBM's lab. Still, some statements can be made. For example, it is known that DiscoveryLink adds little if any overhead. A simple experiment compares queries that can be run against a single source with the same query submitted against a DiscoveryLink nickname for that source. In most cases, the native performance and performance via DiscoveryLink are indistinguishable [8]. In a few cases, due either to the sophisticated rewrite engine or just the addition of more hardware power, performance using this three-tiered approach (client→DiscoveryLink→source) is better than performance using the source directly (client→source). This has been borne out by repeated experiments on both customer and standard TPC-H workloads. For queries that involve data from multiple sources, it is harder to make broad claims, as there is no clear standard for comparison.
The overall experience so far shows that performance depends heavily on the complexity of the query and the amounts of data that must be transported to complete the query. Overall, performance seems to be meeting customers' needs; that is, it is normally good enough that they do not shy away from distributed queries and often are not even aware of the distribution. There are areas for improvement, however, including better exploitation of parallelism when available and some form of automated caching.
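
One of the rewrite transformations mentioned earlier in this section, deriving new predicates that restrict operations, can be illustrated with a short Python sketch. It propagates constant predicates across equality joins, so that a restriction stated on one source can also be pushed to another. The function and the column names are hypothetical, and this is not a description of DiscoveryLink's actual rewrite engine.

```python
def derive_predicates(equalities, constants):
    """Propagate constant restrictions across equality (join) predicates.

    equalities: list of (column_a, column_b) pairs meaning column_a = column_b
    constants:  dict mapping a column to the constant it is compared with
    Returns a new dict that also contains every derived column = constant predicate.
    """
    derived = dict(constants)
    changed = True
    while changed:                      # repeat until no new predicate appears
        changed = False
        for a, b in equalities:
            for known, other in ((a, b), (b, a)):
                if known in derived and other not in derived:
                    derived[other] = derived[known]
                    changed = True
    return derived
```

Given the join predicate `genes.id = hits.gene_id` and the restriction `genes.id = 42`, the function also yields `hits.gene_id = 42`. An optimizer could then send that restriction to the source holding `hits`, reducing the rows that must be transported.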

11.4 CONCLUSIONS

This chapter described IBM's DiscoveryLink offering. DiscoveryLink allows users to query data that may be physically stored in many disparate, specialized data stores as if that data were all co-located in a single virtual database. Queries against this data may exploit all of the power of SQL, regardless of how much or how little SQL function the data sources provide. In addition, queries may employ any additional functionality provided by individual data stores, allowing users the best of both the SQL and the specialized data source worlds. A sophisticated query optimization facility ensures that the query is executed as efficiently as possible. The interfaces, performance, and scalability of DiscoveryLink were also discussed.

DiscoveryLink is a new offering, but it is based on a fusion of well-tested technologies from DB2 Universal Database (UDB), DB2 DataJoiner, and the Garlic research project. Both DB2 UDB (originally DB2 Client/Server [C/S]) and DB2 DataJoiner have been available as products since the early 1990s, and they have been used by thousands of customers in the past decade. The Garlic project began in 1994, and much of its technology was developed as the result of joint studies with customers, including an early study with Merck Research Laboratories.
DiscoveryLink's extensible wrapper architecture and the interactions between wrapper and optimizer during query planning derive from Garlic. As part of Garlic, wrappers were successfully built and queried for a diverse set of data sources, including two relational database systems (DB2 and Oracle), a patent server stored in Lotus Notes, searchable sites on the World Wide Web (including a database of business listings and a hotel guide), and specialized search engines for collections of images, chemical structures, and text.

Currently, IBM is working on building a portfolio of wrappers specific to the life sciences industry. In addition to key relational data sources such as Oracle and Microsoft's SQL Server, wrappers are available for application sources such as BLAST and general sources of interest to the industry such as Microsoft Excel, flat files, Documentum for text management, and XML. IBM is also working with key industry vendors to wrap the data sources they supply.
This will provide access to many key biological and chemical sources. While wrappers will be created as quickly as possible, it is anticipated that most installations will require one or more new wrappers to be created because of the sheer number of data sources that exist and the fact that many potential users have their own proprietary sources as well. Hence, a set of tools for writing wrappers is being developed, and a staff of wrapper writers is being trained who will be able to build new wrappers as part of the DiscoveryLink software and services offering model. As DiscoveryLink supports the SQL/MED standard [25] for accessing external data sources, those who would rather create their own wrappers (customers, universities, and business partners) may do so, too. Hopefully, in this way a rich set of wrappers will quickly become available for use with DiscoveryLink.

From the preceding pages, hopefully it is clear that DiscoveryLink plays an essential role in integrating life science data. DiscoveryLink provides the plumbing, or infrastructure, that enables data to be brought together, synthesized, and transformed. This plumbing provides a high-level interface, a virtual database against which sophisticated queries can be posed and from which results are returned with excellent performance. It allows querying of heterogeneous collections of data from diverse data sources without regard to where they are stored or how they are accessed.

While not a complete solution to all heterogeneous data source woes, DiscoveryLink is well suited to the life sciences environment. It serves as a platform for data integration, allowing complex cross-source queries and optimizing them for high performance. In addition, several of its features can help in the resolution of semantic discrepancies by providing mechanisms DBAs can use to bridge the gaps between data representations. Finally, the high-level SQL interface and the flexibility and careful design of the wrapper architecture make it easy to accommodate the many types of change prevalent in this environment.

Of course, there are plenty of areas in which further research is needed. For the query engine, key topics are the exploitation of parallelism to enhance performance and richer support for modeling of object features in foreign data sources.
There is also a need for additional tools and facilities that enhance the basic DiscoveryLink offering. Some preliminary work was done on a system for data annotation that provides a rich model of annotations, while exploiting the DiscoveryLink engine to allow querying of annotations and data separately and in conjunction. A tool is also being built to help users create mappings between source data and a target, integrated schema [28, 29] to ease the burden of view definition and reconciliation of schemas and data that plagues today's system administrators. Hopefully, as DiscoveryLink matures it will serve as a basis for more advanced solutions that will distill information from the oceans of data in which life sciences researchers are currently drowning, for the advancement of human health and for basic scientific understanding.

REFERENCES
[1] S. Altschul, W. Gish, W. Miller, et al. "Basic Local Alignment Search Tool." Journal of Molecular Biology 215, no. 3 (1990): 403-410.
[2] L. Falquet, M. Pagni, P. Bucher, et al. "The PROSITE Database, Its Status in 2002." Nucleic Acids Research 30, no. 1 (2002): 235-238.
[3] E. Birney and R. Durbin. "Using GeneWise in the Drosophila Annotation Experiment" [see comments]. Genome Research 10, no. 4 (2000): 547-548.
[4] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, et al. "GenBank." Nucleic Acids Research 30, no. 1 (2002): 17-20.
[5] A. Bairoch and R. Apweiler. "The SWISS-PROT Protein Sequence Database and Its Supplement TrEMBL in 2000." Nucleic Acids Research 28, no. 1 (2000): 45-48.

[6] D. L. Wheeler, D. M. Church, A. E. Lash, et al. "Database Resources of the National Center for Biotechnology Information: 2002 Update." Nucleic Acids Research 30, no. 1 (2002): 13-16.
[7] M. Ringwald, J. T. Eppig, D. A. Begley, et al. "The Mouse Gene Expression Database (GXD)." Nucleic Acids Research 29, no. 1 (2001): 98-101.
[8] L. M. Haas, P. M. Schwarz, P. Kodali, et al. "DiscoveryLink: A System for Integrated Access to Life Sciences Data Sources." IBM Systems Journal 40, no. 2 (February 2001): 489-511.
[9] P. Gupta and E. T. Lin. "DataJoiner: A Practical Approach to Multi-Database Access." In Proceedings of the International IEEE Conference on Parallel and Distributed Information Systems, 264. Los Alamitos, CA: IEEE Computer Society, 1994.
[10] L. M. Haas, D. Kossmann, E. L. Wimmers, et al. "Optimizing Queries Across Diverse Data Sources." In Proceedings of the Conference on Very Large Data Bases (VLDB), 276-285. San Francisco: Morgan Kaufmann, 1997.
[11] M. T. Roth and P. M. Schwarz. "Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources." In Proceedings of the Conference on Very Large Data Bases (VLDB), 266-275. San Francisco: Morgan Kaufmann, 1997.
[12] S. Davidson, C. Overton, V. Tannen, et al. "BioKleisli: A Digital Library for Biomedical Researchers." International Journal of Digital Libraries 1, no. 1 (January 1997): 36-53.
[13] T. Etzold and P. Argos. "SRS: An Indexing and Retrieval Tool for Flat File Data Libraries." Computer Applications in the Biosciences 9, no. 1 (1993): 49-57.
[14] P. Carter, T. Coupaye, D. Kreil, et al. "SRS: Analyzing and Using Data from Heterogeneous Textual Databanks." In Bioinformatics: Databases and Systems, edited by S. Letovsky. Boston: Kluwer Academic, 1998.

[15] I-M. A. Chen, A. S. Kosky, V. M. Markowitz, et al. "Constructing and Maintaining Scientific Database Views in the Framework of the Object-Protocol Model." In Proceedings of the Ninth International Conference on Scientific and Statistical Database Management, 237-248. Los Alamitos, CA: IEEE Computer Society, 1997.
[16] R. Stevens, C. Goble, N. W. Paton, et al. "Complex Query Formulation Over Diverse Information Sources in TAMBIS." In Z. Lacroix and T. Critchlow (eds). Bioinformatics: Managing Scientific Data, 189-223. San Francisco: Morgan Kaufmann, 2004.
[17] B. A. Eckman, A. S. Kosky, and L. A. Laroco Jr. "Extending Traditional Query-Based Integration Approaches for Functional Characterization of Post-Genomic Data." Bioinformatics 17, no. 7 (2001): 587-601.
[18] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. "Object Exchange Across Heterogeneous Information Sources." In Proceedings of the IEEE Conference on Data Engineering, 251-260. Los Alamitos, CA: IEEE Computer Society, 1995.

[19] A. Tomasic, L. Raschid, and P. Valduriez. "Scaling Heterogeneous Databases and the Design of DISCO." In Proceedings of International Conference on Distributed Computing Systems (ICDCS), 449-457. Los Alamitos, CA: IEEE Computer Society, 1996.
[20] M-C. Shan, R. Ahmed, J. Davis, et al. "Pegasus: A Heterogeneous Information Management System." In Modern Database Systems, edited by W. Kim, 664-682. Reading, MA: Addison-Wesley, 1995.
[21] L. Liu and C. Pu. "The Distributed Interoperable Object Model and its Application to Large-Scale Interoperable Database Systems." In Proceedings of the ACM International Conference on Information and Knowledge Management, 105-112. New York: Association for Computing Machinery, 1995.
[22] S. Adali, K. Candan, Y. Papakonstantinou, et al. "Query Caching and Optimization in Distributed Mediator Systems." In Proceedings of the ACM SIGMOD Conference on Management of Data, 137-148. New York: Association for Computing Machinery, 1996.
[23] International Organization for Standardization. "Information Technology - Database Languages - SQL - Part 3: Call Level Interface (SQL/CLI)." ISO/IEC 9075-3. Geneva, Switzerland: International Organization for Standardization, 1999.
[24] International Organization for Standardization. "Information Technology - Database Languages - SQL - Part 2: Foundation (SQL/Foundation)." ISO/IEC 9075-2. Geneva, Switzerland: International Organization for Standardization, 1999.
[25] International Organization for Standardization. "Information Technology - Database Languages - SQL - Part 9: Management of External Data (SQL/MED)." ISO/IEC 9075-9. Geneva, Switzerland: International Organization for Standardization, 2000.
[26] IBM. IBM DB2 Life Sciences Data Connect Planning, Installation and Configuration Guide, Version 7.2 FP 5. White Plains, NY: IBM, 2001. http://www-3.ibm.com/software/data/db2/lifesciencesdataconnect/db2ls-pdf.html.
[27] M. T. Roth, F. Ozcan, and L. M. Haas. "Cost Models Do Matter: Providing Cost Information for Diverse Data Sources in a Federated System." In Proceedings of the Conference on Very Large Data Bases (VLDB), 559-610. San Francisco: Morgan Kaufmann, 1999.
[28] L. M. Haas, R. J. Miller, B. Niswonger, et al. "Transforming Heterogeneous Data with Database Middleware: Beyond Integration." IEEE Data Engineering Bulletin 22, no. 1 (1999): 31-36.
[29] R. J. Miller, L. M. Haas, and M. Hernandez. "Schema Mapping as Query Discovery." In Proceedings of the Conference on Very Large Data Bases (VLDB), 77-88. San Francisco: Morgan Kaufmann, 2000.


CHAPTER 12

A Model-Based Mediator System for Scientific Data Management

Bertram Ludäscher, Amarnath Gupta, and Maryann E. Martone

A database mediator system combines information from multiple existing source databases and creates a new virtual, mediated database that comprises the integrated entities and their relationships. When mediating scientific data, the technically challenging problem of mediator query processing is further complicated by the complexity of the source data and the relationships between them. In particular, one is often confronted with complex multiple-world scenarios in which the semantics of individual sources, as well as the knowledge to link them, require a deeper modeling than is offered by current database mediator systems. Based on experiences with federation of brain data, this chapter presents an extension called model-based mediation (MBM). In MBM, data sources export not only raw data and schema information but also conceptual models (CMs), including domain semantics, to the mediator, effectively lifting data sources to knowledge sources. This allows a mediation engineer to define integrated views based on (1) the local CMs of registered sources and (2) auxiliary domain knowledge sources called domain maps (DMs) and process maps (PMs), respectively, which act as sources of glue knowledge. For complex scientific data sources, semantically rich CMs are necessary to represent and reason with scientific rationale for linking a wide variety of heterogeneous experimental assumptions, observations, and conclusions that together constitute an experimental study. This chapter illustrates the challenges using real-world examples from a complex neuroscience integration problem and presents the methodology and some tools, in particular the knowledge-based integration of neuroscience data (KIND) mediator prototype for model-based mediation of scientific data.


12.1 BACKGROUND
Seamless data access and sharing, handling of large amounts of data, federation and integration of heterogeneous data, distributed query processing and application integration, data mining, and visualization are among the common and recurring broad themes of scientific data management. A main stream of activity in the bioinformatics domain is concerned with sequence and structural databases such as GenBank, the Protein Data Bank (PDB), and Swiss-Prot, and much work is devoted to algorithmic challenges stemming from problems such as efficient sequence alignment and structure prediction. However, in addition to the well-known challenges of bioinformatics applications such as algorithmic complexity and scalability (e.g., in genomics), there are other major challenges that are sometimes overlooked, particularly when considering scientific data beyond the level of sequence and protein data (e.g., brain imagery data). These challenges arise in the context of information integration of scientific data and have to do with the inherent semantic complexity of (1) the actual source data and (2) the glue knowledge necessary to link the source data in meaningful ways. Traditional federated database system architectures, and those of the more recent database mediators developed by the database community, need to be extended to handle adequately information integration of complex scientific data from multiple sources. This extension is a combination of knowledge representation and mediator technology. In a nutshell:
Database Mediation + Knowledge Representation = Model-Based Mediation

With respect to their semantic heterogeneity (ignoring syntactic and system aspects), information integration/mediation scenarios (scientific or otherwise) can be roughly classified along a spectrum as follows: On one end, there are simple one-world scenarios; somewhere in the middle are simple multiple-world scenarios; and at the other end of the spectrum are complex multiple-world scenarios. An example of a simple one-world scenario (i.e., in which the modeled real-world entities can be related easily to one another and come from a single domain) is comparison shopping for books. A typical query is to find the cheapest price for a given book from a number of sources such as amazon.com and bn.com. An example of a simple multiple-world scenario is the integration of realtor and census data to annotate and rank real estate by neighborhood quality. Here, the approach combines and relates quite different kinds of information, but the relations between the multiple worlds are simple enough to be understood without deep domain knowledge. Examples of complex multiple-world scenarios are often found in scientific data management and are the subject of this chapter. Thus, simple and complex here refer to the degree to which specific domain semantics is required to formalize or even state meaningful associations and linkages between data objects of interest; it does not mean that the database and mediation technology for realizing such mediators is simple.¹ For example, to state the problem of what the result of an integrated comparison shopping view should be, a basic understanding of a books schema (title, authors, publisher, price, etc.) is sufficient. In particular, the association operation that links objects of interest across sources can be executed (at least in principle) as a syntactic join on the ISBN. Similarly, in the realtor example, data can be joined based on the ZIP code, latitude and longitude, or street address (i.e., by spatial joins that can be modeled as atomic function calls to a spatial oracle). To understand the basic linkage of information objects, no insight into the details of the spatial join is required.

This is fundamentally different for complex multiple-world scenarios as found in many scientific domains. There, even if data is stored in state-of-the-art (often Web accessible) databases, significant domain knowledge is required to articulate meaningful queries across disciplines (or within different micro-worlds of a single discipline); further examples are offered in the next section.
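The two simple scenarios can be made concrete with a short Python sketch. It is illustrative only: the record layouts, prices, ZIP-to-tract mapping, and the function `neighborhood_of` that stands in for the spatial oracle are all hypothetical and do not come from any system described in this chapter.

```python
# Hypothetical book offers from two sources (all values are made up).
source_a = [{"isbn": "111", "title": "Book A", "price": 42.00},
            {"isbn": "222", "title": "Book B", "price": 60.00}]
source_b = [{"isbn": "111", "title": "Book A", "price": 39.50},
            {"isbn": "333", "title": "Book C", "price": 25.00}]


def cheapest_offers(*sources):
    """One-world integration: a purely syntactic join on the ISBN key."""
    best = {}
    for offers in sources:
        for offer in offers:
            isbn = offer["isbn"]
            if isbn not in best or offer["price"] < best[isbn]["price"]:
                best[isbn] = offer
    return best


def neighborhood_of(zip_code):
    """Hypothetical spatial oracle: maps a ZIP code to a census tract.

    The mediator calls it as an atomic function and needs no insight
    into how the spatial join is actually computed.
    """
    tracts = {"92093": "tract-1", "92037": "tract-1", "85281": "tract-2"}
    return tracts.get(zip_code)


# Hypothetical realtor listings and census data.
listings = [{"id": "h2", "zip": "85281"}, {"id": "h1", "zip": "92093"}]
census = {"tract-1": {"quality": 0.9}, "tract-2": {"quality": 0.6}}


def rank_listings(listings, census):
    """Simple multiple-world integration: annotate listings, then rank."""
    annotated = []
    for house in listings:
        tract = neighborhood_of(house["zip"])
        if tract in census:
            annotated.append({**house, "quality": census[tract]["quality"]})
    return sorted(annotated, key=lambda h: h["quality"], reverse=True)
```

In both functions the linkage is decided by matching keys or by a single oracle call. Neither needs any domain knowledge beyond the schema, which is what separates these cases from the complex multiple-world scenarios discussed next.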

Outline
In this chapter, these challenges are illustrated with examples from ongoing collaborations with users and providers of scientific data sets, in particular from the neuroscience domain (see Section 12.2). Then a methodology called model-based mediation, which extends current database mediator technology by incorporating knowledge representation (KR) techniques to create explicit representations of domain experts' knowledge that can be used in various ways by mediation engineers and by the MBM system itself, is presented in Section 12.3. The goal of MBM could be paraphrased as:
Turning scientists' questions into executable database queries.

Section 12.4 introduces some of the KR formalisms (e.g., for domain maps and process maps) and describes their use in MBM. In Section 12.5 the KIND mediator prototype and other tools being developed at the San Diego Supercomputer Center (SDSC) and the University of California at San Diego (UCSD) are presented primarily in the context of the neuroscience domain. Section 12.6 discusses related work and concludes the chapter.

1. Such simple mediation scenarios often pose very difficult technical challenges (e.g., query processing in the presence of limited source capabilities) [1, 2].


12.2 SCIENTIFIC DATA INTEGRATION ACROSS MULTIPLE WORLDS: EXAMPLES AND CHALLENGES FROM THE NEUROSCIENCES
Some of the challenges of scientific data integration in complex multiple-world scenarios are illustrated using examples that involve different neuroscience worlds. Such examples occur regularly when trying to federate brain data across multiple sites, scales, and even species [3] and have led to new research and development projects aimed at overcoming the current limitations of biomedical data sharing and mediation [4].
Example 12.2.1 (Two Neuroscience Worlds). Consider two neuroscience laboratories, SYNAPSE and NCMIR,² that perform experiments on two different brain regions. The first laboratory, SYNAPSE, studies dendritic spines of pyramidal cells in the hippocampus. The primary schema elements are thus the anatomical entities reconstructed from 3D serial sections. For each entity (e.g., spines, dendrites), researchers make a number of measurements and study how these measurements change across age and species under several experimental conditions. In contrast, the NCMIR laboratory studies a different cell type, the Purkinje cells of the cerebellum. They inspect the branching patterns from the dendrites of filled neurons and the localization of various proteins in neuron compartments. The schema used by this group consists of a number of measurements of the dendrite branches (e.g., segment diameter) and the amount of different proteins found in each of these subdivisions. Assume each of the two schemas has a class C with a location attribute that has the value Pyramidal Cell dendrite and Purkinje Cell, respectively.

How are the schemas of SYNAPSE and NCMIR related? Evidently, they carry distinctly different information and do not even enter the purview of the schema conflicts usually studied in databases [5]. To the scientist, however, they are related for the following reason: Like pyramidal neurons, Purkinje cells also possess dendritic spines. Release of calcium in spiny dendrites occurs as a result of neurotransmission and causes changes in spine morphology (sizes and shapes obtained from SYNAPSE). Propagation of calcium signals throughout a neuron depends on the morphology of the dendrites, the distribution of calcium stored in a neuron, and the distribution of calcium binding proteins, whose subcellular distribution for Purkinje cells is measured by NCMIR.

Thus, a researcher who wanted to model the effects of neurotransmission in hippocampal spines would get structural information on hippocampal spines from SYNAPSE and information about the types of calcium binding proteins found in spines from NCMIR. Note that neither of the sources contains information that would allow a mediator system to bridge the semantic gap between them. Therefore, additional domain knowledge, independent of the observed experimental raw data of each source, is needed to connect the two sources. For the domain expert, here a neuroscientist, it is easy to provide the necessary glue knowledge:

Purkinje cells and Pyramidal cells have dendrites that have higher-order branches that contain spines. Dendritic spines are ion (calcium) regulating components. Spines have ion binding proteins. Neurotransmission involves ionic activity (release). Ion-binding proteins control ion activity (propagation) in a cell. Ion-regulating components of cells affect ionic activity (release).

2. Information about the two laboratories SYNAPSE and NCMIR is available at http://synapses.bu.edu and http://www-ncmir.ucsd.edu, respectively.

To capture such domain knowledge and make it available to the system, the proposed approach employs two kinds of ontologies, called domain maps and process maps, respectively. The former are aimed at capturing the basic domain terminology, and the latter are used to model different process contexts. Ontologies, such as the domain map in Figure 12.1, are often formalized in logic (in this case statements in description logic [6]; see Section 12.4.1). Together with additional inference rules (e.g., capturing transitivity of has), logic axioms like these formally capture the domain knowledge and allow mediator systems to work with this knowledge (e.g., a concept or class hierarchy can be used to determine whether the system should retrieve objects of class C' when the user is looking for instances of C).

Domain maps not only provide a concept-oriented browsing and data exploration tool for the end user, but, even more importantly, they can be used for defining and executing integrated view definitions (IVDs) at the mediator. The previous real-world example illustrates a fundamental difference in the nature of information integration as studied in most of the database literature and as is necessary for scientific data management. In the latter, seemingly unconnected schema can be semantically close when situated in the scientific context, which, in this case, is the neuroanatomy and neurophysiological setting described previously. Therefore, this is called mediation across multiple worlds and it is facilitated using domain maps such as the one shown (see Figure 12.1).
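How a mediator might work with such knowledge can be sketched in Python. The sketch is a simplification under stated assumptions: it encodes only three isa edges from the domain map of Figure 12.1 as a plain dictionary rather than as description logic, and the object identifiers, the instance-level has facts, and the function names (`is_subconcept`, `instances_of`, `has_closure`) are hypothetical. It shows the two inferences mentioned above: using the concept hierarchy to decide which classes C' to retrieve for a query on C, and applying a transitivity rule for has.

```python
# Concept hierarchy (isa edges) taken from the domain map: child -> parent.
ISA = {
    "Purkinje_Cell": "Spiny_Neuron",
    "Pyramidal_Cell": "Spiny_Neuron",
    "Spiny_Neuron": "Neuron",
}

# Hypothetical source objects, each tagged with its class and its source.
OBJECTS = [
    {"id": "syn-17", "cls": "Pyramidal_Cell", "source": "SYNAPSE"},
    {"id": "ncm-04", "cls": "Purkinje_Cell", "source": "NCMIR"},
    {"id": "x-01", "cls": "Neuron", "source": "OTHER"},
]

# Hypothetical instance-level "has" facts: whole -> direct parts.
HAS = {
    "ncm-04": {"dendrite-1"},
    "dendrite-1": {"branch-1"},
    "branch-1": {"spine-1"},
}


def is_subconcept(cls, target):
    """True if cls equals target or reaches it by following isa edges."""
    while cls is not None:
        if cls == target:
            return True
        cls = ISA.get(cls)
    return False


def instances_of(target):
    """Retrieve objects of every class C' subsumed by the query class C."""
    return [o["id"] for o in OBJECTS if is_subconcept(o["cls"], target)]


def has_closure(whole):
    """Transitivity rule: has(x, y) and has(y, z) imply has(x, z)."""
    seen, stack = set(), list(HAS.get(whole, ()))
    while stack:
        part = stack.pop()
        if part not in seen:
            seen.add(part)
            stack.extend(HAS.get(part, ()))
    return seen
```

With these definitions, a query for `Spiny_Neuron` reaches objects from both laboratories even though neither source uses that term, and the spine of a Purkinje cell is found through its dendrite and branch.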


A Model-Based Mediator System for Scientific Data Management

[Figure: domain map drawn as a graph over the concepts Neuron, Compartment, Axon, Dendrite, Soma, Branch, Shaft, Spine, Spiny_Neuron, Purkinje_Cell, Pyramidal_Cell, Protein, Ion_Binding_Protein, Ion_Regulating_Component, Ion_Activity, and Neurotransmission, with edges labeled has, contains, controls, regulates, and subprocess_of (left), and the following description logic axioms (right).]

    Neuron ⊑ ∃has.Compartment
    Axon, Dendrite, Soma ⊑ Compartment
    Spiny_Neuron ≡ Neuron ⊓ ∃has.Spine
    Purkinje_Cell, Pyramidal_Cell ⊑ Spiny_Neuron
    Dendrite ⊑ ∃has.Branch
    Shaft ⊑ Branch ⊓ ∃has.Spine
    Spine ⊑ ∃contains.Ion_Binding_Protein
    Spine ⊑ Ion_Regulating_Component
    Ion_Activity ⊑ ∃subprocess_of.Neurotransmission
    Ion_Binding_Protein ⊑ Protein ⊓ ∃controls.Ion_Activity
    Ion_Regulating_Component ≡ ∃regulates.Ion_Activity

FIGURE 12.1 A domain map for SYNAPSE and NCMIR (left) and its formalization in description logic (right). Unlabeled, gray edges correspond to "isa", that is, "⊑".
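The use of a class hierarchy by a mediator, as described before the figure, can be illustrated with a small sketch. It assumes that the "isa" edges between named concepts of Figure 12.1 are stored as a plain dictionary; the names `ISA`, `subsumed_by`, `instances_of`, and the sample objects are hypothetical and not part of the system described in this chapter.

```python
# Hypothetical encoding of the "isa" edges between named concepts in Figure 12.1.
ISA = {
    "Axon": {"Compartment"},
    "Dendrite": {"Compartment"},
    "Soma": {"Compartment"},
    "Spiny_Neuron": {"Neuron"},
    "Purkinje_Cell": {"Spiny_Neuron"},
    "Pyramidal_Cell": {"Spiny_Neuron"},
    "Shaft": {"Branch"},
    "Spine": {"Ion_Regulating_Component"},
    "Ion_Binding_Protein": {"Protein"},
}


def subsumed_by(concept, ancestor, isa=ISA):
    """Return True if `concept` equals `ancestor` or reaches it via isa edges."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(isa.get(current, ()))
    return False


def instances_of(query_class, objects, isa=ISA):
    """Select the objects whose class is subsumed by the query class."""
    return [name for name, cls in objects if subsumed_by(cls, query_class, isa)]


# Made-up objects, purely for illustration.
OBJECTS = [("cell_17", "Purkinje_Cell"), ("cell_42", "Pyramidal_Cell"), ("part_3", "Axon")]
print(instances_of("Neuron", OBJECTS))
```

A query for Neuron thus also retrieves objects registered under the more specific classes Purkinje_Cell and Pyramidal_Cell. A description logic reasoner would derive such subsumptions from the axioms themselves; the dictionary here merely stands in for that step.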

12.2.1 From Terminology and Static Knowledge to Process Context

While domain maps are useful to put data into a terminological and thus somewhat static knowledge context, a different knowledge representation has to be devised when trying to put data into a dynamic or process context. Consider, for example, the groups of neuroscientists who study the science of mammalian memory and learning. Many of these groups study a phenomenon called long-term potentiation (LTP) in nerve cells, in which repeated or sustained input to nerves in specific brain regions (such as the hippocampus) conditions them in such a manner that after some time, the neuron produces a large output even with a small amount of known input. Given this general commonality of purpose, however, individual scientists study and collect observational data for very different aspects of the phenomenon.
Example 12.2.2 (Capturing Process Knowledge). Consider a group [7] that studies the role of a specific protein, N-Cadherin, in the context of synapse formation during late-phase long-term potentiation (L-LTP), a subprocess of LTP. The data collected by the group consist of measurements that illustrate how the amount of N-Cadherin and the number of synapses (nerve junctions) both simultaneously increase in cells during L-LTP. Now consider that a different group [8] studies a new enzyme called CAMK-IV and its impact on a chemical reaction called phosphorylation of a protein called CREB. Their data are collected to show how modulating the amounts of CAMK-IV and other related enzymes affects the amount of CREB production, and how this, in turn, affects other products in the nucleus of the neurons. Ideally, the goal of mediating between experimental information from these two sources would be to produce an integrated view that enables an end-user scientist to get a deeper understanding of the LTP phenomenon. Specifically, the end user should be able to ask queries (and get answers) that exploit the scientific interrelationship between these experiments. In this way, the integrated access provided by a mediator system can lead to new observations and questions, thus eventually driving new experiments.

At the risk of oversimplification, the first group looks at synapse formation and is only interested in the fact that some proteins (including N-Cadherin) bring about the formation of synapses. They do not look at the processes leading to the production of these proteins. The second group looks at a specific chain of events leading up to the production of the proteins but does not identify which proteins are produced. The semantic connection between these two sources can be constructed in terms of the underlying event structure and the way the two groups elaborate on different parts of it. Figure 12.2 depicts a simplified view of the relationship explained previously and shows the cyclic progression of events leading to synapse formation during LTP: Red edges situate the first source with respect to the overall process, and blue edges situate the second source. In either case, the dashed lines show the subsequence of events the sources glossed over, or abstracted. Thus, the first source does not have any information pertaining to phosphorylates(CAMK-IV, CREB), and the second source does not have any data related to forms(protein, synapse). Neither source has any data about the (black) edge synthesizes(gene, protein).

Domain maps allow data providers to put their source data into a static/terminological context, and process maps allow them to do the same for a dynamic/process context. Together, they capture valuable glue knowledge that resides at the mediator and facilitates integration of hard-to-correlate sources: in particular, concept-oriented data discovery (semantic browsing) [9], view definition, and semantic query optimization [10]. To make model-based mediation effective, it is also necessary to hook the elements of the source schema to the domain map and the process map. This process, called the contextualization mechanism, is central to the MBM framework.


[Figure: cyclic process map. Surviving edge labels: sustained_input; phosphorylates(CAMK-IV, CREB), marked b; transcribes(CREB, gene), marked b; synthesizes(gene, protein), unmarked; forms(protein, synapse), marked r; and dashed abstraction edges.]

FIGURE 12.2 A simple process map. Blue and red edges (marked b and r, respectively) depict processes about which two data sources/research groups have observational data; dashed edges indicate abstractions (short cuts). No observational data is available for the edge 6-7; hence, this edge is shown in black (unmarked).


12.3 MODEL-BASED MEDIATION


In mediator systems, differences in syntax and data models of sources S1, S2, ... are resolved by wrappers that translate the raw data into a common data format, typically extensible markup language (XML). In most current mediator systems, all other differences, in particular schema heterogeneities, are then handled by an appropriate integrated view definition (IVD), which is defined using an XML query language [11, 12]. This architecture is extended by lifting exported source data from the level of uninterpreted, semistructured data in XML syntax to the semantically rich level of conceptual models (CMs) with domain knowledge. Then, the mediator's views can be defined in terms of CMs (i.e., IVDs are defined in a global-as-view fashion) and thus make use of a semantically richer model involving class hierarchies, complex object structure, and properties of relationships (relational constraints, cardinalities).

12.3.1 Model-Based Mediation: The Protagonists


The underlying methodology and procedures of MBM involve users in different roles and at different levels:

• Data providers are typically domain experts, such as bench scientists who would like to make their data from experimental studies available to the community. In MBM, data providers can not only export an XML-queriable version of their data, but they can also export domain semantics by lifting the exported data and schema information from a structural level (e.g., XML DTDs [Document Type Definitions]) to the level of CMs.³ Allowing data providers to situate or contextualize (see Example 12.3.2) their primary data themselves has significant benefits. First, data providers know best where their data fit on the glue maps. Second, even without the IVDs defined by mediation engineers, data are automatically associated across different sources via their domain/process map contexts.

• View providers specify integrated view definitions (IVDs), that is, they program complex views in an expressive, declarative rule language. The IVDs are defined over the registered complex sources CM(S1), CM(S2), ... and the glue knowledge sources in the mediator's repository. Thus, view providers are the actual mediation engineers and they bring together (as a team or individually) expertise in the application domain and in databases and knowledge representation. The new fused objects defined by an IVD can be contextualized, based on the contexts provided by the source conceptual models (see right side of Figure 12.6). In this way, an integrated, virtual view exported by the mediator becomes a first-class citizen of the federation; it is considered a conceptual-level source CM(M) itself and can be used just like any original CM-wrapped source.

• End users can start with semantic browsing of CMs, by navigating the domain and process ontologies in the style of topic maps, in which a user navigates through a concept space by following certain relationships, going up and down concept hierarchies and so on. Users may also focus their view by issuing graph queries over domain or process maps, which return only the subgraphs of interest. Eventually, the user can access raw data from different sources, which is (due to contextualization) automatically organized by context [9], and access derived data resulting from user queries against the mediated views.

3. The W3C working group XML Schema (http://www.w3.org/XML/Schema) and similar efforts like RELAX NG (http://www.oasis-open.org/committees/relax-ng/) play an intermediate role between purely structure-based models (DTDs) and richer semantic models with constraint mechanisms.

12.3.2 Conceptual Models and Registration of Sources at the Mediator

The following components of the conceptual model CM of a source S can be distinguished:

    CM(S) = OM(S) ∪ ONT(S) ∪ CON(S)

The different logical components and their dependencies are depicted in Figure 12.3:

• OM(S) is the object model of the source S and provides signatures for classes, associations between classes, and functions. OM(S) structures can be defined extensionally by facts (EDB), or intensionally via rules (IDB).

• ONT(S) is the local ontology of the source S. It defines concepts and their relationships from the source's perspective.

• ONTG(S) is the ontological grounding of OM(S) in ONT(S), that is, a mapping between the object model OM(S) (classes, attributes, associations) and the concepts and relationships of ONT(S).

• CON(S) is the contextualization of the local source ontology relative to a mediator ontology, ONT(M).


[Figure: dependency diagram of the logical components. Surviving labels: Integrated View Definition IVD(M), Ontological Grounding ONTG(S), GAV, LAV.]

FIGURE 12.3 Model-based mediation: dependencies among logical components.

• IVD(M) is the mediator's integrated view definition and comprises logic view definitions in terms of the sources' object models OM(S) and the mediator's ontology ONT(M). By posing queries against the mediator's IVD(M), the user has the illusion of interacting with a single, semantically integrated source instead of interacting with independent, unrelated sources.

In the following, the local parts of CM(S) (OM(S), ONT(S), and ONTG(S)) are presented through a running example. For details on the contextualization CON(S) see Example 12.3.2 and the related work on registering scientific data sources [13].
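As a rough illustration of how the components of CM(S) depend on each other, the sketch below stores them as plain Python sets and checks that every grounding link refers to an attribute of the object model and a concept of the local ontology. The class `ConceptualModel`, its field layout, and the sample values are hypothetical; the chapter itself specifies these components in a logic rule language, not in Python.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptualModel:
    """Hypothetical container for the parts of CM(S) named in the text."""
    source: str
    om: set = field(default_factory=set)    # OM(S): attribute names such as "STRUCTURE.name"
    ont: set = field(default_factory=set)   # ONT(S): (concept, relation, concept) triples
    ontg: set = field(default_factory=set)  # ONTG(S): (attribute, concept) links
    con: set = field(default_factory=set)   # CON(S): (local concept, mediator concept) links

    def concepts(self):
        """All concepts mentioned in the local ontology ONT(S)."""
        return {c for (left, _, right) in self.ont for c in (left, right)}

    def dangling_groundings(self):
        """ONTG(S) links that refer to something missing from OM(S) or ONT(S)."""
        known = self.concepts()
        return {(attr, concept) for (attr, concept) in self.ontg
                if attr not in self.om or concept not in known}


# Sample registration data, made up for illustration.
cm = ConceptualModel(
    source="CCDB",
    om={"STRUCTURE.name", "EXPERIMENT.cell_type"},
    ont={("brain", "has(co)", "cerebellum")},
    ontg={("STRUCTURE.name", "cerebellum"), ("IMAGE.depth", "cerebellum")},
)
print(cm.dangling_groundings())
```

A registration step at the mediator could run a check of this kind before accepting a source, since ONTG(S) is only meaningful relative to OM(S) and ONT(S).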
Example 12.3.1 (Cell-Centered Database [CCDB]). Figure 12.4 shows pieces of a simplified version of the conceptual model CM(CCDB) of a real-world scientific information source called the Cell-Centered Database [14]. The database consists of a set of EXPERIMENT objects. Each experiment collects a number of cell IMAGEs from one or more instruments. For each image, the scientists mark out cellular STRUCTUREs in the image and perform measurements on them [14]. They also identify a second set of regions, called DEPOSITs, in images that show the deposition of molecules of proteins or genetic markers. In general, a region marked as a deposit does not necessarily coincide with a region marked as a structure.


Classes in OM(CCDB)

    EXPERIMENT(id:id, date:date, cell_type:string, images:SET(image)).
    IMAGE(id:id, instrument:ENUM{c_microscope, e_microscope}, resolution:float,
          size_x:int, size_y:int, depth:int,
          structures:SET(structure), regions:SET(deposit)).
    STRUCTURE(id:id, name:string, length:float, surface_area:float,
              volume:float, bounding_box:Cube).
    DEPOSIT(id:id, substance_name:string, deposit_type:string,
            relative_intensity:ENUM{dark, normal, bright},
            amount:float, bounding_box:Cube).
    ...

Associations in OM(CCDB)

    co_localizes_with(DEPOSIT.substance_name, DEPOSIT.substance_name, STRUCTURE.name).
    surrounds(s1:STRUCTURE, s2:STRUCTURE).
    ...

Functions in OM(CCDB)

    deposit_in_structure(DEPOSIT.id) → SET(STRUCTURE.name)
    ...

Source Ontology - ONT(CCDB)

    brain --has(co)--> cerebellum --has(co)--> cerebellar cortex --has(co)--> vermis    (ONT1)
    dendrite --has(co)--> spine process --has(pm)--> spine                              (ONT2)
    cell --projects_to--> brain_region
    denaturation --isa--> process
    globus_pallidus --isa--> brain_region                                               (ONT3)
    tc_has(co) := transitive_closure(has(co)).
    tc_has(pm) := transitive_closure(has(pm)).                                          (ONT4)
    has_co_pm := chain(tc_has(co), tc_has(pm)).                                         (ONT5)

Ontological Grounding - ONTG(CCDB)

    domain(STRUCTURE.volume) in [0,300]
    domain(STRUCTURE.name) in tc_has(co)(cerebellum)                                    (OG1)
    domain(EXPERIMENT.cell_type) in tc_has(co)(cerebellum)                              (OG2)
    EXPERIMENT.cell_type --projects_to--> globus_pallidus                               (OG3)
    DENATURED_PROTEIN --exhibits--> denaturation.                                       (OG4)
    ...

FIGURE 12.4 Conceptual model for registering the Cell-Centered Database [14].
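The class signatures of OM(CCDB) and the function deposit_in_structure can be mirrored in a short sketch. It is a hedged illustration only: the figure gives just the type name Cube, so the axis-aligned box representation, the overlap test, the reduced attribute lists, and all sample values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cube:
    """Axis-aligned box given by two corner points (assumed representation)."""
    lo: tuple
    hi: tuple

    def overlaps(self, other):
        """True if the two boxes intersect along every axis."""
        return all(a_lo <= b_hi and b_lo <= a_hi
                   for a_lo, a_hi, b_lo, b_hi
                   in zip(self.lo, self.hi, other.lo, other.hi))


@dataclass(frozen=True)
class Structure:
    id: str
    name: str
    bounding_box: Cube


@dataclass(frozen=True)
class Deposit:
    id: str
    substance_name: str
    bounding_box: Cube


def deposit_in_structure(deposit_id, deposits, structures):
    """Return the names of structures whose boxes overlap the given deposit."""
    deposit = next(d for d in deposits if d.id == deposit_id)
    return {s.name for s in structures
            if s.bounding_box.overlaps(deposit.bounding_box)}


# Made-up sample data, purely for illustration.
STRUCTURES = [
    Structure("s1", "spine", Cube((0, 0, 0), (2, 2, 2))),
    Structure("s2", "dendrite", Cube((5, 5, 5), (9, 9, 9))),
]
DEPOSITS = [Deposit("d1", "calbindin", Cube((1, 1, 1), (3, 3, 3)))]
print(deposit_in_structure("d1", DEPOSITS, STRUCTURES))
```

The function returns structure names rather than objects, following the signature SET(STRUCTURE.name) in the figure.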


Note that OM(CCDB) in Figure 12.4 includes classes that are instantiated with observed data, that is, the extensional database EDB(CCDB). In addition to classes, OM(CCDB) stores associations, which are n-ary relationships between object classes. The association co_localizes_with specifies which pairs of substances occur together in a specific structure. The object model also contains functions, such as the domain-specific methods that can be invoked by a user as part of a query. For example, when the mediator or another client calls the function CCDB.deposit_in_structure() and supplies the ID of a deposit object, the function returns a set of structure objects that spatially overlap with the specified deposit object.

Next, the source's local ontology, ONT(CCDB), is described. Here, an ontology ONT(S) consists of a set of concepts and inter-concept relationships,⁴ possibly augmented with additional inference rules and constraints.⁵ The ontological grounding ONTG(S) links the object model OM(S) to the source ontology ONT(S). The source ontology serves a number of different purposes.
Creating a Terminological Frame of Reference

For defining the terminology of a specific scientific information source, the source declares its own controlled vocabulary through ONT(S). More precisely, ONT(S) comprises the terms (i.e., concepts) of this vocabulary and the relationships among them. The concepts and relationships are often represented as nodes and edges of a directed graph, respectively. Two examples of inter-concept relations are has(co) and has(pm), which are different kinds of part-whole relationships.⁶ In Figure 12.4, items ONT1 and ONT2 show fragments of such a concept graph. Once a concept graph is created for a source, one may use it to define additional constraints on object classes and associations.
Semantics of Relationships

The The edges edges in in the the concept concept graph graph of of the the source source ontology ontology represent represent inter-concept inter-concept relationships. semantics, which relationships. Often Often these these relationships relationships have have their their own own semantics, which must must be be tc_has specified specified within within ONT ONT ( S S) ).. Item Item ONT4 O N T 4 declares declares two two new new relationships, relationships, t c _ h a s ( co co ) and c_has pm) . After this declaration and and t tc _ h a s ((pm). After registration, registration, the the mediator mediator interprets interprets this declaration and creates the transitive relations creates the new new (possibly (possibly materialized) materialized) transitive relations on on top top of of the the base base

4. Most Most formal approaches approaches (e.g., those those based on description logic) consider consider binary binary relationships only.
5 5.. For example, ONT4, ONT4, ONT5 ONT5 in Figure 12.4 define virtual relations such as transitive closure over the base relations. 6. By standards standards of meronyms, there are different kinds of the has has relation, including component componente o ) , portion-mass pm ) , member-collection has ) , stuff-object has so ) , and object has has ((co), portion-mass has has ((pm), has (me (mc), has ((so), p a ) [15]. place-area place-area has has ((pa)


relations has(co) and has(pm) provided by the source S. Similarly, the item ONT5 is interpreted by the mediator using a higher-order rule for chaining binary relations:
chain(R1,R2)(X,Y) if R1(X,Z), R2(Z,Y)

With this, ONT5 creates a new relationship has_co_pm(X,Y) provided that there is a Z such that tc_has(co)(X,Z) and tc_has(pm)(Z,Y).
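The transitive relations and the chaining rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: relations are modeled as sets of pairs, the concept names in `has_co` and `has_pm` are made up for the example, and nothing here reflects how the mediator actually materializes these relations.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a binary relation given as a set of pairs."""
    closure = set(pairs)
    while True:
        # Join the closure with itself: (x, z) and (z, y) yield (x, y).
        derived = {(x, y) for (x, z1) in closure for (z2, y) in closure if z1 == z2}
        if derived <= closure:
            return closure
        closure |= derived


def chain(r1, r2):
    """chain(R1,R2)(X,Y) if R1(X,Z), R2(Z,Y)."""
    return {(x, y) for (x, z1) in r1 for (z2, y) in r2 if z1 == z2}


# Hypothetical base relations exported by a source S (illustrative names only).
has_co = {("cerebellum", "purkinje_cell"), ("purkinje_cell", "dendrite")}
has_pm = {("dendrite", "spine")}

tc_has_co = transitive_closure(has_co)
tc_has_pm = transitive_closure(has_pm)
has_co_pm = chain(tc_has_co, tc_has_pm)
```

With these sample relations, `has_co_pm` relates both `purkinje_cell` and `cerebellum` to `spine`, because each reaches `dendrite` through has(co) edges before the single has(pm) step.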
Ontological Grounding of OM(S)

A local domain constraint specifies additional properties of the given extensional database and thereby establishes an ontological grounding ONTG(S) between the local ontology ONT(S) and the object model OM(S) (see Figure 12.3). Items OG1 and OG2 in Figure 12.4 refine the domains of the attributes EXPERIMENT.cell_type and STRUCTURE.name from the original type declaration (STRING). The refinement constrains them to take values from those nodes of the concept graph that are descendants of the concept cerebellum through the has(co) relationship. This constraint illustrates an important role of the local ontology in a conceptually lifted source. By constraining the domain of an attribute to be a concept name, c, the corresponding object instance o is semantically about c. In addition, this also implies that o is about any ancestor concept, c', of c, where ancestor is defined via has(co) edges only. Similarly, if a specific instance, STRUCTURE.name, has the value spine process, it is also about dendrite (ONT2 in Figure 12.4). In addition to linking attributes to concept names, a constraint may also involve inter-concept relationships. Assume projects_to(cell, brain_region) is a relationship in the source ontology ONT(CCDB). A constraint may assert that for all instances e of class EXPERIMENT, projects_to(e.cell_type, 'globus_pallidus') holds (OG3). The constraint thus refines the original relationship projects_to to suit the specific semantics of OM(CCDB). Such constraint-defined correspondences between OM(S) and ONT(S) are used in the contextualization process [13].
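The domain refinement and the "about" relation can be illustrated with a small Python sketch. The has(co) edges below are hypothetical stand-ins for the concept graph of Figure 12.4, and the helper names (`satisfies_grounding`, `about`) are invented for this example.

```python
# Hypothetical has(co) edges of a concept graph, written as (whole, component).
HAS_CO = {
    ("cerebellum", "purkinje_cell"),
    ("purkinje_cell", "dendrite"),
    ("dendrite", "spine process"),
}


def descendants(concept, edges):
    """All concepts reachable from `concept` by following edges source -> target."""
    found, frontier = set(), [concept]
    while frontier:
        current = frontier.pop()
        for source, target in edges:
            if source == current and target not in found:
                found.add(target)
                frontier.append(target)
    return found


def ancestors(concept, edges):
    """All concepts from which `concept` is reachable (descendants in the reversed graph)."""
    return descendants(concept, {(target, source) for source, target in edges})


def satisfies_grounding(value, root="cerebellum"):
    """Domain constraint: the attribute value must be a has(co)-descendant of `root`."""
    return value in descendants(root, HAS_CO)


def about(value):
    """An object tagged with concept `value` is about it and all its has(co)-ancestors."""
    return {value} | ancestors(value, HAS_CO)
```

Under these assumed edges, an object whose STRUCTURE.name is `spine process` passes the constraint and is also about `dendrite`, while a value outside the cerebellum subgraph is rejected.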

Intensional Definitions

In the CM wrapper of a source, S, one can define virtual classes and associations that can be exported to the mediator as first-class, queriable items by means of an intensional database IDB(S). For example, one can create a new virtual class called DENATURED_PROTEIN in IDB(CCDB) via the rule:
DENATURED_PROTEIN(ProtName) if DEPOSIT(ID, ProtName, protein, dark, ..., _), deposit_in_structure(ID) ≠ ∅


Thus, an instance of a DENATURED_PROTEIN is created when a dark protein deposit is recorded in an instance of DEPOSIT and there is some structure in which this deposit is found. As a general principle of creating a CM wrapper, such a definition will be supplemented by additional constraints to connect it to the local ontology. For example, assume that ONT(CCDB) already contains a concept called process. Item ONT3 defines denaturation as a specialization of process. The constraint OG4 completes the semantic specification about the new DENATURED_PROTEIN object.
Contextual References
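The DENATURED_PROTEIN rule behaves like a view over the extensional data. The following Python sketch shows that reading; the `Deposit` fields, the sample rows, and the lookup table behind `deposit_in_structure` are all assumptions made for the example, since the actual DEPOSIT schema is not spelled out here.

```python
from typing import NamedTuple


class Deposit(NamedTuple):
    # Assumed attribute layout; the real DEPOSIT class may have more fields.
    id: int
    prot_name: str
    substance: str  # e.g. "protein"
    density: str    # e.g. "dark"


# Hypothetical extensional data.
DEPOSITS = [
    Deposit(1, "protA", "protein", "dark"),
    Deposit(2, "protB", "protein", "light"),
    Deposit(3, "protC", "protein", "dark"),
]
STRUCTURES_BY_DEPOSIT = {1: {"dendrite"}, 2: {"soma"}, 3: set()}


def deposit_in_structure(deposit_id):
    """Structures in which the deposit is found (empty set if none)."""
    return STRUCTURES_BY_DEPOSIT.get(deposit_id, set())


def denatured_protein():
    """Virtual class: names of dark protein deposits found in at least one structure."""
    return {
        d.prot_name
        for d in DEPOSITS
        if d.substance == "protein"
        and d.density == "dark"
        and deposit_in_structure(d.id)  # non-empty set is truthy
    }
```

In this sample, only `protA` qualifies: `protB` is not dark, and `protC` is not recorded in any structure.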

It is a common practice for scientific data sources to tag object instances with attributes from a public standard and to use controlled vocabularies for the values of some of these attributes. For example, the source can specify that the domain of the DEPOSIT.id field can be accessed through an internal method, which, given a protein name, gets its id from a specific database. For example, one can use get_expasy_protein_id to retrieve this information from the Swiss-Prot database on the Web. How the source enforces this integrity constraint is internal to the source and not part of its conceptual export schema.

12.3.3 Interplay Between Mediator and Sources

To address the source registration issue, one needs to specify which components of an existing n-source federation can be seen, or accessed, by the new, (n+1)st source. A federation at the mediator consists of: (1) currently registered conceptual models CM(S) of each participating source S, (2) one or more global ontologies ONT(M) residing at the mediator that have been used in the federation, and (3) integrated views IVD(M) defined in a global-as-view (GAV) fashion. Typical mediator ontologies ONT(M) are public, meaning they serve as domain-specific expert knowledge and thus can be used to glue conceptual models from multiple sources. Examples of such ontologies are the Unified Medical Language System (UMLS) from the National Library of Medicine (see footnote 7) and the Biological Process Ontology from the Gene Ontology Consortium (see footnote 8). In the presence of multiple ontologies, articulations (mappings between different source ontologies [16])

7. The Unified Medical Language System (UMLS), available at http://www.nlm.nih.gov/research/umls/, is, strictly speaking, a metathesaurus, or a semi-formal ontology with a limited set of pre-defined relationships such as broader-term/narrower-term.
8. See http://www.geneontology.org/process.ontology for information about the Biological Process Ontology from the Gene Ontology Consortium.


FIGURE 12.5 A domain map (DM) after situating new concepts MyNeuron and MyDendrite (dark). [Figure omitted; recoverable labels: nodes Neuron, Medium_Spiny_Neuron, Compartment, Soma, Axon, Dendrite, Neurotransmitter, Globus_Pallidus_External, MyNeuron, MyDendrite, AND, OR; edges has, proj, exp.]

can be used to register with the mediator information about inter-source relationships. Note that a source, S, usually cannot see all of the previously discussed components (1-3) when defining its conceptual model: Although S sees the mediator's ontologies, ONT(M), and thus can define its own conceptual model, CM(S), relative to the mediator's ontology in a local-as-view (LAV) fashion, it cannot directly employ another source's conceptual model, CM(S'), nor can it query the mediator's integrated view, IVD(M), which is defined global-as-view (GAV) on top of the sources. The former is no restriction because S' can register CM(S'), in particular ONT(S'), with the mediator, at which point S can indirectly refer to registered concepts of S' via ONT(M). The latter guarantees that query processing in this setting does not involve recursion through the Web (i.e., between a source S and the mediator M). The dependency graph in Figure 12.3 is acyclic (see footnote 9).

Example 12.3.2 (Contextualization: Local-as-View). Consider the domain map in Figure 12.5. Lighter-colored nodes correspond to concepts that the mediator understands and a source can see. Now assume a source, S, wants to register information about specific neurons and their dendrites, but the mediator ontology (domain map) does not have dedicated names for those specific kinds of neurons and dendrites. In MBM this problem is solved by contextualizing

9. At the cost of loss of efficiency, the restriction of no recursion through the Web could be lifted.


the new local source concepts as views on the mediator's global concepts: In Figure 12.5, the darker-colored source concepts are hooked to the mediator's domain map, thereby defining their meaning relative to the mediator's concepts. This is achieved by sending the following first-order axioms (here in description logic syntax) to the mediator:

$\mathit{MyDendrite} \equiv \mathit{Dendrite} \sqcap \exists \mathit{exp}.\mathit{Dopamine\_R}$

$\mathit{MyNeuron} \sqsubseteq \mathit{Medium\_Spiny\_Neuron} \sqcap \exists \mathit{proj}.\mathit{Globus\_pallidus\_external} \sqcap \forall \mathit{has}.\mathit{MyDendrite}$

Thus instances of MyDendrite are exactly those dendrites that express Dopamine R(eceptor), and MyNeuron objects are medium spiny neurons projecting to Globus Pallidus External and only have MyDendrites. Assuming properties are inherited along the transitive closure of isa, it follows that MyNeuron, like any Medium_Spiny_Neuron, projects to certain structures (OR in Figure 12.5). With the newly registered knowledge, it follows that MyNeuron definitely projects to Globus_Pallidus_External. To specify that it only projects to the latter, a nonmonotonic inheritance (e.g., using F-logic with well-founded semantics) can be employed.
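To make the reading of the two axioms concrete, the sketch below evaluates them over a tiny, invented set of instances. It assumes a closed-world view of the sample data, which a description logic reasoner would not assume; all object identifiers and the function names are hypothetical.

```python
# Hypothetical instance data (closed-world reading, for illustration only).
DENDRITE = {"d1", "d2"}
DOPAMINE_R = {"r1"}
MEDIUM_SPINY_NEURON = {"n1", "n2"}
GLOBUS_PALLIDUS_EXTERNAL = {"g1"}
EXP = {("d1", "r1")}                # exp(dendrite, receptor)
PROJ = {("n1", "g1")}               # proj(neuron, region)
HAS = {("n1", "d1"), ("n2", "d2")}  # has(neuron, dendrite)


def my_dendrite():
    """MyDendrite is defined as: Dendrite AND (exists exp.Dopamine_R)."""
    return {
        x for x in DENDRITE
        if any(s == x and o in DOPAMINE_R for s, o in EXP)
    }


def may_be_my_neuron(n):
    """Necessary conditions for MyNeuron (the axiom is a subsumption, not a definition)."""
    return (
        n in MEDIUM_SPINY_NEURON
        and any(s == n and o in GLOBUS_PALLIDUS_EXTERNAL for s, o in PROJ)
        and all(o in my_dendrite() for s, o in HAS if s == n)
    )
```

Because the MyDendrite axiom is an equivalence, membership can be computed from the data; the MyNeuron axiom only states necessary conditions, so the second function can rule candidates out but cannot by itself classify an object as a MyNeuron.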

Note that the intuitive graphical contextualization depicted in Figure 12.5 is not unique; logically equivalent domain maps may have different graphical representations (see footnote 10). For domain maps that can be completely axiomatized using a description logic, a reasoning system such as Fast Classification of Terminologies (FaCT) [17] can be employed to compute the deductive closure and, in particular, to derive a unique concept hierarchy and check consistency of a domain map.

12.4 KNOWLEDGE REPRESENTATION FOR MODEL-BASED MEDIATION

This section takes a closer look at the principal mechanisms for specifying glue knowledge: ontologies in the form of domain maps (DMs) and process maps (PMs).

10. This is similar to the fact that the same query can have many different syntactic representations. In general, equivalence of first-order (or SQL) queries is not decidable.


12.4.1 Domain Maps

As is standard for ontologies, DMs name and specify relevant concepts by describing the characteristic relationships among them [18]. In this way, DMs provide the basic domain semantics needed to glue data across different sources in multiple-world scenarios. DMs can be depicted more intuitively in the form of labeled, directed graphs. In contrast to many other graph-based notations, however, DMs have a solid formal semantics via a translation to logic rules. The graph form of DMs is defined as follows.
Definition 12.1 Domain Maps

Let $\mathcal{C}$ be a set of symbols called concepts and $\mathcal{R}$ a set of roles. A DM is a directed, labeled graph with nodes $\mathcal{C}$. A concept $C \in \mathcal{C}$ can be understood as denoting a class of objects sharing a set of common properties. To understand how a concept $C$ is defined relative to other concepts, one needs to inspect its outgoing edges. $c \in C$ denotes that $c$ is an instance of concept $C$ (see footnote 11). Edges are distinguished in DMs as follows:

1. $C \xrightarrow{\mathit{isa}} D$ (short: $C \to D$) defines that every $C$ isa $D$, that is, $c \in C$ implies $c \in D$. The subconcept/subclass relation is very common in DMs, thus the isa label is usually omitted and the shorthand notation $C \to D$ is used instead.

2. $C \xrightarrow{\mathit{ex}:r} D$ defines that for every $c \in C$, there exists some r-related $d \in D$. Here, $r \in \mathcal{R}$ is a role, or, a binary relation $r(c, d)$ between instances of $C$ and $D$.

3. $C \xrightarrow{\mathit{all}:r} D$ defines that for every $c \in C$ and all $x$ that are r-related to $c$ (i.e., for which $r(c, x)$ holds), $x \in D$ holds.

4. $C \xrightarrow{r} D$ defines that if $c \in C$ and $d \in D$, then they are r-related, that is, $r(c, d)$ holds.

5. $\mathit{AND} \to \{D_1, \ldots, D_n\}$ indicates that an AND-node with $n$ outgoing edges to $D_1, \ldots, D_n$, respectively, defines an anonymous concept, the intersection of concepts $D_1, \ldots, D_n$.

6. $\mathit{OR} \to \{D_1, \ldots, D_n\}$ indicates that an OR-node with $n$ outgoing edges to $D_1, \ldots, D_n$, respectively, defines an anonymous concept, the union of concepts $D_1, \ldots, D_n$.

11. Thus, C and D can be viewed as unary predicates.


7. $C \xrightarrow{\equiv} D$ defines that $C$ is equivalent to $D$, meaning every $C$ isa $D$ and vice versa. It could also have been denoted as $C \equiv D$. However, the directed edge keeps the distinction between $C$ (the definiendum) and its definition $D$ (definiens).

Note that D can be an atomic or a defined concept. When unique, AND nodes are omitted and outgoing arcs directly attached to the concept being defined. In Figure 12.5, unlabeled, grey edges and edges labeled proj (projects-to) correspond to isa edges and ex:proj edges, respectively.
Reified Roles as Concepts

In DMs, as in description logics, the concepts are being defined, whereas the roles are only a means to that end. To capture the semantics of roles, or define their properties in terms of each other, they need to be defined in terms of concepts themselves. In logic, this "quoting mechanism" is known as reification.

Example 12.4.1 (Roles as Concepts). Consider a DM involving the roles regulates, activates, and inhibits, and assume that in the given domain, activates(C, D) and inhibits(C, D) are special cases of regulates(C, D). Instead of introducing a special notation for sub-roles (see footnote 12) and then defining the mechanics of how roles can be related to one another, roles are turned into first-class citizens by making them concepts using an operator, make-concept (mc). The modeling capabilities of DMs can be applied to roles and, for example, simply state that $mc(\mathit{activates}) \xrightarrow{\mathit{isa}} mc(\mathit{regulates})$. By modeling roles as concepts, more domain semantics can be formalized, leading to better knowledge engineering.

In particular, during query processing, such formalized knowledge can be automatically employed by the system: Given a DM (formalized as logic rules), an MBM query or view definition involving activates and regulates knows that the former is a subconcept of the latter. If during query processing a goal regulates('cAMP', Protein) is evaluated, the logic rules corresponding to the DM knowledge allow the system to deduce that any result for activates('cAMP', Protein) is also an answer for regulates('cAMP', Protein). This is correct because a substitutability principle holds, which allows the system to replace a concept, D, with any of its subconcepts, C, that is, for which $C \to D$ holds.
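The substitutability principle for reified roles can be sketched as a small query routine. This is an illustrative Python model, not the MBM query processor: the `ROLE_ISA` table, the fact tables, and the placeholder object names `P1` to `P3` and `Q` are assumptions introduced for the example.

```python
# mc(activates) isa mc(regulates), mc(inhibits) isa mc(regulates).
ROLE_ISA = {"activates": "regulates", "inhibits": "regulates"}

# Hypothetical base facts per role: (subject, object) pairs.
FACTS = {
    "activates": {("cAMP", "P1")},
    "inhibits": {("Q", "P3")},
    "regulates": {("cAMP", "P2")},
}


def subconcepts(role):
    """`role` together with every role that reaches it via isa edges."""
    result = {role}
    changed = True
    while changed:
        changed = False
        for sub, sup in ROLE_ISA.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result


def answer(role, subject):
    """Objects related to `subject` by `role` or by any of its sub-roles."""
    return {
        obj
        for r in subconcepts(role)
        for (s, obj) in FACTS.get(r, set())
        if s == subject
    }
```

Here the goal regulates('cAMP', Protein) returns both `P2` (a direct regulates fact) and `P1` (inherited from activates), whereas a query for activates only sees `P1`.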

12. RDF(S) has such a notion called subproperty; see http://www.w3.org/TR/rdf-schema/.


Generating the Role Hierarchy

When making a role into a concept, the isa hierarchy (see footnote 13) on concepts induces an isa hierarchy on roles.
Domain Maps as Logic Rules

Domain maps borrow from description logics [19] the notions of concept and roles. Indeed, while some of the previously mentioned constructs of DMs have equivalent formalizations in description logic [20], the fact that additional mechanisms are needed, such as roles as concepts and recursive and parameterized roles and concepts, and the fact that executable DMs are wanted during query processing, require a translation into a more general logic framework. In the following, DMs are formalized in a minimal subset of F-logic [21]. The semantics of DMs could be formalized in other languages, in particular in other deductive database languages. The use of F-logic is convenient because a small subset of it already matches nicely the minimal requirements established for an MBM system [20]. Moreover, implementations of F-logic are readily available [22, 23] and have been used by the authors in different mediator prototypes before [24, 25, 26].
In F-logic, c : C and C :: D denote class membership ($c \in C$) and subclassing ($C \subseteq D$), respectively. Thus, there are logic rules of the form head if body that express the F-logic semantics of ":" and "::". Say that "::" is a reflexive, transitive, and antisymmetric (see footnote 14) relation.
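The intended behavior of ":" and "::" can be mimicked with sets of pairs. The sketch below is a plain Python approximation rather than F-logic: it closes "::" under reflexivity and transitivity, checks antisymmetry, and propagates membership upward. The class names are borrowed from the running example, but the particular isa edges and the instance `n1` are assumptions.

```python
def subclass_closure(direct):
    """Reflexive, transitive closure of '::' given direct (sub, super) pairs."""
    classes = {cls for pair in direct for cls in pair}
    closure = set(direct) | {(cls, cls) for cls in classes}
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            return closure
        closure |= derived


def is_antisymmetric(closure):
    """True if no two distinct classes are subclasses of each other."""
    return all(a == b or (b, a) not in closure for (a, b) in closure)


def memberships(instance_of, closure):
    """c : D holds if c : C and C :: D."""
    inherited = {
        (obj, sup)
        for (obj, cls) in instance_of
        for (sub, sup) in closure
        if sub == cls
    }
    return set(instance_of) | inherited


# Hypothetical schema and data.
DIRECT = {("MyNeuron", "Medium_Spiny_Neuron"), ("Medium_Spiny_Neuron", "Neuron")}
CLOSURE = subclass_closure(DIRECT)
MEMBERS = memberships({("n1", "MyNeuron")}, CLOSURE)
```

A terminological cycle such as A :: B together with B :: A would make `is_antisymmetric` return False, which corresponds to the situation the antisymmetry requirement is meant to exclude.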
Definition 12.2 Compilation of Domain Maps

The mapping $\Psi : DM \to FL$ of domain maps to F-logic is defined as follows:

1. $\Psi(C) := \{C : \mathit{concept}\}$, for all atomic concepts $C \in \mathcal{C}$

2. $\Psi(r) := \{r : \mathit{role}\}$, for all roles $r \in \mathcal{R}$

3. $\Psi(C \xrightarrow{\mathit{isa}} D) := \{C :: \Phi(D)\} \cup \Psi(D)$

4. $\Psi(C \xrightarrow{\mathit{ex}:r} D) :=$
(a) $\{r(c, \_d),\ \_d : \Phi(D)\ \text{if}\ c : C,\ \_d = \mathit{skol}_D(c)\} \cup \Psi(D)$
(b) $\{\mathit{False}\ \text{if}\ c : C,\ \neg(r(c, \_d),\ \_d : \Phi(D))\} \cup \Psi(D)$

5. $\Psi(C \xrightarrow{\mathit{all}:r} D) :=$
(a) $\{d : \Phi(D)\ \text{if}\ c : C,\ r(c, d)\} \cup \Psi(D)$
(b) $\{\mathit{False}\ \text{if}\ c : C,\ r(c, d),\ \neg\, d : \Phi(D)\} \cup \Psi(D)$

13. Strictly speaking, the isa hierarchy does not have to be a hierarchy but can be any directed acyclic graph.
14. Because concepts are implemented as F-logic classes, this avoids terminological cycles.


6. $\Psi(C \xrightarrow{r} D) := \{r(c, d)\ \text{if}\ c : C,\ d : \Phi(D)\} \cup \Psi(D)$

7. $\Psi(\mathit{AND} \to \{D_1, \ldots, D_n\}) := \{d : \mathit{skol}_{\mathit{AND}}\ \text{if}\ d : \Phi(D_1), \ldots, d : \Phi(D_n)\} \cup \Psi(D_1) \cup \ldots \cup \Psi(D_n)$

8. $\Psi(\mathit{OR} \to \{D_1, \ldots, D_n\}) := \{d : \mathit{skol}_{\mathit{OR}}\ \text{if}\ d : \Phi(D_1) \lor \ldots \lor d : \Phi(D_n)\} \cup \Psi(D_1) \cup \ldots \cup \Psi(D_n)$

9. $\Psi(C \xrightarrow{\equiv} D) := \{C :: \Phi(D),\ \Phi(D) :: C\} \cup \Psi(D)$

Remarks

Here, $\Phi(D)$ is defined similar to $\Psi(D)$, but it returns for a compound concept description D a new auxiliary symbol $\Phi(D)$ representing the compound. For atomic D, $\Phi(D) = \Psi(D)$ holds simply. The symbols $\mathit{skol}_X$ produce new Skolem function symbols every time they are used in the translation $\Psi$: For example, in 4(a), a symbolic representation is invented for the existentially quantified variable _d. Note that c, d, _d are logic variables, while C, D, $D_i$, and False are constants (see footnote 15). The different variants (a) and (b) in the translations of DMs correspond to different intended uses: in 4(a), an anonymous object is created for the $\exists$-quantified variable; in 5(a), all C.r objects are type coerced into instances of D. In contrast, the (b) translations only check whether the constraints induced by the DM edges are indeed satisfied and signal an inconsistency (False) if they are not.
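The difference between the asserting (a) variants and the checking (b) variants can be illustrated over finite data. The following Python sketch is an approximation under stated assumptions: class extents and roles are plain sets, Skolem terms are represented as strings, and the function names and sample objects (`n1`, `d1`) are invented for this example rather than taken from an F-logic implementation.

```python
from collections import defaultdict


def ex_assert(c_name, role, d_name, members, rel):
    """Variant 4(a): give every c in C an r-related Skolem object skol_D(c) in D."""
    for c in sorted(members[c_name]):
        witness = f"skol_{d_name}({c})"  # symbolic stand-in for the unknown object
        rel[role].add((c, witness))
        members[d_name].add(witness)


def ex_check(c_name, role, d_name, members, rel):
    """Variant 4(b): return every c in C that has no r-related object in D."""
    return {
        c for c in members[c_name]
        if not any(s == c and o in members[d_name] for s, o in rel[role])
    }


def all_assert(c_name, role, d_name, members, rel):
    """Variant 5(a): coerce every object r-related to some c in C into D."""
    for s, o in rel[role]:
        if s in members[c_name]:
            members[d_name].add(o)


def all_check(c_name, role, d_name, members, rel):
    """Variant 5(b): return the pairs (c, x) with r(c, x), c in C, and x not in D."""
    return {
        (s, o) for s, o in rel[role]
        if s in members[c_name] and o not in members[d_name]
    }


# Hypothetical data: one MyNeuron with one dendrite and no recorded projection.
members = defaultdict(set, {"MyNeuron": {"n1"}})
rel = defaultdict(set, {"has": {("n1", "d1")}})

missing_before = ex_check("MyNeuron", "proj", "Globus_Pallidus_External", members, rel)
ex_assert("MyNeuron", "proj", "Globus_Pallidus_External", members, rel)
missing_after = ex_check("MyNeuron", "proj", "Globus_Pallidus_External", members, rel)

violations_before = all_check("MyNeuron", "has", "MyDendrite", members, rel)
all_assert("MyNeuron", "has", "MyDendrite", members, rel)
violations_after = all_check("MyNeuron", "has", "MyDendrite", members, rel)
```

In checking mode, a non-empty result plays the role of deriving False; in asserting mode, the same edge silently repairs the data by adding a Skolem witness or by extending the extent of D.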
Example 12.4.2 (Roles as Concepts, Continued). Consider a DM stating that NProt $\xrightarrow{\text{isa}}$ Protein, NProt $\xrightarrow{\text{regulates}}$ Gene, and cfos $\xrightarrow{\text{isa}}$ Gene.¹⁶ The role regulates is conceptualized by asserting mc(regulates). When making its hidden arguments visible, mc(regulates(C, D)) really denotes a family of regulates concepts. The isa hierarchy on regulates concepts is derived from the isa hierarchy of its arguments. For example:

$\text{mc}(\text{regulates}(\text{NProt}, \text{cfos})) \xrightarrow{\text{isa}} \text{mc}(\text{regulates}(\text{NProt}, \text{Gene})) \xrightarrow{\text{isa}} \text{mc}(\text{regulates}(\text{Protein}, \text{Gene}))$

15. This is reversed from the usual convention used in the rest of the chapter to match this DM notation.
16. Here, NProt stands for nuclear protein.

Deriving the Role Hierarchy

Previously, the unary operator mc, which turns role literals into concepts, was introduced. It is implemented in FL as a subclass of the (meta-)class concept by asserting mc :: concept and adding further rules for deriving the role hierarchy from the concept hierarchy. The roles to be conceptualized are given by the user as a set of mc-declarations such as r(C,D) : mc. The rules are:

r(C,D) : mc, r(C',D') : mc, r(C,D) :: r(C',D')
    if (r(C,D) : mc ∨ r(C',D') : mc), C :: C', D :: D'.        (up/down)

r(C,D) : mc, r(C',D') : mc, r(C,D) :: r(C',D')
    if (r(C,D') : mc ∨ r(C',D) : mc), C :: C', D :: D'.        (mixed)

Observe that with these rules, the desired result is obtained in Example 12.4.2.
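As an illustration only, the up/down rule can be approximated outside F-logic. The following Python sketch assumes a reflexive-transitive isa relation and applies the rule in a single pass: two role literals are related whenever their arguments are pairwise related by isa and at least one of them was declared by the user. The function names, the tuple encoding of role literals, and the omission of the mixed rule and of fixpoint iteration are assumptions of this sketch, not part of the KIND implementation.

```python
from itertools import product

# Concept hierarchy of Example 12.4.2: direct isa edges (sub, super).
CONCEPTS = {"NProt", "Protein", "cfos", "Gene"}
ISA = {("NProt", "Protein"), ("cfos", "Gene")}


def subclass_closure(concepts, isa_edges):
    """Reflexive-transitive closure of isa as a set of (sub, super) pairs."""
    closure = {(c, c) for c in concepts} | set(isa_edges)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def derive_role_hierarchy(role, declared, concepts, isa_edges):
    """One pass of the up/down rule (hypothetical helper).

    Returns edges (role, C, D) -> (role, C', D') with C isa C', D isa D',
    where at least one side is a user-declared role concept.
    """
    sub = subclass_closure(concepts, isa_edges)
    edges = set()
    for (c, c_sup), (d, d_sup) in product(sub, repeat=2):
        lower, upper = (c, d), (c_sup, d_sup)
        if lower != upper and (lower in declared or upper in declared):
            edges.add(((role, *lower), (role, *upper)))
    return edges


# The user declares regulates(NProt, Gene) as a role concept.
edges = derive_role_hierarchy("regulates", {("NProt", "Gene")}, CONCEPTS, ISA)
for lower, upper in sorted(edges):
    print(lower, "isa", upper)
```

With the single declaration regulates(NProt, Gene), the sketch yields the two isa edges of Example 12.4.2.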
Recursive Concepts

Consider the part-of relationship has_a and its interaction with isa. For example, MyNeuron isa Medium_Spiny_Neuron, which in turn has_a Neostriatum; therefore MyNeuron has_a Neostriatum holds (see Figure 12.5). In the general case, this gives rise to a recursive rule: if $C \xrightarrow{\text{isa}} D$ and $D \xrightarrow{\text{has\_a}} E$, then $C \xrightarrow{\text{has\_a}} E$. Similarly, one can define that isa and has_a are independently transitive and that isa is anti-symmetric. For such recursive definitions, an intuitive graph notation can be devised (e.g., using a dashed edge from the concept being defined to its recursive definition; see Ludäscher et al. [27]). In a declarative, rule-based query language like F-logic, an executable specification is:
has_a(X,Z) if X :: Y, has_a(Y,Z).

Note that X, Y, Z are concept variables. Such F-logic rules can also be used at the mediator to handle inductive definitions, such as ONT4 in Figure 12.4, in particular when the source does not have the capability to evaluate recursive definitions.
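To make the recursive rule concrete, here is a minimal Python sketch that computes its least fixpoint by naive iteration, roughly what a mediator would have to do when a source cannot evaluate recursion itself. The function name and the pair-based encoding of isa and has_a facts are assumptions made for illustration; an F-logic engine would evaluate the rule directly.

```python
def derive_has_a(isa, has_a):
    """Least fixpoint of: has_a(X, Z) if X isa Y, has_a(Y, Z).

    isa and has_a are sets of (subject, object) pairs over concept names.
    """
    derived = set(has_a)
    changed = True
    while changed:
        changed = False
        for (x, y) in isa:
            for (y2, z) in list(derived):
                if y == y2 and (x, z) not in derived:
                    derived.add((x, z))
                    changed = True
    return derived


isa = {("MyNeuron", "Medium_Spiny_Neuron")}
has_a = {("Medium_Spiny_Neuron", "Neostriatum")}

facts = derive_has_a(isa, has_a)
print(("MyNeuron", "Neostriatum") in facts)
```

The loop stops as soon as a full pass adds no new fact, which is guaranteed because the set of concept pairs is finite.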
Parameterized Roles and Concepts

Part-of relationships such as has_a come in different flavors, F (e.g., F ∈ {member/collection, portion/mass, phase/activity, ...}), and transitivity does not necessarily carry over across flavors [15].¹⁷ This is most naturally modeled by a parameterized role, has_a(F), which is transitive within each flavor, F, but which may interact in other ways across flavors. Definition 12.2 shows how domain maps can be formalized as logic rules via a mapping $\Psi$. This mapping can be extended for parameterized roles and concepts. For example, assume the parameterized role has_a(F) should hold between concepts C and D only for some flavors, F, satisfying a condition $\varphi(F)$. One can extend $\Psi$ and compile such a parameterized DM edge into F-logic as follows:

$\Psi\bigl(C \xrightarrow{\ \text{has\_a}(F)\,[\varphi(F)]\ } D\bigr) := \{\, \text{has\_a}(F)(c, d) \ \text{if}\ c : C,\ d : \Phi(D),\ \varphi(F) \,\} \cup \Psi(D)$

Note that a parameterized role such as has_a(F) has a first-order semantics in F-logic despite its higher-order syntax [28].

17. For example, orchestra has_a musician and musician has_a arm, but not orchestra has_a arm.
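The behavior of a flavored has_a role can be sketched in Python by storing the flavor as part of each fact and closing the relation transitively within a flavor only. This is a hedged illustration: the triple encoding, the function name, the flavor label component/object, and the extra fact about hand are assumptions introduced here, not taken from the chapter.

```python
def transitive_within_flavor(facts):
    """Close (flavor, whole, part) facts transitively, one flavor at a time.

    Facts of different flavors are never chained together.
    """
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for (f1, a, b) in list(closed):
            for (f2, c, d) in list(closed):
                if f1 == f2 and b == c and (f1, a, d) not in closed:
                    closed.add((f1, a, d))
                    changed = True
    return closed


facts = {
    ("member/collection", "orchestra", "musician"),
    ("component/object", "musician", "arm"),  # hypothetical flavor label
    ("component/object", "arm", "hand"),      # hypothetical extra fact
}

closed = transitive_within_flavor(facts)
# Derived within one flavor: musician has_a hand.
print(("component/object", "musician", "hand") in closed)
# Not derived across flavors: orchestra has_a arm.
print(any(w == "orchestra" and p == "arm" for (_, w, p) in closed))
```

Keeping the flavor as an explicit first argument mirrors the first-order reading of has_a(F): the parameter is just another term, so ordinary rule evaluation suffices.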

12.4.2 Process Maps


PMs provide abstractions of process knowledge, that is, temporal and/or causal relationships between events that can be used for situating and linking data across different sources. Like DMs, PMs are directed, labeled graphs, albeit with a very different semantics: Nodes are used to model states, and edges correspond to state transitions, which are labeled with a process name describing the transition. In this way, data providers (e.g., bench scientists) can hook their raw data not only to the (given or refined) DMs but also to processes witnessed in their experimental studies databases (see Figure 12.2 and Figure 12.8).
Initial Process Semantics $PM_0$

Intuitively, an edge of the form $e_\pi = s \xrightarrow{\{\varphi\}\,\pi\,\{\psi\}} s'$ of a PM means that the process $\pi$ leads from state s to s'; $\varphi$ is a necessary precondition that must hold in s for $\pi$ to happen, and $\psi$ is a postcondition, which holds in s' as a result of $\pi$. $PM_0$ denotes the set of all initial process semantics.

The edge $e_\pi$ of a PM is called a process occurrence of $\pi$ in PM. Thus, a process occurrence specifies where in a PM a process occurs, and which pre- and postconditions, $\varphi$ and $\psi$, this occurrence satisfies. In addition to the semantics implied by the occurrence of $e_\pi$ in PM, a process $\pi$ can have an initial semantics, $\{\varphi\}\,\pi\,\{\psi\}$, associated with the process name, $\pi$.

To allow for parameterization of processes, edge labels are considered in which process names are first-order atoms (of the form $\pi = \pi(T_1, \ldots, T_n)$, where each term $T_i$ is a logic variable or constant). For example, consider $\pi$ = opens(Channel) as describing the opening process of an ion channel. Its initial semantics are defined by the expression:

{¬open(Channel)} opens(Channel) {open(Channel)}

meaning that any transition along a process occurrence of $\pi$ = opens(Channel) in a PM must be from a state where open(Channel) was false. In the successor state, however (after $\pi$ has happened), open(Channel) is true.
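The pre/postcondition reading of a process occurrence can be illustrated with a small Python sketch. It assumes a state is simply the set of fluents that are true in it, and that conditions are Boolean functions over such states; the class and function names are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

# Assumption of this sketch: a state is the set of fluents true in it.
State = FrozenSet[str]


@dataclass(frozen=True)
class Process:
    name: str
    pre: Callable[[State], bool]   # must hold in the source state s
    post: Callable[[State], bool]  # must hold in the target state s'


def valid_transition(process: Process, s: State, s_next: State) -> bool:
    """Check {pre} process {post} for one transition s -> s_next."""
    return process.pre(s) and process.post(s_next)


# Initial semantics {not open(Channel)} opens(Channel) {open(Channel)}.
opens = Process(
    name="opens(Channel)",
    pre=lambda s: "open(Channel)" not in s,
    post=lambda s: "open(Channel)" in s,
)

closed_state: State = frozenset()
open_state: State = frozenset({"open(Channel)"})

print(valid_transition(opens, closed_state, open_state))  # admissible
print(valid_transition(opens, open_state, open_state))    # precondition fails
```

A transition that starts in a state where the channel is already open is rejected, because the precondition of opens(Channel) does not hold there.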


From Process Maps to Domain Maps

The first-order predicates occurring in $\varphi$ and $\psi$, such as open(Channel), are called fluents, because their truth is state dependent. It is required that the set of fluent predicate symbols, $\mathcal{F}$, is disjoint from the set, $\mathcal{P}$, of process names and from the sets of concept and role names, $\mathcal{C}$ and $\mathcal{R}$, respectively. In contrast, the constant parameters used in process occurrences, such as Channel, are allowed to be concepts from $\mathcal{C}$. For example, a DM may have that NMDA_receptor $\xrightarrow{\text{isa}}$ Calcium_channel $\xrightarrow{\text{isa}}$ Channel, in which case the process knowledge about the opening of channels and the static knowledge from a DM are directly linked through the common concept Channel. Similarly, just as roles are made first-class citizens by reifying them into concepts, the same can be done for processes, by specifying additional semantics of processes using domain maps.

Example 12.4.3 (Processes as Concepts). Consider the binds_to(X,Y) process with the initial semantics:

{¬bound(X,Y)} binds_to(X,Y) {bound(X,Y)}

Now consider a DM in which processes were reified as concepts as follows:

$\text{dimerizes}(X) \xrightarrow{\ \text{isa}\ \|\ X = Y\ } \text{binds\_to}(X, Y)$

It is easy to see that this (parameterized) DM edge, when translated into F-logic, allows the system to conclude in the combined knowledge base (DM ∪ $PM_0$) that

{¬bound(X,X)} dimerizes(X) {bound(X,X)}.
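The conclusion of Example 12.4.3 amounts to inheriting the initial semantics of binds_to(X,Y) under the substitution Y := X. A minimal Python sketch of that substitution step follows; the term encoding as (functor, arguments) pairs and the helper names are assumptions for illustration, and no actual F-logic inference is performed.

```python
def substitute(term, binding):
    """Apply a variable binding to a (functor, args) term."""
    functor, args = term
    return (functor, tuple(binding.get(a, a) for a in args))


# Initial semantics of binds_to(X, Y): the precondition is the negation of
# `fluent`, the postcondition is `fluent` itself.
binds_to = {"process": ("binds_to", ("X", "Y")), "fluent": ("bound", ("X", "Y"))}


def specialize(semantics, name, params, binding):
    """Inherit the semantics of a super-process under the given binding."""
    return {
        "process": (name, params),
        "fluent": substitute(semantics["fluent"], binding),
    }


# dimerizes(X) isa binds_to(X, Y) under the constraint X = Y.
dimerizes = specialize(binds_to, "dimerizes", ("X",), {"Y": "X"})
print(dimerizes)
```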
Process Elaboration and Abstraction

The edge, $e_\pi$, of a process occurrence can be seen as an abstraction of a real process. In addition to its initial semantics, $PM_0$, and the semantics induced by its concrete occurrence in a specific PM, this abstraction can be elaborated by replacing $e_\pi$ with a (sub-)process map elab($e_\pi$), whose initial and final states are s and s'. The newly created nodes and edges of the elaboration, elab($e_\pi$), are annotated with the same unique elaboration identifier eID. The eID includes at least a reference to $e_\pi$, indicating the edge being elaborated, and the author (data provider) of the elaboration. The converse of elaboration, abstraction, takes a connected subgraph, $\Pi(S, s_0, s_f, E)$, with nodes S, edges E, and distinguished nodes $s_0, s_f \in S$ (initial and final state), and abstracts $\Pi$ into a single edge $e_\pi$ = abstract($\Pi(S, s_0, s_f, E)$). The abstracted edges E of $\Pi$ are marked with a unique abstraction identifier aID, which includes a reference to the new abstraction edge, $e_\pi$, and the author of the abstraction.

Definition 12.3 Process Maps


A PM $\Pi(S, s_0, s_f, E)$ is a connected, directed graph with nodes S, labeled edges E, and initial and final states $s_0, s_f \in S$. The edges $e_\pi$ of E are of the form

$s \xrightarrow{\{\varphi\}\,\pi\,\{\psi\}} s' \qquad (e_\pi)$

where the process name $\pi$ is a first-order atom and $\varphi$ and $\psi$ are first-order formulas, called the precondition and postcondition of $e_\pi$, respectively.

Given an edge $e = s_a \xrightarrow{\pi} s_b$ of a process map $\Pi(S, s_0, s_f, E)$, the elaboration, elab(e), of e is a process map $\Pi'(S', s_a, s_b, E')$ such that (1) the initial and final states are $s_a, s_b$, (2) $S' \cap S = \{s_a, s_b\}$, and (3) all $e' \in E'$ are linked to e via a common, unique identifier eid(e', e).

A connected subgraph of a PM with distinguished initial and final state is called a subprocess map (sub-PM). Given a PM $\Pi(S, s_0, s_f, E)$, the abstraction of a sub-PM $\Pi'(S', s_a, s_b, E')$ of $\Pi$, denoted abstract($\Pi'$), is a new edge $e_{\pi'} = s_a \xrightarrow{\pi'} s_b$, where (i) $e_{\pi'} \notin E$, and (ii) all $e' \in E'$ are linked to $e_{\pi'}$ via a common, unique identifier aid(e', $e_{\pi'}$).

Marking edges with elaboration and abstraction identifiers guarantees one-to-one mappings between an edge and its elaboration and, similarly, between a sub-PM and its abstraction. In this way, data providers can "double-click" on an edge, $e_\pi$, and elaborate the process into a PM, $\Pi$, to provide more precise links to their data. Conversely, they may collapse a sub-PM, $\Pi$, into a single edge, $e_\pi$, if the data does not provide information at the detailed level of $\Pi$ and hence is more adequately hooked to the overall process, $e_\pi$.
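Elaboration and abstraction can be pictured as two small graph operations. The Python sketch below is one possible encoding under stated assumptions: edges are immutable records, an elaboration is a linear chain of sub-processes with freshly named intermediate states, and the identifier strings, step names, and author label are hypothetical. The chapter's definition allows arbitrary sub-process maps, not just chains.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    process: str
    eid: Optional[str] = None  # elaboration identifier
    aid: Optional[str] = None  # abstraction identifier


def elaborate(target: Edge, steps: List[str], author: str) -> List[Edge]:
    """Elaborate `target` into a chain of sub-processes from src to dst.

    Intermediate states are new, so only src and dst are shared with the
    original PM; every new edge carries the same eid.
    """
    eid = f"elab[{target.src}-{target.process}->{target.dst}]@{author}"
    inner = [f"{eid}#s{i}" for i in range(1, len(steps))]
    states = [target.src] + inner + [target.dst]
    return [Edge(states[i], states[i + 1], steps[i], eid=eid)
            for i in range(len(steps))]


def abstract(sub_edges: List[Edge], s_a: str, s_b: str, process: str,
             author: str) -> Tuple[Edge, List[Edge]]:
    """Collapse a sub-PM into one new edge and mark its edges with an aid."""
    aid = f"abs[{s_a}-{process}->{s_b}]@{author}"
    return Edge(s_a, s_b, process), [replace(e, aid=aid) for e in sub_edges]


e = Edge("s", "s_prime", "opens(Channel)")
# Hypothetical sub-process names, for illustration only.
sub = elaborate(e, ["binds_ligand", "changes_conformation"], author="lab1")
new_edge, marked = abstract(sub, "s", "s_prime", "opens(Channel)", "lab1")
print(sub)
print(new_edge == e)
```

Because the identifier embeds both the elaborated edge and the author, two providers elaborating the same edge obtain distinct, traceable elaborations.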

Process Maps as Logic Rules

Similarly to DMs, one can translate PMs into a logic representation $\Psi$(PM). The difference is that for DMs, the formalization in description logic or F-logic yields a first-order logic semantics, whose unique minimal model, $\mathcal{M}$(DM), interprets concepts and roles as unary and binary predicates over a set of individuals. The model, $\mathcal{M}$, implies that data objects, which are linked as concept instances to a DM, have the properties defined by the domain map (e.g., the neurons in the images linked to MyNeuron in Example 12.3.2 project to Globus_Pallidus_External). In contrast, the logic representation of a PM specifies only some process properties via pre- and postconditions in the PM and the PM's graph structure. The details of the semantics are omitted due to lack of space. The basic idea is that the graph structure of PMs (with its embedded hierarchy of elaborations and abstractions) is formalized via a nested Kripke structure in which the nodes of the PM (states) have associated first-order models and in which labeled process edges specify a temporal accessibility relation between states.¹⁸ In particular, a process elaboration of an edge, $e_\pi$, adds to the initial semantics, $PM_0$, and the semantics of the pre- and postconditions of the concrete occurrence of $e_\pi$ in PM, an elaboration semantics (i.e., a sequence of intermediate states with first-order constraints along the paths of the elaboration).

12.5 MODEL-BASED MEDIATOR SYSTEM AND TOOLS

At the core of the MBM framework is the KIND mediator system. Other important components are the Spatial Markup and Rendering Tool (SMART) Atlas for annotating, displaying, and relating data with brain atlases; the CCDB, defined in Example 12.3.1 as the primary source of experimental data; and the Knowledge Map Explorer (Know-ME) tool for concept-based navigation of source and mediated views. For a description of Know-ME, see Qian et al. [29]; the other components are described in the following text.

12.5.1 The KIND Mediator Prototype


The architecture of the KIND mediator system is depicted on top in Figure 12.6. At the bottom, a snapshot of the prototype execution is shown: After the user issues a query against the integrated view, the system situates the results on a domain map, in this case ANATOM (a simple ontology of brain anatomy). By clicking on the orange diamonds, the user can retrieve the actual result objects, grouped by concept (foreground).

In the first prototype [9, 30], the F-logic implementation FLORA [31] was used as the only query processing and deduction engine. As part of a large, collaborative project [4], the prototype is being re-implemented as a modular, distributed mediator system that includes several additional components, including the following:
• Logic plan generator: Given a user query, Q, and an integrated view definition, IVD, Q ∘ IVD is translated into a plan generator program PG(Q ∘ IVD) that, when executed, produces an initial logic query plan for Q ∘ IVD. Here "∘" denotes query composition.

• Query rewriter: This module takes a logic query plan and rewrites it into an executable, distributed plan based on the capabilities of a source (e.g., conjunctive queries with binding patterns or complete SQL).

• Execution plan compiler: For final execution, the rewritten plan is compiled into a logic program whose run-time execution sends the corresponding subqueries to wrapped sources, retrieves results, and post-processes them (e.g., joins, group-bys, and unions across sources) before sending them to the user.

• SQL plan generator: For relational sources (those having SQL query capabilities), this wrapper module translates a logic query plan into an equivalent SQL statement, similar to Draxler's tool [32].

A preliminary version of this new system has recently been demonstrated [13] and includes all of the modules previously listed. Plan generation and rewriting are implemented using logic programming technology [33]. The SQL plan generator has been implemented in Java. It is planned that the final system will include specialized inference engines such as FLORA and XSB [34] for handling deductive and object-oriented database capabilities, and FaCT [17] for reasoning tasks over domain maps that are formalized in description logics.

FIGURE 12.6 Top: Architecture of the KIND model-based mediator. Bottom: Snapshot of the prototype. Background left shows a mediator shell for issuing ad hoc queries against CM(M); background right shows a generated subgraph having the requested result data shown in their anatomical context. Clicking on a (diamond) result node retrieves the actual result data (see foreground center).

18. See Lausen et al., Section 6 [27] for a formalization of hierarchical processes using nested Kripke structures.
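To give a feel for what an SQL plan generator does, the following Python sketch translates a conjunctive logic query into a single SELECT statement. It is not the KIND module (which is written in Java) and not Draxler's tool; it is a simplified stand-in under several assumptions: variables are strings starting with an uppercase letter (the Prolog convention), every other argument is a constant, and the relation and column names in the example schema are hypothetical.

```python
def is_variable(arg):
    """Prolog-style convention (assumed): variables start with uppercase."""
    return isinstance(arg, str) and arg[:1].isupper()


def sql_literal(value):
    """Render a constant as an SQL literal, quoting strings."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def conjunctive_to_sql(head_vars, body, schema):
    """Translate answer(head_vars) :- body into one SQL SELECT.

    body   : list of (relation, args) atoms
    schema : relation name -> list of column names (positional)
    """
    first_seen = {}   # variable -> "alias.column" of its first occurrence
    from_items = []
    conditions = []
    for i, (rel, args) in enumerate(body):
        alias = f"t{i}"
        from_items.append(f"{rel} AS {alias}")
        for col, arg in zip(schema[rel], args):
            ref = f"{alias}.{col}"
            if is_variable(arg):
                if arg in first_seen:      # repeated variable: equi-join
                    conditions.append(f"{first_seen[arg]} = {ref}")
                else:
                    first_seen[arg] = ref
            else:                          # constant: selection condition
                conditions.append(f"{ref} = {sql_literal(arg)}")
    sql = "SELECT " + ", ".join(first_seen[v] for v in head_vars)
    sql += " FROM " + ", ".join(from_items)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


# Hypothetical schema and query:
#   answer(N) :- neuron(N, R), region(R, neostriatum).
schema = {"neuron": ["name", "region_id"], "region": ["id", "label"]}
body = [("neuron", ["N", "R"]), ("region", ["R", "neostriatum"])]
sql = conjunctive_to_sql(["N"], body, schema)
print(sql)
```

Shared variables become join conditions and constants become selections, which is the core of the correspondence between conjunctive queries and select-project-join SQL.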

12.5.2 The Cell-Centered Database and SMART Atlas: Retrieval and Navigation Through Multi-Scale Data
The CCDB mentioned earlier, in Example 12.3.1, houses different types of high-resolution, 3D light and electron microscopic reconstructions of cells and subcellular structures produced at the National Center for Microscopy and Imaging Research¹⁹ [14]. It contains structural and protein distribution information derived from confocal, multiphoton, and electron microscopy, including correlated microscopy. Many of the data sets are derived from electron tomography, a powerful technique for deriving 3D information from electron microscopic specimens. Electron tomography is similar in concept to medical imaging techniques like computerized axial tomography (CAT) scans and magnetic resonance imaging (MRI) in that it derives a 3D volume from a series of 2D projections through a structure. In this case, the structures are contained in sections prepared for electron microscopy, which are tilted through a limited angular range. Examples of data sets in the CCDB are shown on the left of Figure 12.7.
19. The National Center for Microscopy and Imaging Research is a research facility specialized in the development of technologies for improving the understanding of biological structure and function relationships spanning the dimensional range from 5 nm³ to 50 μm³ (http://www.ncmir.ucsd.edu).


FIGURE 12.7

Left: Examples of tomographic data sets in the CCDB. A and B show a selectively stained spiny dendrite from a Purkinje cell. A is a projection of the volume reconstruction (dendrite appears as white against dark background). B is the segmented dendrite. C and D show a tomographic reconstruction of the node of Ranvier. C is a single computed slice through the volume. D is a surface reconstruction of the various components comprising the node. Scale bar in B = 1 μm; in C = 0.5 μm. Right: Registration of a data set with the Smart Atlas. The user draws a polygon representing the location of a data set, in this case a filled Purkinje neuron. The user specifies the database containing this data, then enters an annotation and selects a concept from the UMLS or some other ontology. The concept ID is stored in the database.

A screenshot of the Smart Atlas tool is shown on the right of Figure 12.7. It is based on a geographic mapping tool [35] and allows users to define polygons on a series of 2D vector images and annotate them with names, relationships, and concept IDs from an ontology such as UMLS. This tool provides another kind of glue map (in addition to domain and process maps). First, a brain atlas such as that by Paxinos and Watson [36] is translated into a spatial format, such as Scalable Vector Graphics (SVG). The user then marks up the atlas using the Smart Atlas tool (e.g., with concept names from UMLS). Once the atlas has been (partially) marked up, it can be queried from the same browser: Clicking on any point in the atlas will return the stereotaxic coordinates; clicking on a brain region will return the name of that region, along with any synonyms, and highlight all planes containing that structure. The Smart Atlas can now be used to register a researcher's data to a specific spatial location. This also links the registered data automatically to the UMLS ontology by virtue of the earlier semantic markup of spatial objects. To register source data, the user draws an arbitrary polygon representing the approximate data location on one of the atlas planes (Figure 12.7, right). The user is then presented with a form that can be used to add annotations or provide additional links to concepts of an ontology. Although the UMLS is used in the examples shown here, the user will eventually be able to use multiple ontologies, including those of their own creation, for semantically indexing data. Tools are also being developed to define new terms and relationships in existing ontologies. Another component of the system has been demonstrated and shows how spatial and conceptual information can be used together in a mediator system [37]; see also Martone et al.'s chapter in Neuroscience Databases [38] for further details on the use of the Smart Atlas.

FIGURE 12.8

Process maps with elaborations and abstractions.
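The registration workflow described above can be illustrated with a minimal Python sketch. It is not the Smart Atlas implementation: the class names, the `register` method, the vertex-average placement rule, and the concept ID `C-0001` are all hypothetical and chosen only to show how a drawn polygon might inherit a concept ID from an earlier markup of the atlas.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A marked-up atlas polygon tagged with an ontology concept (hypothetical model)."""
    plane: int
    polygon: list            # [(x, y), ...] vertices in atlas coordinates
    concept_id: str          # placeholder ID, not a real UMLS identifier
    name: str
    datasets: list = field(default_factory=list)


def contains(polygon, x, y):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class Atlas:
    def __init__(self):
        self.regions = []

    def mark_up(self, region):
        """Semantic markup step: store a polygon together with its concept."""
        self.regions.append(region)

    def query_point(self, plane, x, y):
        """Clicking on a point: return the marked-up regions that contain it."""
        return [r for r in self.regions
                if r.plane == plane and contains(r.polygon, x, y)]

    def register(self, plane, polygon, dataset):
        """Registration step: link a data set to the regions under the drawn polygon.

        Assumption: the drawn polygon is located by the average of its vertices.
        """
        cx = sum(p[0] for p in polygon) / len(polygon)
        cy = sum(p[1] for p in polygon) / len(polygon)
        hits = self.query_point(plane, cx, cy)
        for region in hits:
            region.datasets.append(dataset)
        return [region.concept_id for region in hits]


atlas = Atlas()
atlas.mark_up(Region(0, [(0, 0), (10, 0), (10, 10), (0, 10)], "C-0001", "cerebellum"))
linked = atlas.register(0, [(2, 2), (4, 2), (3, 4)], "filled_purkinje_neuron")
```

In this sketch the registered data set picks up its concept ID indirectly, through the region that was marked up earlier, which mirrors the idea that spatial registration also yields a link into the ontology.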

12.6 RELATED WORK AND CONCLUSION

12.6.1 Related Work


Significant progress has been made in the general area of data mediation in recent years, and several prototype mediator architectures have been designed by projects like TSIMMIS [39], SIMS [40], Information Manifold [41], Garlic [42], and MIX [43]. While these approaches focus mostly on structural and schema aspects, the problem of semantic mediation has also been addressed: In the DIKE system [44], the focus is on automatic extraction of mappings between semantically analogous elements from different schemas. A global schema is defined in terms of a conceptual model (SDR network), in which the nodes represent concepts and the (directed) edge labels represent their semantic distances, and a score called semantic relevance measures the number of instances of the target node that are also instances of the source node. The correspondence between objects is defined in terms of synonymies, homonymies, and sub-source similarities, defined by finding maximal matching between the two graphs.

ODB-Tools [45] is a system developed on top of the MOMIS [46] system for modeling and reasoning about the common knowledge between two to-be-integrated schemas. They present the object-oriented language ODLI3, derived from a description logic (OCDL). The language allows a user to create complex objects with finite nesting of values, union and intersection types, integrity constraints, and quantified paths. These constructs are used to define a class in one schema as a generalization, aggregation, or equivalent with respect to another; subsumption of a class by another can be inferred. An integrated schema is obtained by clustering schema elements that are close to one another in terms of an affinity metric.

Calvanese et al. [47] perform semantic information integration using a local-as-view approach by expressing the conceptual schema by a description logic language called DLR and subsequently defining non-recursive Datalog views to express source data elements in terms of the conceptual model. The language DLR represents concepts, C, relations, R, and a set of assertions of the form C1 ⊆ C2 or R1 ⊆ R2, where R1, R2 are DLR relations with the same arity. Mediation is accomplished by defining reconciliation correspondences, or specifications that a query rewriter uses to match a conceptual-level term to data in different sources.

Recently Peim et al. [48] have proposed an extension to the well-known TAMBIS system [49]. Their approach is similar to ours [18, 50] in that a logic-based ontology (in their case the ALCQI description logic) interfaces with an object-wrapped source. While F-logic [28] is used here as the internal knowledge representation and query language, their work focuses on how a query on the ontology is transformed to monoid comprehensions for semantic query optimization.
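To make the notion of a semantic relevance score concrete, the following Python sketch computes an overlap measure between the instance sets of two concept nodes. It is an illustration only and is not taken from DIKE: the function name is hypothetical, and normalizing the count to a fraction of the target's instances is an assumption made here so that scores are comparable across nodes.

```python
def semantic_relevance(source_instances, target_instances):
    """Fraction of the target node's instances that are also instances of the source node.

    Assumption: the overlap count is normalized by the size of the target set.
    """
    target = set(target_instances)
    if not target:
        return 0.0
    return len(target & set(source_instances)) / len(target)


# Hypothetical instance identifiers for two concept nodes.
score = semantic_relevance({"a", "b", "c"}, {"b", "c", "d", "e"})
```

A score near 1 would indicate that almost every instance of the target concept is also an instance of the source concept, while a score of 0 would indicate disjoint instance sets.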

12.6.2 Summary: Model-Based Mediation and Reason-Able Meta-Data

MBM was presented as a methodology that supports information integration of scientific data across complex, multiple-world scenarios as found in the neuroscience domain. In this framework, object-oriented models and conceptual models (CM), domain maps (DM), and process maps (PM) all provide means to capture more domain semantics and thus can act as glue knowledge sources to link hard-to-correlate sources. Mechanisms to contextualize source data formally were presented. The graph structures thus constructed have been shown to be useful for navigating across related concepts and querying local data during navigation [29].

Logic formalizations of DMs and PMs can be seen as "reason-able" or "executable" meta-data (see a paper by Horrocks [51]): Unlike conventional, descriptive meta-data, which are primarily used for data discovery, formal ontologies, such as DMs and PMs, can support much more versatile computational tasks in a mediator system, as illustrated in this chapter. For example, different and apparently unrelated data objects can be associated and retrieved together or even fused by the mediator's integrated view definition (IVD), because IVDs can be defined as deductive rules over DMs and PMs (Figure 12.3). In this way, in model-based mediation (MBM), logic rules play the role of executable or computational meta-data for scientific data integration. The latter is a challenging application and benchmark for combined database and knowledge representation techniques.
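The idea of an integrated view defined as a deductive rule over a domain map can be sketched in a few lines of Python. This is a simplified stand-in for the F-logic rules used in the chapter, not a reproduction of them: the domain map edges, the source names, and the object identifiers below are hypothetical, and the rule shown (an object is retrieved under a concept if it is anchored at that concept or at any concept reachable from it) is one assumed example of such a rule.

```python
# Hypothetical domain map: concept -> concepts directly below it
# (for example, via has-part or is-a edges).
DOMAIN_MAP = {
    "cerebellum": ["purkinje_cell"],
    "purkinje_cell": ["dendrite"],
    "dendrite": ["spine"],
}

# Hypothetical anchors: (source, data object, concept the object is registered to).
ANCHORS = [
    ("src_A", "image_17", "spine"),
    ("src_B", "protein_42", "purkinje_cell"),
]


def descendants(concept, domain_map):
    """All concepts reachable from `concept` by following domain map edges."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in domain_map.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def integrated_view(concept, domain_map, anchors):
    """Rule: retrieve every object anchored at `concept` or at one of its descendants."""
    scope = descendants(concept, domain_map) | {concept}
    return sorted((src, obj) for src, obj, c in anchors if c in scope)


purkinje_view = integrated_view("purkinje_cell", DOMAIN_MAP, ANCHORS)
```

Here two objects from different sources, which share no attribute values, are returned together only because the domain map relates their concepts. That is the sense in which the map acts as executable rather than merely descriptive meta-data.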

ACKNOWLEDGMENTS

This work has been supported by NIH/NCRR 3 P41 RR08605-08S1 (Biomedical Informatics Research Network [BIRN]) and NSF-NPACI Neurosciences Thrust ACI 9619020. The authors thank their colleagues and students involved in the BIRN project for their contributions, in particular, Xufei Qian, Edward Ross, Joshua Tran, and Ilya Zaslavsky.

REFERENCES
[1] Y. Papakonstantinou, A. Gupta, and L. M. Haas. "Capabilities-Based Query Rewriting in Mediator Systems." Distributed and Parallel Databases 6, no. 1 (1998): 73-110.

[2] C. Li, R. Yerneni, V. Vassalos, et al. "Capability Based Mediation in TSIMMIS." In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 564-566. 1998.

[3] National Partnership for Computational Infrastructure (NPACI): Neuroscience Thrust Area, 2001. http://www.npaci.edu/Thrusts/Neuro/.

[4] Biomedical Informatics Research Network Coordinating Center (BIRN-CC). University of California, San Diego. http://nbirn.net/, 2001.

[5] V. Kashyap and A. Sheth. "Semantic and Schematic Similarities Between Database Objects: A Context-Based Approach." VLDB Journal 5, no. 4 (1996): 276-304.


[6] D. Calvanese, G. D. Giacomo, M. Lenzerini, et al. "Description Logic Framework for Information Integration." In Proceedings of the Sixth International Conference on Principles of Knowledge Representation and Reasoning (KR'98), 2-13. Morgan Kaufmann, 1998.

[7] O. Bozdagi, W. Shan, H. Tanaka, et al. "Increasing Numbers of Synaptic Puncta During Late-Phase LTP: N-Cadherin is Synthesized, Recruited to Synaptic Sites, and Required for Potentiation." Neuron 28, no. 1 (2000): 245-259.

[8] J. Kasahara, K. Fukunaga, and E. Miyamoto. "Activation of Calcium/Calmodulin-Dependent Protein Kinase IV in Long Term Potentiation in the Rat Hippocampal CA1 Region." Journal of Biological Chemistry 276, no. 26 (2001): 24044-24050.

[9] A. Gupta, B. Ludäscher, and M. E. Martone. "An Extensible Model-Based Mediator System with Domain Maps." In Demonstration Session of the 21st International Conference on Data Engineering (ICDE), Heidelberg, Germany, 2001.

[10] S. Chakravarthy, J. Grant, and J. Minker. "Logic-Based Approach to Semantic Query Optimization." ACM Transactions on Database Systems (TODS) 15, no. 2 (1990): 162-207.

[11] B. Ludäscher, Y. Papakonstantinou, and P. Velikhov. "Navigation-Driven Evaluation of Virtual Mediated Views." In Proceedings of the International Conference on Extending Database Technology (EDBT), Lecture Notes in Computer Science 1777, 150-165. Springer, 2000.

[12] Y. Papakonstantinou and V. Vassalos. "The Enosys Markets Data Integration Platform: Lessons from the Trenches." In International Conference on Information and Knowledge Management (CIKM), 2001.

[13] A. Gupta, B. Ludäscher, and M. E. Martone. "Registering Scientific Information Sources for Semantic Mediation." In 21st International Conference on Conceptual Modeling (ER). Lecture Notes in Computer Science 2503. Springer, 2002.

[14] M. E. Martone, A. Gupta, M. Wong, et al. "A Cell-Centered Database for Electron Tomographic Data." Journal of Structural Biology 138 (2002): 145-155. http://ncmir.ucsd.edu/CCDB/.

[15] A. Artale, E. Franconi, N. Guarino, et al. "Part-Whole Relations in Object-Centered Systems: An Overview." Data and Knowledge Engineering 20 (1996): 347-383.

[16] P. Mitra, G. Wiederhold, and M. L. Kersten. "A Graph-Oriented Model for Articulation of Ontology Interdependencies." In Proceedings of the International Conference on Extending Database Technology (EDBT), 86-100. 2000.

[17] I. R. Horrocks. "Using an Expressive Description Logic: FaCT or Fiction?" In KR'98: Principles of Knowledge Representation and Reasoning, edited by A. G. Cohn, L. Schubert, and S. C. Shapiro, 636-645. San Francisco: Morgan Kaufmann, 1998.


[18] B. Ludäscher, A. Gupta, and M. E. Martone. "Model-Based Mediation with Domain Maps." In Proceedings of the 17th International Conference on Data Engineering (ICDE). New York: IEEE Computer Society, 2001.

[19] D. Calvanese, G. D. Giacomo, M. Lenzerini, et al. "Description Logic Framework for Information Integration." In Principles of Knowledge Representation and Reasoning, 2-13. 1998.

[20] B. Ludäscher, A. Gupta, and M. E. Martone. "Model-Based Mediation with Domain Maps." In Proceedings of the 17th International Conference on Data Engineering (ICDE), 81-90. New York: IEEE Computer Society, 2001.

[21] M. Kifer, G. Lausen, and J. Wu. "Logical Foundations of Object-Oriented and Frame-Based Languages." Journal of the ACM 42, no. 4 (July 1995): 741-843.

[22] FLORA homepage. http://www.cs.sunysb.edu/~sbprolog/flora/.

[23] FLORID homepage. http://www.informatik.uni-freiberg.de/~dbis/florid/.

[24] B. Ludäscher, A. Gupta, and M. E. Martone. "Model-Based Information Integration in a Neuroscience Mediator System." In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), 639-642. San Francisco: Morgan Kaufmann, 2000.

[25] R. Himmeröder, G. Lausen, B. Ludäscher, et al. "FLORID: A DOOD-System for Querying the Web." In Demonstration Session at EDBT. Valencia, Spain, 1998.

[26] B. Ludäscher, R. Himmeröder, G. Lausen, et al. "Managing Semistructured Data with FLORID: A Deductive Object-Oriented Perspective." Information Systems 23, no. 8 (1998): 589-613.

[27] G. Lausen, B. Ludäscher, and W. May. "On Active Deductive Databases: The Statelog Approach." In Transactions and Change in Logic Databases, Lecture Notes in Computer Science 1472, 69-106, edited by B. Freitag, H. Decker, M. Kifer, et al. Springer, 1998.

[28] M. Kifer, G. Lausen, and J. Wu. "Logical Foundations of Object-Oriented and Frame-Based Languages." Journal of the ACM 42, no. 4 (July 1995): 741-843.

[29] X. Qian, B. Ludäscher, M. E. Martone, et al. "Navigating Virtual Information Sources with Know-ME." In EDBT, Lecture Notes in Computer Science 2287, 739-741, 2002.

[30] B. Ludäscher, A. Gupta, and M. E. Martone. "Model-Based Information Integration in a Neuroscience Mediator System." In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB), 639-642. San Francisco: Morgan Kaufmann, 2000.

[31] G. Yang and M. Kifer. "FLORA: Implementing an Efficient DOOD System Using a Tabling Logic Engine." In Sixth International Conference on Rules and Objects in Databases (DOOD), 2002.

[32] C. Draxler. A Powerful Prolog to SQL Compiler, technical report. München, Germany: Centre for Information and Language Processing, Ludwigs-Maximillians-Universität München, 1992.

[33] B. Ludäscher. "Mediator Query Processing with Prolog Technology," technical note, BIRN-DI-TN-2002-01. Biomedical Informatics Research Network, 2002.

[34] K. F. Sagonas, T. Swift, and D. S. Warren. "XSB as an Efficient Deductive Database Engine." In Proceedings of the ACM International Conference on Management of Data (SIGMOD), 442-453. 1994.

[35] I. Zaslavsky. "A New Technology for Interactive Online Mapping." Cartographic Perspectives 37 (2000): 65-77.

[36] G. Paxinos and C. Watson. The Rat Brain in Stereotaxic Coordinates. San Diego: Academic Press, 1998.

[37] A. Gupta, B. Ludäscher, M. E. Martone, et al. "A System for Managing Alternate Models in Model-Based Mediation." In Advances in Databases, 19th British National Conference on Databases (BNCOD 19), Lecture Notes in Computer Science 2405. Springer, 2002.

[38] M. E. Martone, A. Gupta, B. Ludäscher, et al. "Federation of Brain Data through Knowledge-Guided Mediation." In Neuroscience Databases: A Practical Guide, edited by R. Kötter, 275-292. Boston: Kluwer Academic Publishers, 2002.

[39] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, et al. "The TSIMMIS Approach to Mediation: Data Models and Languages" (Extended Abstract). In Next Generation Information Technologies and Systems. 1995.

[40] C. A. Knoblock, S. Minton, J. L. Ambite, et al. "Modeling Web Sources for Information Integration." In Proceedings of the Fifteenth National Conference on Artificial Intelligence. 1998.

[41] A. Y. Levy, A. Rajaraman, and J. J. Ordille. "Querying Heterogeneous Information Sources Using Source Descriptions." In Proceedings of the International Conference on Very Large Data Bases (VLDB), 251-262. 1996.

[42] L. M. Haas, D. Kossmann, E. L. Wimmers, et al. "Optimizing Queries Across Diverse Data Sources." In Proceedings of the International Conference on Very Large Data Bases (VLDB), 276-285. 1997.

[43] C. Baru, A. Gupta, B. Ludäscher, et al. "XML-Based Information Mediation with MIX." In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1999), 597-599. Philadelphia: ACM Press, 1999.

[44] L. Palopoli, G. Terracina, and D. Ursino. "The System DIKE: Towards the Semi-Automatic Synthesis of Cooperative Information Systems and Data Warehouses." In Proceedings of the ADBIS-DASFAA Symposium, 108-117. 2000.


[45] D. Beneventano and S. Bergamaschi. "Extensional Knowledge for Semantic Query Optimization in a Mediator-Based System." In International Workshop on Foundations of Models for Information Integration (FMII-2001), 2001.

[46] S. Bergamaschi, S. Castano, and M. Vincini. "Semantic Integration of Semistructured and Structured Data Sources." SIGMOD Record 28, no. 1 (1999): 54-59.

[47] D. Calvanese, S. Castano, F. Guerra, et al. "Towards a Comprehensive Methodological Framework for Semantic Integration of Heterogeneous Data Sources." In Eighth International Workshop on Knowledge Representation Meets Databases (KRDB), 2001.

[48] M. Peim, E. Franconi, N. Paton, et al. "Query Processing with Description Logic Ontologies Over Object-Wrapped Databases." In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM). 2002.

[49] C. Goble, R. Stevens, G. Ng, et al. "Transparent Access to Multiple Bioinformatics Information Sources." IBM Systems Journal 40, no. 2 (2001): 534-551.

[50] A. Gupta, B. Ludäscher, and M. E. Martone. "Knowledge-Based Integration of Neuroscience Data Sources." In Proceedings of the Twelfth International Conference on Scientific and Statistical Database Management (SSDBM), 39-52. IEEE Computer Society, 2000.

[51] I. Horrocks. "DAML+OIL: A Reason-Able Web Ontology Language." In Proceedings of the International Conference on Extending Database Technology (EDBT), 2-13. 2002.

CHAPTER 13

Compared Evaluation of Scientific Data Management Systems

Zoé Lacroix and Terence Critchlow

The variety of biological information systems currently available raises the inevitable question: Which system best meets my needs? To decide which system to choose, or if a custom system is required, a detailed analysis of user needs should be performed. Carefully performing this analysis will identify the best options and clarify the buy-or-build decision.

This chapter outlines several techniques and metrics that can be considered when performing an evaluation. Section 13.1 begins this discussion by defining evaluation techniques. Section 13.2 presents a set of evaluation criteria in detail and describes how they may be applied. Finally, Section 13.3 discusses some of the explicit tradeoffs that can be made and the effects they have on the overall systems. Unfortunately, neither evaluation of the systems described in previous chapters nor comparisons between them are provided because these activities can only be performed within the context of specific user requirements.

13.1 PERFORMANCE MODEL


Whether users are selecting a new system, evaluating an existing system, or determining what requirements a system must meet, they need a performance model. The performance model is used to evaluate the system's ability to meet user requirements and to provide the basis for comparing systems beyond this starting point. The model is composed of a set of specifications and associated metrics that can be used to evaluate a system. Ideally, it identifies the target environment in which the system will actually be deployed and reflects the relative importance of all the system features within that environment. Because of this tight coupling between a model and its environment, the model cannot be directly applied to other environments.


The first and most important step in defining a model is to establish the minimal set of specifications required for a system to be considered of interest. This could be as simple as using previously defined use cases or system requirements, discussed in Section 1.4.1, as the system specification, or it could be as complex as performing a new, detailed evaluation that augments these initial requirements with priorities and a ranking of all desired functionality. This is the most important step because it immediately removes from consideration those systems that do not meet all of the requirements and provides the basis on which all of the systems will be compared.

The list of specifications should be as complete and detailed as possible, as two significantly different systems may agree on a small set of specifications while differing on other characteristics. The more complete the specifications, the fewer the number of systems that can meet them and the more likely the solution will meet the users' expectations. For example, in the context of the design of a vehicle, if the specification is to transport a person within a town, two possible designs are a car and a bicycle. However, should the specification also include that the vehicle be able to carry heavy objects, only the car satisfies the requirements.

Once the specifications have been identified, they can be translated into a collection of characteristics or metrics that define the areas where the system is going to be evaluated. These criteria can be represented as an evaluation matrix, as outlined in Sections 13.1.1 and 13.2.3. While explicit metrics are useful, evaluating a system must include defining the relative value of appropriate resources or capabilities. Section 13.1.2 outlines several cost models that can be used to compare systems. Finally, Sections 13.1.3 and 13.1.4 present techniques to collect the measurements.
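The filtering role of the minimal specification set can be illustrated with a short sketch based on the vehicle example above. The class name, function names, and capability labels are hypothetical and chosen only for illustration. The sketch assumes that each candidate can be described by a flat set of capabilities, which is a simplification of real requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A candidate design, described by the capabilities it offers (hypothetical model)."""
    name: str
    capabilities: frozenset


def meets_specifications(candidate, required):
    """A candidate stays under consideration only if it offers every required capability."""
    return required <= candidate.capabilities


def shortlist(candidates, required):
    """Return the names of the candidates that satisfy the whole specification set."""
    return [c.name for c in candidates if meets_specifications(c, required)]


car = Candidate("car", frozenset({"transport_person_in_town", "carry_heavy_objects"}))
bicycle = Candidate("bicycle", frozenset({"transport_person_in_town"}))

# The initial specification, then the same specification with one added requirement.
basic = frozenset({"transport_person_in_town"})
extended = basic | {"carry_heavy_objects"}

print(shortlist([car, bicycle], basic))     # both designs qualify
print(shortlist([car, bicycle], extended))  # only the car qualifies
```

Adding a single requirement shrinks the shortlist, which is why a more complete specification leaves fewer systems to compare in detail.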

13.1.1 Evaluation Matrix


To measure the performance of a system, two feature sets need to be determined: (1) the perspectives from which the evaluation will occur and (2) the parameters to be measured. Typically, the parameters are derived from the system specifications previously collected. Two obvious perspectives are the developers' perspective and the users' perspective. Each perspective encompasses most, if not all, the evaluation parameters. A matrix representation formed with the perspectives along the x axis and the performance parameters forming the y axis can be used to summarize the evaluation, as illustrated in Section 13.2.3.
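One possible in-memory form of such a matrix is sketched below, with parameters as rows and perspectives as columns. The parameter names are the six metrics listed in Section 13.2. The function names, the numeric scores, and their scale are assumptions made for illustration and do not come from the chapter.

```python
PERSPECTIVES = ["developer", "user"]  # x axis
PARAMETERS = [                        # y axis
    "efficiency", "extensibility", "functionality",
    "scalability", "understandability", "usability",
]


def empty_matrix(parameters, perspectives):
    """Create a matrix with one row per parameter and one cell per perspective."""
    return {p: {v: None for v in perspectives} for p in parameters}


def record(matrix, parameter, perspective, score):
    """Store a score, rejecting parameters or perspectives that are not in the matrix."""
    if parameter not in matrix or perspective not in matrix[parameter]:
        raise KeyError(f"unknown cell: ({parameter}, {perspective})")
    matrix[parameter][perspective] = score


def render(matrix, perspectives):
    """Format the matrix as plain text; cells not yet evaluated are shown as '-'."""
    lines = [f"{'parameter':<18}" + "".join(f"{v:>12}" for v in perspectives)]
    for parameter, row in matrix.items():
        cells = ["-" if row[v] is None else str(row[v]) for v in perspectives]
        lines.append(f"{parameter:<18}" + "".join(f"{c:>12}" for c in cells))
    return "\n".join(lines)


matrix = empty_matrix(PARAMETERS, PERSPECTIVES)
record(matrix, "efficiency", "developer", 3)  # hypothetical score
record(matrix, "efficiency", "user", 4)       # hypothetical score
print(render(matrix, PERSPECTIVES))
```

Leaving unevaluated cells empty rather than defaulting them to zero keeps a missing measurement from being read as a poor result.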

13.1.2 Cost Model

There are two common cost measurements used to evaluate systems: time and space. Time can be considered in various ways and at various granularities.


The user response time is the total time needed to answer a query. This time can be split between the actual transaction time (time needed to process the query) and the transmission time (time needed to display the result to the user). For integrated systems, transaction time may be split further: Additional transmissions to several systems may be included, as well as the processing time of each of these systems. While usually less important, pre-processing time should also be considered to ensure the viability of the system. For example, it is important that pre-processing delays do not prevent data from being integrated into the system in a timely manner. The cost function considered for DiscoveryLink is based on time as presented in Section 11.2.3.

Unlike standard models of computation in which uniform access to all data is assumed,1 models for database management systems must take into consideration the notion of space, or the actual location of the data. In particular, as most of the data do not fit into main memory, they are typically kept in secondary storage (e.g., on a magnetic disk). For applications that require the storage of very large data sets, as is often the case for scientific applications, tertiary storage devices such as tape drives are also used. The cost of moving between storage levels is high because data access and writing times take significantly longer. Ultimately, however, the appropriate data need to be in main memory to perform any manipulation, such as returning the results of a query. In database management systems, a buffer manager partitions available main memory into buffers, which are regions into which disk blocks can be transferred. All system components that need access to the data interact with the buffer manager.

Space management is tightly coupled to time management. Measuring the space required for the system tasks is a way to evaluate the performance of the system. Efficient space management may significantly decrease the amount of time required to perform required tasks. Space management can be improved by algorithms that minimize data exchanges between tertiary and secondary storage, limit the number of disk accesses, and better exploit data cached in the buffer. In distributed systems, increasing the local storage available for use may decrease the response time because it provides improved caching or eliminates the need for network communication or complex calculations that consume more time than accessing local storage. This complex relationship between time and space highlights the importance of understanding the value of different resources and the tradeoffs made by each system being evaluated.

Of course, there are other ways to measure the cost of a system. One obvious example is the monetary cost to purchase, license, or use it. A less obvious example is the correctness of results provided by the system, as compared to some ground truth. These costs can be compared against time, space, and each other to define other tradeoffs. For example, correctness and time can be traded off against each other in a system in which a quick response returns a subset (or superset) of the actual results, and a complete answer is only returned if the query is allowed to run without restriction. Ultimately, the best cost metrics for a given situation depend on which resources are most valuable in that environment.

1. These models are often called random-access models (RAM).
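The decomposition of response time described in this section can be written down as a small sketch. All class, field, and function names are hypothetical. The sketch assumes that remote sources are contacted one after another, so their times add up; a system that contacts sources in parallel would need a different formula. The weighted combination of time and space is only one simple way to express which resource is more valuable in a given environment.

```python
from dataclasses import dataclass


@dataclass
class SourceCost:
    """Cost contributed by one integrated source (hypothetical model)."""
    processing_s: float    # time the source spends processing its part of the query
    transmission_s: float  # time spent exchanging the request and the result


@dataclass
class QueryCost:
    local_processing_s: float      # processing done by the integrating system itself
    display_transmission_s: float  # time needed to display the result to the user
    sources: tuple = ()            # per-source costs for integrated systems

    def transaction_time(self):
        """Time to process the query, including every source (assumed sequential)."""
        remote = sum(s.processing_s + s.transmission_s for s in self.sources)
        return self.local_processing_s + remote

    def response_time(self):
        """Total time needed to answer the query as seen by the user."""
        return self.transaction_time() + self.display_transmission_s


def weighted_cost(time_s, space_bytes, time_weight, space_weight):
    """Combine time and space into one figure; the weights encode their relative value."""
    return time_weight * time_s + space_weight * space_bytes


# Hypothetical measurements for a query that touches two remote sources.
query = QueryCost(
    local_processing_s=0.5,
    display_transmission_s=0.5,
    sources=(SourceCost(1.0, 0.25), SourceCost(0.5, 0.25)),
)
print(query.transaction_time(), query.response_time())
```

Keeping the per-source terms separate makes it possible to see whether a slow answer comes from processing or from transmission.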

13.1.3 Benchmarks


A software benchmark is a collection of programs used to generate measurements for evaluation of some capability, usually efficiency. Benchmarks guide users in the selection of a system with respect to specific performance considerations. They also offer a good quality assurance test for software developers. The development of benchmarks is driven either by an application domain (domain-specific benchmark) or by the objective to evaluate a general type of system (generic benchmark). For example, the SEQUOIA 2000 storage benchmark [1] is a domain-specific benchmark designed to evaluate earth science applications, whereas OO7 [2] and XOO7 [3] are generic benchmarks developed to evaluate object-oriented database systems and extensible markup language (XML)-management systems, respectively. While generic benchmarks provide a gauge of performance across a wide range of application domains, the domain-specific benchmarks are able to capture peculiarities of the domain that may not be common in more general applications. Thus, if an appropriate domain-specific benchmark exists, it usually will be preferred.

The selection of a benchmark first requires the analysis of the characteristics the benchmark is expected to evaluate. For example, OO7 aims to capture the ability of an object-oriented database system to perform pointer traversals, updates, and query processing. Once these characteristics are specified, a data set (schema and instance) and queries are designed. The design of the data set needs to be complex enough to reflect the characteristics of the real data. The structure (schema), including data types, and the size of the data set are the two features that play a role. The data itself is often randomly generated. Typically, queries are grouped in collections that capture the different characteristics. For example, the queries of XOO7 are grouped in three collections: data-driven queries, document-driven queries, and navigational queries.

Benchmarking data management systems involves the evaluation of various tasks such as query execution and transactions (updates). In such contexts, a benchmark consists of a schema, a set of data corresponding to that schema, and a set of queries. The evaluation of a data management system thus follows these steps: (1) define the data structure (schema) in the system, (2) load the data, and (3) run each of the queries, recording measurements corresponding to the cost model as defined in Section 13.1.2 (typically the time and the space needed to execute the query). The comparison of two systems, therefore, consists in comparing the measurements collected when running the benchmark on each system. The analysis of the collected measurements often requires knowledge of internal implementation decisions, which are typically not available. For more information on benchmarking, refer to Gray's The Benchmark Handbook [4].

The design of a benchmark for a biological information system is a difficult task. For a user to be satisfied, the benchmark must capture the tasks expected to be performed by the system. Because mediation systems integrate a variety of resources, including Web resources, it is challenging to define a single benchmark that can accurately evaluate an entire system. An alternative is to use a collection of generic benchmarks and combine the results. This has the advantage of being able to leverage existing work, such as relational database and XML benchmarks, where appropriate.
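The three benchmarking steps described above (define the schema, load the data, run the queries while recording measurements) can be sketched with a minimal harness. This is an illustration only and not one of the benchmarks cited in this section. It assumes an in-memory SQLite database as the system under test, a hypothetical `sequence` table filled with randomly generated rows, wall-clock time as the time measure, and the database page count multiplied by the page size as a rough space measure.

```python
import random
import sqlite3
import time


def run_benchmark(schema_sql, insert_sql, rows, queries):
    """Run the three steps against an in-memory SQLite database and return measurements."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_sql)      # step 1: define the schema
    conn.executemany(insert_sql, rows)  # step 2: load the data
    conn.commit()

    # Rough space measure: number of database pages times the page size.
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]

    measurements = {}
    for name, sql in queries.items():   # step 3: run and time each query
        start = time.perf_counter()
        fetched = conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        measurements[name] = {"seconds": elapsed, "rows": len(fetched)}

    conn.close()
    return {"bytes": page_count * page_size, "queries": measurements}


# Randomly generated instance for a hypothetical schema; the seed makes runs repeatable.
rng = random.Random(42)
rows = [(i, rng.choice(["human", "mouse", "yeast"]), rng.randint(100, 5000))
        for i in range(1000)]

report = run_benchmark(
    "CREATE TABLE sequence (id INTEGER PRIMARY KEY, organism TEXT, length INTEGER);",
    "INSERT INTO sequence VALUES (?, ?, ?)",
    rows,
    {
        "count_all": "SELECT COUNT(*) FROM sequence",
        "long_by_organism": (
            "SELECT organism, COUNT(*) FROM sequence "
            "WHERE length > 2500 GROUP BY organism"
        ),
    },
)
for name, measured in report["queries"].items():
    print(name, measured["rows"], f"{measured['seconds']:.6f}s")
```

Comparing two systems then amounts to running the same schema, data, and queries against each and comparing the two reports. Timings from a single run are noisy, so repeated runs would be needed before drawing conclusions.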

13.1.4 User Survey


Human-computer interaction (HCI) is the discipline concerned with the design, evaluation, and implementation of interactive computing systems for human use and the major phenomena surrounding them [5]. Thus, most HCI research focuses on the design and development of user interfaces. A survey is a common approach used by the HCI community for evaluating interfaces. These surveys obtain feedback from a large number of users on a variety of characteristics and allow researchers to identify the strengths and weaknesses of an interface. Their use of surveys to collect requirements and evaluate satisfaction can be exploited to design and evaluate an entire system. Unfortunately, despite previous efforts, there is no commonly accepted characterization of the various tasks that life scientists perform with data management systems. As there is no standard functionality common to bioinformatics applications, each system must be evaluated by a survey designed around the needs of a specific set of users. The evaluation of such a system is further complicated because the expectations of life scientists for computer support evolve significantly over time.

In addition to evaluating individual systems, user surveys can be used to identify and influence trends across an entire scientific domain. For example, consider a survey performed among 31 biologists at Arizona State University, in which 26 agreed that luck was involved in biological discovery, and all agreed on the importance of creativity [6]. Despite being a fairly small survey, it was able to highlight the need for systems that encourage, not discourage, creative exploration of the data. Another classic survey is the set of unanswerable queries published in the Department of Energy (DOE) report on Genome Informatics [7] and listed in Figure 8.1. These queries, the result of input from a large number of geneticists, have driven significant improvement in biological data management by motivating the shift from human processing of data to machine processing and manipulation of data. A more recent survey to identify and classify tasks in bioinformatics was conducted by interviewing 30 active biologists from academia and industry [8]. Out of the 315 identified tasks, 54% were similarity search, multiple pattern and functional motif search, and sequence retrieval. This emphasizes the need for biological interfaces to combine traditional query languages, such as structured query language (SQL), with searching and analysis tools. Finally, surveys can provide insight into user satisfaction with a system. An example of this type of survey is the Baylor College of Medicine (BCM) Search Launcher User Survey sponsored by the DOE.2

13.2 EVALUATION CRITERIA


A consistent set of metrics is needed to compare systems effectively. These metrics should distinguish between the approaches and identify their relative strengths and weaknesses. There is a wide variety of metrics that can be used, each with its advantages and disadvantages. The following six characteristics can be applied using both the computer science (implementation) and the life science (user) perspectives to evaluate genomics data management systems. The metrics are efficiency, extensibility, functionality, scalability, understandability, and usability. These metrics cover a wide range of issues of practical concern. Unfortunately, their definitions are vague, and they can be applied with various degrees of rigor. While this makes them difficult to apply consistently, it also provides the flexibility required to differentiate between alternative approaches and a variety of environments. Each user team may refine and specify these criteria with appropriate benchmarks to measure the characteristics of interest with respect to a customized cost model or user survey, as introduced in Section 13.1. Typically, cost models will be used to evaluate implementation performance and user surveys will be used to measure users' satisfaction.

The implementation perspective captures the characteristics of the system from the technical point of view. Much of this perspective is driven ultimately by the user requirements. However, it reflects only one of many possible implementations satisfying these requirements. While both views are helpful in understanding a system, and there is significant overlap between them, the true success of a system is determined by whether or not its users are satisfied. Thus, we believe the user perspective is ultimately the more important.
2. The BCM Search Launcher User Survey is available at http://searchlauncher.bcm.tmc.edu/user_survey/user_survey.html.


13.2.1 The Implementation Perspective


The implementation perspective aims to fulfill the user's expectations, subject to organizational constraints, by optimizing the six metrics subject to the goals and constraints they define.

Efficiency

Implementation efficiency is a combination of query efficiency, data storage size, communication overhead, and data integration overhead. Query efficiency reflects the ability of the system to respond to user queries and reflects factors such as the correlation between the data format and the expected queries. Data storage size can be affected by choices such as using flat files or a relational database and replicating data locally or accessing them remotely. Communication overhead is characterized by data transfer requirements as well as the complexity and frequency of commands executed remotely. The data integration overhead is defined by the complexity of the transformations being performed between the data sources and the user interface.

Each of these four characteristics can be divided between the efficiency during a pre-processing step and the efficiency in response to a query. Often there is a tradeoff between characteristics when deciding at which time to perform a task. For example, the transformation of a data element may be performed as a pre-processing step in a data warehouse, where it is converted prior to being loaded into the warehouse, or at run time in a federated database, where it is converted on the fly in response to a query. This decision may have a dramatic impact on the system's required storage. Although the overall efficiency of the system is important, many systems will seek to reduce the query response time because pre-processing time can be amortized over all of the queries.

Efficiency is a criterion that clearly distinguishes the systems presented in this book. Not surprisingly, mediation systems developed in industry such as discoveryHub, the commercial version of Kleisli (see Chapter 6), or DiscoveryLink (see Chapter 11) usually perform better than the academic ones. This can be explained by the industry's need to provide robust and efficient systems. Simply stated, optimizations are typically designed and developed for industrial systems, and they appear to be less of a priority for academic systems. In general, efficient query processing is not a requirement for the presented academic systems. Some architectural choices lead to systems that are more efficient in certain ways. Simpler data integration platforms, such as link-driven federations, offer efficient query processing because they do not perform data conversion or complex data manipulation. In addition, partially materialized approaches with indices, or

378

1 3 13

Compared l uation of ComparedEva Evaluation of Scientific Scientific Data Data Management Management Systems Systems
'" ,'" 7'-cj, , ".W""" _,",_"""""____ , "-"" wft""h" 0' _ >Y " = _ " ""-'_ =_"__

completely completely materialized materialized approaches approaches in in data data warehouses, warehouses, are are often often more more efficient efficient than explained in 3 .3 . 1 . Thus, than non-materialized non-materialized ones-as onesmas explained in Section Section 1 13.3.1. Thus, link-driven link-driven fed federations such (see Chapter ) often erations such as as Entrez Entrez or or SRS SRS (see Chapter 5 5) often rely rely on on large, large, pre-computed pre-computed indices to processing: SRS indices to optimize optimize query query processing: SRS queries queries usually usually take take less less than than one one minute minute to to complete. complete. Kleisli automatically Kleisli automatically optimizes optimizes queries, queries, and and its its space space management management is is generally generally more files. However, more economical economical than than flat flat files. However, it it has has translation translation time time overheads overheads when when mapping integrated sources mapping data data between between integrated sources and and the the system. system. While While such such translation translation times may may be be reduced, reduced, they they are are inevitable inevitable in in a a mediation mediation architecture. architecture. Discov Discovtimes eryLink performs two two optimization optimization steps: steps: query query rewriting rewriting followed followed by by cost-based cost-based eryLink performs optimization. Query optimization. Query rewriting rewriting transforms transforms the the user's user's query query into into a a semantically semantically equivalent equivalent query query (i.e., (i.e., a a query query that that will will return return the the same same output) output) for for which which more more efficient plans are possible. Such efficient execution execution plans are possible. Such techniques techniques were were described described in in Chap Chapter 4. 4. 
Cost-based Cost-based optimization optimization exploits exploits a a broad broad range range of of alternative alternative execution execution ter strategies, taking input strategies, taking input from from the the wrappers wrappers and and assessing assessing the the cost cost of of functions, functions, scans, [9, 10]. scans, and and general general sub-queries sub-queries performed performed at at the the integrated integrated source source [9, 10]. Such Such methods methods were were introduced introduced in in Section Section 4.4. 4.4.
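
The tradeoff between converting data once at load time and converting them on the fly for each query can be made concrete with a small cost model. The sketch below is illustrative only: the function names and the cost figures are assumptions chosen for the example, not measurements of any system discussed in this chapter, and storage size is deliberately left out of the model.

```python
import math


def warehouse_cost(n_queries, preprocess_cost, lookup_cost):
    """Total cost when data are converted once, before being loaded."""
    return preprocess_cost + n_queries * lookup_cost


def federated_cost(n_queries, lookup_cost, convert_cost):
    """Total cost when data are converted on the fly for every query."""
    return n_queries * (lookup_cost + convert_cost)


def break_even_queries(preprocess_cost, convert_cost):
    """Smallest number of queries at which pre-processing is no more expensive.

    preprocess + n * lookup <= n * (lookup + convert)
    is equivalent to n >= preprocess / convert.
    """
    return math.ceil(preprocess_cost / convert_cost)


# Hypothetical costs in arbitrary time units.
n = break_even_queries(preprocess_cost=600, convert_cost=2)
print(n, warehouse_cost(n, 600, 1), federated_cost(n, 1, 2))
```

Under these assumed figures the one-time conversion is amortized once enough queries have been issued; before that point the federated approach is cheaper overall, although each individual query still takes longer.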

Extensibility

Extensibility refers to the ease with which the functionality of the system can be increased. New requirements can result in the need for a new query (in the simplest view) or new tools and types of data. Examples of these types of extensions include the addition of a new similarity search query, the inclusion of expression array data in a system that previously contained only sequence data, and the integration of a new clustering tool. The evaluation of extensibility is complex because most systems are extensible only in certain ways. For example, a given system may simplify adding new tools that use the data already in the system, but integrating new types of data to that same system may be very challenging. While all of these extensions can, in theory, be made to any system, the actual implementation effort varies greatly depending on the system design. This characteristic attempts to quantify the effort required for the system being evaluated.

Many systems presented in this book rely on a mediation architecture that provides a virtual view to users while keeping the data in each integrated source. These approaches typically use wrappers to access and retrieve data from the integrated sources. Extending the system to new applications usually means developing new wrappers and registering them with the system. Systems that exploit meta-information about integrated sources require additional information. Thus, typically these systems are a little less extensible but also more efficient. A system with a semantically consistent view, such as TAMBIS, appears to be less extensible than others because its semantic model must be extended to include all new concepts and relationships in the appropriate way so that queries may be asked against the new data. Within TAMBIS, this implies annotating the sources and services model (see Section 7.4.1). Extensibility may also be more affected by materialized approaches. Partial materialization through indices may be costly to extend, and totally materialized approaches may rely on a management system that is difficult to extend.
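
The idea of extending a mediation system by writing a wrapper and registering it can be sketched in a few lines. Everything in the example below is hypothetical: the `Mediator` class, its `register` and `query` methods, and the two in-memory sources are assumptions made for illustration and do not correspond to the interface of any system presented in this book.

```python
from typing import Callable, Dict, Iterable, List

# A wrapper is modeled as a function from a keyword to matching records.
Wrapper = Callable[[str], Iterable[dict]]


class Mediator:
    """Toy mediator: data stay in the sources; wrappers fetch them on demand."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, Wrapper] = {}

    def register(self, source_name: str, wrapper: Wrapper) -> None:
        """Extend the system by registering a wrapper for a new source."""
        self._wrappers[source_name] = wrapper

    def query(self, keyword: str) -> List[dict]:
        """Send the keyword to every registered wrapper and merge the results."""
        results = []
        for source_name, wrapper in self._wrappers.items():
            for record in wrapper(keyword):
                results.append({"source": source_name, **record})
        return results


def sequence_wrapper(keyword: str) -> Iterable[dict]:
    # Hypothetical in-memory data standing in for a remote sequence source.
    data = [{"id": "S1", "desc": "kinase"}, {"id": "S2", "desc": "protease"}]
    return [r for r in data if keyword in r["desc"]]


def expression_wrapper(keyword: str) -> Iterable[dict]:
    # Hypothetical in-memory data standing in for an expression array source.
    data = [{"id": "E7", "desc": "kinase expression profile"}]
    return [r for r in data if keyword in r["desc"]]


mediator = Mediator()
mediator.register("sequences", sequence_wrapper)
mediator.register("expression", expression_wrapper)  # the extension step
print(mediator.query("kinase"))
```

In this simplified setting, adding a source costs one function and one registration call. A system that also maintains a semantic model or meta-information about its sources would need further steps that the sketch does not attempt to show.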

Functionality

Functionality reflects the system's ability to perform a wide variety of analysis over the data contained within it. This includes both the types of queries that the system supports and their complexity. For example, a simple keyword search reflects a fairly low level of functionality, a system that supports keyword searches using wildcards across a subset of the attributes would rate better, and a system that augments keyword searches with sequence homology comparisons and data clustering would have much greater functionality.

Functionality of presented systems often depends on their usage. Industrial systems are widely used; therefore, more functionality is provided or made available. Alternatively, academic systems are designed for specific usage in a limited context and thus often provide fewer capabilities.
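
The first two levels of functionality mentioned above, a plain keyword search and a wildcard search restricted to chosen attributes, can be contrasted in a short sketch. The records, attribute names, and function names are invented for the example; the wildcard matching relies on the standard `fnmatch` module.

```python
from fnmatch import fnmatchcase

# Hypothetical records; the attribute names are assumptions for illustration.
RECORDS = [
    {"id": "P1", "name": "tyrosine kinase", "organism": "human"},
    {"id": "P2", "name": "serine protease", "organism": "mouse"},
    {"id": "P3", "name": "kinase inhibitor", "organism": "human"},
]


def keyword_search(records, keyword):
    """Lowest level: exact word match against every attribute."""
    return [r["id"] for r in records
            if any(keyword in str(value).split() for value in r.values())]


def wildcard_search(records, pattern, attributes):
    """Next level: shell-style wildcards, limited to the chosen attributes."""
    return [r["id"] for r in records
            if any(fnmatchcase(str(r.get(a, "")), pattern) for a in attributes)]


print(keyword_search(RECORDS, "kinase"))
print(wildcard_search(RECORDS, "*ase", ["name"]))
```

The second function lets the user express questions the first cannot, such as "names ending in -ase", which is the sense in which it would rate better on this criterion.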

Scalability

Three basic components comprise scalability: the amount of data the system can handle, the number of users (queries) the system can simultaneously support, and the number of data sources that can be integrated. The amount of data a system can handle is not only limited by available disk space (a limit that is becoming less significant as disks become both larger and cheaper) but also by its ability to effectively manipulate that data. For example, a federated system that dynamically retrieves data from external sources may be designed to respond to queries without needing a disk-based representation. However, if a query returned a large amount of data, it may overload the infrastructure, either because of network bandwidth problems or an inability to hold the results in memory. Similarly, systems need to be designed for multiple users if they are to function beyond a single scientist's desktop. A system may be designed to work for a single user, making the assumption that there would only be one query executing at a time to simplify resource allocation. While this assumption holds, there would not be a problem, but if the system is deployed in a multi-user environment, this will quickly become a limiting factor. Even if a system is designed for concurrent use, there are often real limits on the number of users it can support. These restrictions become important as the system extends its user base from an individual to a lab and eventually to the larger community. Finally, there are several factors affecting the number of data sources that can be integrated realistically into a system, including scarcity of local resources, communication bandwidth, query response times, and the incremental effort required to integrate and maintain each additional data source. These factors are conflicting, and finding an acceptable balance requires understanding your scientists' needs.

The relative importance of each of these three components varies with the underlying approach the system uses and the environment in which the system is expected to be used. For example, values appropriate for a single-user, desktop system in a small company would be very different from those for a large company supporting a community resource. Serious consideration to the number of sources that can be integrated must be given when evaluating systems because scientists constantly want to access more data.

Again, not surprisingly, systems designed and developed in an industrial context appear to scale better than academic ones. The main reason is that scalability is often not part of the requirements for academic systems, whereas it is typically mandatory in an industrial context. Kleisli regularly handles hundreds of megabytes and more than 60 types of integrated data sources. However, each invocation of Kleisli only runs one top-level query at a time through the system, generating multiple concurrent sub-queries. In contrast, TAMBIS only provides access to five data sources and offers little explicit support for simultaneous queries and users.
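
The risk that a large query result cannot be held in memory is commonly reduced by processing the result as a stream rather than as one list. The sketch below shows that general technique under stated assumptions: `fetch_records` is a hypothetical stand-in for a remote source, and none of the names refer to a system described in this chapter.

```python
from itertools import islice


def fetch_records(n):
    """Hypothetical remote source that yields records one at a time."""
    for i in range(n):
        yield {"id": i}


def batched(records, size):
    """Group a record stream into lists of at most `size` items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def count_in_batches(records, size):
    """Process a large result in bounded memory, one batch at a time."""
    total, largest = 0, 0
    for batch in batched(records, size):
        total += len(batch)
        largest = max(largest, len(batch))
    return total, largest


print(count_in_batches(fetch_records(10), size=4))
```

Only one batch is resident at any moment, so the memory needed depends on the batch size and not on the size of the full result. Network bandwidth and concurrent users remain separate limits that this sketch does not address.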

Understandability

Understandability expresses the clarity of the system design. If the system is well designed (e.g., using strong software engineering practices and object-oriented or component-based techniques), as new developers join the project they are able to understand easily the implementation details and quickly begin contributing to the project. If the system is poorly designed and overly complex, people not involved in the original design of the system require additional time before they are able to make significant contributions to the project. In addition, their contributions are likely to further complicate the architecture because identifying the most effective way to implement a specific feature is difficult without a solid understanding of the overall architecture. Unfortunately, without intimate access to implementation details, this characteristic is very hard to determine.

All presented systems offer a clear and understandable design: link-based federation architecture (see Chapter 5), mediation architecture combined with wrappers (see Chapters 6, 7, 8, 11, and 12), or warehousing (see Chapter 10). These designs are described in detail in their respective sections and enable new developers to understand their overall architecture quickly.

Usability

Implementation usability reflects the ability of the user to modify the system behavior and the exposure of underlying system capabilities through some type of application or user interface. Of particular interest is the ability to adapt system capabilities based on specific user needs. For example, while a system may be built on a relational database, it may not provide users full SQL query access, choosing instead to limit the allowable queries by publishing only a simple keyword search interface. There is more to usability, however, than simply expanded capability. While providing a wide variety of complex queries is generally valuable, as we discuss in the next section, it is important to ensure that they are presented in a useful way. It is easy to overwhelm users by providing too many options, forcing people to become experts in your tool before they are able to accomplish anything.

Usability is often not a specification for systems developed in academic contexts. TAMBIS (see Chapter 7) does not provide an API, and its limited extensibility may further restrict this ability. KIND provides an API only for certain features. In contrast, Kleisli provides access to its high-level query languages (CPL and sSQL), an API to SMLNJ function calls, and the Pizzkell suite of JDBC-like interfaces (CPL2Perl and CPL2Java) [11]. SRS requires programming skills in the Icarus language to modify the system. It is likely that the next generation of systems, including future versions of existing ones, will focus on including new interfaces to facilitate their maintenance. For example, a graphical administration tool for SRS that would obviate the use of Icarus is currently under development.
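
The situation described above, a relational database whose users see only a keyword search, can be illustrated with the standard `sqlite3` module. The `protein` table, its columns, and both function names are assumptions made for this sketch and are not taken from any of the presented systems.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical `protein` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (id TEXT, name TEXT, organism TEXT)")
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [("P1", "tyrosine kinase", "human"),
         ("P2", "serine protease", "mouse"),
         ("P3", "kinase inhibitor", "human")],
    )
    return conn


def keyword_search(conn, keyword):
    """The only query published to users: a substring match on `name`."""
    # The keyword is a bound parameter, so it cannot change the SQL statement.
    # Note that % and _ inside the keyword still act as LIKE wildcards.
    rows = conn.execute(
        "SELECT id FROM protein WHERE name LIKE '%' || ? || '%' ORDER BY id",
        (keyword,),
    )
    return [row[0] for row in rows]


conn = build_demo_db()
print(keyword_search(conn, "kinase"))
```

The database underneath could answer a question such as "all human proteins", but a user restricted to `keyword_search` cannot ask it. This is the gap between system capability and exposed capability that the usability criterion measures.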

Summary

The implementation perspective can be expressed faithfully with the six metrics. However, it is difficult to optimize all of them concurrently. Indeed, they are far from being independent, with some metrics being positively correlated and others being negatively correlated. For example, the use of a query language such as SQL may improve both extensibility and efficiency. Extensibility could be improved because the system would be able to express a wider variety of queries, and the efficiency could be improved by its ability to perform query optimization. Similarly, a poorly designed system with minimal understandability will be less extensible. On the other hand, a system that is highly functional, with many analysis tools available, is likely to be inefficient by combining these various functionalities. Every system must find the appropriate set of tradeoffs to balance the needs and expectations of its target users. One of the goals of this book is to aid potential developers and users in identifying their needs and choosing the best approach and system for them.

13.2.2 The User Perspective

The user perspective defines the requirements of the system to be developed or chosen. The metrics presented here can be used to identify the users' expectations and needs and evaluate systems accordingly.

Efficiency

From the user's point of view, efficiency is evaluated as the ability of the system to perform a single task in a satisfactory timeframe and its overall ability to support its user base in the succession of their complex tasks. This metric has two components. The first is similar to the implementation perspective but at a slightly higher level: How quickly does the system respond to queries? In effect, this view summarizes implementation efficiency from a purely practical perspective. If the system provides a reasonable response time, the efficiency is deemed acceptable. The faster the response time, the better the ranking. The second is a consolidation of the remaining categories: How effectively can users ask the system the questions they need answers to, get the answers, and continue using those answers in their analysis? Given that the second component is reflected in the remaining characteristics, only the first definition will be considered.

Extensibility

Extensibility expresses the users' ability to ask new questions and customize the system to meet their specific needs. While plug-and-play search and analysis tools are a long way away, a well-designed system often allows the user to extend it in limited ways without becoming intimately familiar with the underlying system. These extensions may be simple variations on previous queries, such as changing the attributes being searched during a keyword search, they may be slightly more complicated variations on a query, such as changing the sequence homology algorithm used to perform a search, or they may be a completely new type of query. Extensibility is not limited to queries; it also reflects the ability of the user to introduce new data, and new data types, into the system. For example, it reflects the user's ability to include data from a new data source. As with implementation extensibility, many systems theoretically can be extended by their users. This characteristic reflects how much effort a user must expend to add new types of queries or data and how much programming skill is required to perform this extension. In a truly extensible system, only minimal programming expertise would be required.

The presented approaches assume that users are not actively involved in system extensions. However, they typically provide users as much flexibility as possible in customizing their queries. For example, when performing a similarity search, the various program parameters and the choice of application version are made available. User extensibility typically is linked to user functionality, and all presented systems perform well for both criteria.
Functionality

User functionality reflects not only the different types of queries that can be asked, but also how those queries can be changed and combined to form new queries. For example, a BLAST search that accepts a protein sequence is useful, but being able to BLAST both protein and nucleotide sequences is better. And the ability to take the results of a BLAST search and use them as input into other BLAST searches, or send them to an analysis program, is even better still. When querying a system, scientists have in mind a specific question they are trying to answer, which may or may not be part of a much larger question. The more of the question that can be answered within a single environment, the more useful that environment is. To completely answer a question, a system must have both the correct types of data and the right capabilities. The more questions a system can answer, the greater its functionality.
All presented systems are generic in that they are not designed to answer a single query or even a small set of queries in a particular context. Instead, they provide query languages that allow the formulation of a variety of queries. The advantages of generic systems are presented in Chapter 4. In addition, some systems such as DiscoveryLink and Kleisli provide access to multiple sources and analytical applications. This variety of integrated resources significantly increases users' functionality.
Scalability

User scalability has some components shared with system scalability: Can the system handle the number of users that it has, and does it provide access to enough data and tools to make it worth using? However, it also has a unique component: the ability of the system to handle large numbers of input objects. This is becoming an increasingly important issue as genomics moves from small-scale science, in which a researcher may focus on a single gene, to large-scale analysis of entire genomes. If a system is only capable of processing a single input at a time, it will not be useful to users who need to analyze hundreds or thousands of these objects. For example, Web-based homology searches often take only a single sequence at a time as input. As a result, these interfaces quickly become of limited use if a scientist has hundreds of clones to analyze. Unless scalability is part of the system requirements, academic systems typically do not scale well.
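The difference between a single-input interface and one that scales to many input objects can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: `parse_fasta` and `run_batch` are hypothetical helper names, and `search_one` stands for whatever single-sequence search a system exposes (it is not the API of any system discussed in this chapter).

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from multi-record FASTA text."""
    identifier: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first token after '>', if there is one.
            tokens = line[1:].split()
            identifier = tokens[0] if tokens else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def run_batch(fasta_text: str, search_one: Callable[[str], object]) -> Dict[str, object]:
    """Apply a single-sequence search (hypothetical) to every record in a file."""
    return {identifier: search_one(sequence)
            for identifier, sequence in parse_fasta(fasta_text)}
```

With a wrapper of this kind, a scientist with hundreds of clones submits one file instead of pasting sequences into a form one at a time; whether the underlying search can sustain that load remains a separate system scalability question.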


Understandability

Although a system cannot be considered usable without also being understandable, we separate the concepts for evaluation purposes. For this discussion, a system is understandable if its users understand not only the queries being asked, but also the results being returned. This means that the semantics of both the interface and the data should be clear and well documented so that when a question is asked, it is known exactly what that question was. This sounds obvious, but it is often overlooked. Even for simple queries such as keyword searches, the semantics may not be clear. Questions such as "Is the text of the entire system searched for that keyword or are only certain objects or attributes searched?" must be answered. Failure to understand the semantics of a query may lead to asking the wrong question or misinterpreting the results. Furthermore, it may appear that the semantics of a system is always well defined. Unfortunately, this is not always the case.
Many systems do not precisely define the data they contain and instead rely on the user's domain expertise to guide them. This situation is even worse in integrated systems in which returned data have been obtained from a variety of data sources. In these systems, there may be many subtle semantic inconsistencies. Even in systems that claim to provide consistent views of the data, the precise semantics associated with some aspects of the data may be elusive. These types of implicit semantics dramatically reduce the understandability of the results. Most presented systems were designed to be understandable to their scientific users. Some systems put a creative emphasis on this criterion and provide original solutions. TAMBIS focused on a transparent access to data through an ontology reflecting the scientists' view of the data. KIND returns outputs in the context of a domain map, a graphical representation of ontological knowledge of the scientific domain.
In contrast, other systems, such as Kleisli and DiscoveryLink, rely on the data organization provided by the integrated sources, assuming that this organization is appropriately known and understood by users.
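The keyword-search question raised above (is all text searched, or only certain attributes?) can be illustrated with a minimal sketch. The records, attribute names, and the `keyword_search` function are all hypothetical and chosen only to show how the same keyword yields different answers under the two semantics.

```python
from typing import Dict, Iterable, List, Optional

Record = Dict[str, str]

# Hypothetical records; the identifiers and annotations are invented for illustration.
RECORDS: List[Record] = [
    {"id": "G1", "name": "kinase A", "comment": "binds calcium"},
    {"id": "G2", "name": "calcium channel", "comment": "membrane protein"},
]


def keyword_search(records: Iterable[Record], keyword: str,
                   attributes: Optional[List[str]] = None) -> List[str]:
    """Return ids of records containing the keyword.

    With attributes=None every attribute is searched; otherwise only the
    listed attributes are searched.
    """
    needle = keyword.lower()
    hits: List[str] = []
    for record in records:
        fields = list(record.keys()) if attributes is None else attributes
        if any(needle in record.get(field, "").lower() for field in fields):
            hits.append(record["id"])
    return hits


everywhere = keyword_search(RECORDS, "calcium")                      # both records match
names_only = keyword_search(RECORDS, "calcium", attributes=["name"])  # only G2 matches
```

A user who assumes the first behavior while the interface implements the second would silently miss G1, which is exactly the kind of misinterpretation that undocumented semantics invites.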
Usability

Usability is probably the most important feature of a system and one of the most difficult to obtain. While focusing on facets of the other characteristics, it is remarkably easy to develop a system that, despite providing all of the required efficiency, extensibility, functionality, and scalability, is unusable by its target audience. This occurs because programmers often design a system for themselves, forgetting that their users are not programmers and have no desire to become programmers. An ideal system provides an intuitive query interface, directly supporting only the queries that need to be executed and returning the results in the most useful format. In reality, the number of scientists and types of queries a system needs to support makes this impossible. One approach to increasing the usability of a system is to provide multiple interfaces targeting different user groups. For example, a graphical interface can help novice users comfortably interact with a system, but it may be too slow for experts. To increase its usability, a system could either add shortcut keys or have a separate command line interface for experts. While requiring additional development effort, this would allow users familiar with the system to perform their queries quickly, while not forcing novices to learn the more advanced interface immediately.

13.3 TRADEOFFS
This section explores some of the tradeoffs to be considered when evaluating systems and some of the unique characteristics of biological data management systems that complicate their design and evaluation. As with the evaluation metrics, there is no clearly best approach, but rather the user requirements and system constraints need to be included in the evaluation. The purpose of the following sections is simply to call out certain characteristics and encourage the readers to consider them. This is meant to be an illustrative, not exhaustive, list. There are many tradeoffs and considerations that are not discussed but that may be important for evaluating a system within a specific environment.

13.3.1 Materialized vs. Non-Materialized


Materialized approaches usually are faster than non-materialized ones for query execution. This makes intuitive sense because the data is stored in a single location and in a format supportive of the queries. To confirm intuition, tests were run in 1995 with several implementations of the query: "Retrieve the HUGO names, accession numbers, and amino acid sequences of all known human genes mapped to chromosome c" [12]. These tests were performed using the Genomic Unified Schema (GUS) warehouse as the materialized source and the K2/Kleisli^3 system as the non-materialized source. The query requires integrating data from the Genome DataBase (GDB), the Genome Sequence DataBase (GSDB), and GenBank. Measures showed that for all implementations, the warehouse is significantly faster. In certain cases, queries executed by K2 as part of this evaluation failed to complete due to network timeouts. The expression of the query (using semi-joins rather than nested loop iterations) also affected the performance of the execution of the query.

3. The version of K2 used for this comparison is much earlier than the version presented in Chapter 8.

In addition to the communication overhead, the middleware between the user interface and the remote data may introduce computational overhead. Recently, tests have been performed at IBM to determine whether or not a middleware approach such as DiscoveryLink (presented in Chapter 11) affects the access costs when interacting with a single database. They conducted two series of tests in which DiscoveryLink was compared to a production database at Aventis [13]. The results show that, in the tested context, for a single user, the middleware did not affect the performance. None of the tested queries involved the manipulation of large amounts of data; however, they presented many sub-queries and unions. In some cases, accessing the database through a middleware and a wrapper was even faster than the direct access to the database system. The load test shows that both configurations scale well, and the response times for both approaches are comparable to the single-user case.
There are a variety of factors to be considered beyond the execution cost. Materialized databases are generally more secure because queries can be performed entirely behind a firewall. Non-materialized approaches have the advantage that they always return the most up-to-date information available from the sources, which can be important in a highly dynamic environment. They also require significantly less disk space and can be easier to maintain (particularly if the system does not resolve semantic conflicts).
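To see why the expression of a query (semi-joins rather than nested loop iterations) matters in a non-materialized setting, consider the following sketch. It is not the K2/Kleisli implementation; `RemoteSource` is a hypothetical stand-in for a remote database that merely counts simulated network round trips, and the gene names and accession keys are invented.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Gene = Tuple[str, str]  # (gene name, accession key); illustrative only


class RemoteSource:
    """Hypothetical remote database that counts simulated round trips."""

    def __init__(self, rows: Dict[str, str]) -> None:
        self._rows = rows  # accession key -> sequence
        self.round_trips = 0

    def fetch_one(self, key: str) -> Optional[str]:
        self.round_trips += 1  # one request per key
        return self._rows.get(key)

    def fetch_many(self, keys: Iterable[str]) -> Dict[str, str]:
        self.round_trips += 1  # one request for the whole key set
        return {key: self._rows[key] for key in keys if key in self._rows}


def nested_loop_join(genes: List[Gene], source: RemoteSource) -> List[Tuple[str, str, str]]:
    """Issue one remote request per gene."""
    result = []
    for name, accession in genes:
        sequence = source.fetch_one(accession)
        if sequence is not None:
            result.append((name, accession, sequence))
    return result


def semi_join(genes: List[Gene], source: RemoteSource) -> List[Tuple[str, str, str]]:
    """Ship the set of join keys once, then join the matches locally."""
    wanted = {accession for _, accession in genes}
    found = source.fetch_many(wanted)
    return [(name, accession, found[accession])
            for name, accession in genes if accession in found]
```

Both functions return the same rows, but the nested loop pays one round trip per gene while the semi-join pays one in total. When each round trip crosses a wide-area network, that difference is a plausible contributor to slow executions and timeouts of the kind described above.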

13.3.2 Data Distribution and Heterogeneity


Many systems presented in the previous chapters are mediation systems. Mediation systems integrate fully autonomous, distributed, heterogeneous data sources such as various database systems (relational, object-relational, object, XML, etc.) and flat files. In general, the performance characteristics of distributed database systems are not well understood [14]. There are not enough distributed database applications to provide a framework for evaluation and comparison. In addition, the performance models of distributed database systems are not sufficiently developed, and it is not clear that the existing benchmarks to test the performance of transaction processing applications in pure database contexts can be used to measure the performance of distributed transaction management. Furthermore, because the resources are not always databases, the mediation approach is more complex than the multi-database and other distributed database architectures typically studied in computer science.
For many bioinformatics systems, issues related to data distribution and heterogeneity are considerable and significantly affect the performance. As a result, they typically integrate only the minimal number of sources required to perform a given task, even when additional information could be useful. The complexity of this domain and the lack of objective information favor domain-specific evaluation approaches over generic ones for this characteristic.

13.3.3 Semi-Structured Data vs. Fully Structured Data
Previous chapters have pointed out that scientific data are usually complex, and their structures can be fluid. For these reasons, a system relying on a semi-structured framework rather than a fully structured approach, such as a relational database, seems more adequate. Although there are systems that utilize meta-level capabilities within relational databases to develop and maintain flexibility, they are usually not scalable enough to meet the demands of modern genomics. The success of XML as a self-describing data representation language for electronic information interchange makes it a good candidate for biological data representation. The design of a generic benchmark for evaluating XML management systems is a non-trivial task in general, and it becomes much more challenging when combined with data management and performance issues inherent to genomics. Some attempts have been made to design an XML generic benchmark.
Three XML generic benchmarks limited to locally stored data and in a single machine or single user environment have been designed: XOO7 [3], XMach-1 [15], and XMark [16]. XOO7 attempts to harness the similarities in data models of XML and object-oriented approaches. The XMach-1 benchmark [15] is a multi-user benchmark designed for business-to-business applications, which assumes the data size is rather small (1 to 14 KB). XMark [16] is a newer benchmark for XML data stores. It consists of an application scenario that models an Internet auction site and 20 XQuery queries designed to cover the essentials of XML query processing. XOO7 appears to be the most comprehensive benchmark. Both XMark and XMach-1 focus on a data-centric usage of XML. All three benchmarks provide queries to test relational model characteristics such as selection, projection, and reduction.
Properties such as transaction processing, view manipulation, aggregation, and update are not yet tested by any of the benchmarks. XMach-1 covers delete and insert operations, although the semantics of such operations are not yet clearly defined for the XML query model. Additional information about XML benchmarks can be found in Bressan et al.'s XML Management System Benchmarks [17].
Native XML systems have been compared to XML-enabled systems (relational systems that provide an XML interface that allows users to view and query their data in XML) with three collections of queries: data-driven, document-driven, and navigational queries [18]. Tests confirm that XML-enabled management systems perform better than XML native systems for data-driven queries. However, XML native systems outperform XML-enabled ones on document-driven and navigational queries. This is not unexpected because enabled systems are tuned to optimize the execution of relational queries. However, they do not efficiently represent nested or linked data. Thus, navigational queries within enabled systems are rather slow, whereas native systems are able to exploit the concise representation of data in XML. Finally, document queries may use the implicit order of elements within the XML file. This ordering is not typically represented in relational databases; therefore, defining an appropriate representation is a tedious task and negatively affects performance.
The type of system that is most appropriate depends heavily on the types of queries expected, the data being integrated, and the tools with which the system must interact.
Scientific queries exploit all characteristics of XML queries: data, navigation, and document. An XML biological information system will need to perform well in all these contexts. An XML biological benchmark will be needed to evaluate XML biological information systems.
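The remark that the implicit order of XML elements is not typically represented in relational databases can be illustrated with a small sketch. The code below "shreds" an XML document into flat rows using Python's standard `xml.etree.ElementTree` module; the row layout (node id, parent id, position, tag, text) and the function names are assumptions made for illustration, not the scheme of any XML-enabled system evaluated in [18].

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# (node_id, parent_id, position among siblings, tag, text)
Row = Tuple[int, Optional[int], int, str, str]


def shred(xml_text: str) -> List[Row]:
    """Flatten an XML document into relational-style rows."""
    rows: List[Row] = []

    def visit(element: ET.Element, parent_id: Optional[int], position: int) -> None:
        node_id = len(rows)
        text = (element.text or "").strip()
        rows.append((node_id, parent_id, position, element.tag, text))
        for index, child in enumerate(element):
            visit(child, node_id, index)

    visit(ET.fromstring(xml_text), None, 0)
    return rows


def children_in_order(rows: List[Row], parent_id: int) -> List[str]:
    """Recover the document order of a node's children from the position column."""
    children = [row for row in rows if row[1] == parent_id]
    return [row[4] for row in sorted(children, key=lambda row: row[2])]
```

A relational table is an unordered set of rows, so the explicit position column is the only thing that preserves the original sequence of children. Every document-order query then needs an extra sort, and navigational queries need repeated self-joins on the parent column, which is consistent with the slower performance reported for enabled systems on those query classes.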

13.3.4 Text Retrieval


For many tasks, scientists access their data through a document-based interface. Indeed, a large amount of the data consists of textual annotations. Life scientists extensively use search engines to access data and navigation to explore the data. Unlike database approaches, structured models cannot be used to represent a document or many queries over document sets (e.g., given a document, find other documents that are similar to it). The evaluation of a textual retrieval engine typically relies on the notion of relevance of a document. A document is relevant if it satisfies the query. The notion of relevance is subjective because retrieval engines typically provide users with a limited query language consisting of Boolean expressions of keywords or phrases (strings of characters). In such a context, the query often does not express the user's intent, and thus, the notion of relevance is used to capture the level of satisfaction of the user rather than the validation of the query.
Relevance Relevance is is considered considered to to have have two two components: components: recall recall and and precision. precision. the Recall is is the the ratio ratio of of the the number number of of relevant relevant documents documents retrieved retrieved by by the the engine engine Recall to the the total total number number of of relevant relevant documents documents in in the the entire entire data data set. set. A A recall recall equal equal to to one one means means all all relevant documents were retrieved, whereas a recall recall of of zero zero to relevant documents were retrieved, whereas a means no no relevant relevant document document was was retrieved. retrieved. A A recall recall of of one one does does not not guarantee guarantee means the satisfaction satisfaction of of the the user; user; indeed, indeed, the the engine engine may may have have retrieved retrieved numerous numerous nonnon the relevant relevant documents documents (noise). (noise). Precision is is the the ratio ratio of the number number of relevant documents documents retrieved retrieved by the Precision of the of relevant by the engine to to the the total total number number of of retrieved retrieved documents documents and and thus thus reflects reflects the the noise noise in in engine the response. response. A A precision precision equal equal to to one one means means all all retrieved retrieved documents documents are are relevant, relevant, the


whereas a precision of zero means no retrieved document is relevant. Ideally, a retrieval engine would achieve both a precision and a recall of one, returning exactly the set of documents desired. Unfortunately, state-of-the-art text query engines are far from that ideal. Currently, recall and precision are inversely related in most systems, and a balance is sought to obtain the best overall performance while not being overly restrictive.
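The two ratios defined above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall`, the document identifiers, and the choice to return 0.0 when a denominator is empty are assumptions made for the example, not part of any system described in this book.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the answer to one query.

    retrieved: document ids returned by the engine
    relevant:  document ids judged relevant in the entire data set
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents are relevant, the engine returns five.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = ["d1", "d2", "d3", "d7", "d9"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this hypothetical run, three of the five retrieved documents are relevant and three of the four relevant documents are found, so the response contains noise (d7, d9) and also misses one relevant document (d4).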

13.3.5 Integrating Applications


System requirements usually include the ability to use sophisticated applications to access and analyze scientific data. The more applications that are available, the better functionality the system has. However, integrating applications such as BLAST may significantly affect the system performance in unanticipated and unpredictable ways. For example, a call to blastp against a moderate size data set will return a result within seconds, whereas a call to tblastn against a large data set may require hours. The evaluation of the performance of the overall integration approach must include information about the stand-alone performance of the integrated resources. This information, including the context in which optimal performance can be obtained, is often poorly documented. This is partially because many of the useful analysis tools are developed in academic contexts where little effort is made to characterize and advertise their performance.
Readers who are involved in tool development are invited to better characterize the performance of these tools so that systems can integrate them more effectively.
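As a minimal sketch of what characterizing stand-alone performance might look like, the following Python fragment times repeated calls to a tool wrapper and reports the spread. The helper `time_call` and the stand-in function `fake_similarity_search` are hypothetical; the stand-in merely burns CPU time and does not invoke BLAST or any other real tool.

```python
import statistics
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func several times and summarize wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


def fake_similarity_search(n):
    """Hypothetical stand-in for an integrated tool; not a real BLAST call."""
    return sum(i * i for i in range(n))


# Measure the same tool at one input size; repeat for other sizes as needed.
report = time_call(fake_similarity_search, 100_000, repeats=3)
print(report)
```

Recording such measurements for several input sizes, together with the hardware and data set used, would give an integrator the context that the paragraph above notes is often missing.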

13.4 SUMMARY
Each of the systems described in this book was designed to address specific user needs, and these requirements led to vastly different approaches. These systems represent the wide spectrum of tradeoffs that may be made. Ideally, a table or other mechanism would summarize their characteristics with respect to the variety of parameters presented in Section 13.2 and would allow readers to quickly identify the system or approach that best meets their needs. Unfortunately, such a comparison is not possible without significantly more insight into, on one hand, the users' requirements and, on the other hand, the systems' implementation and feedback on user satisfaction. In particular, it would require familiarity with the environment in which the system was to be used, the users who would be working with it, the value of various resources in the environment, and how the system would be expected to interact with other tools.
Although it it would would be be possible possible to to invent invent example example users users and and evaluate evaluate some some of of the the systems systems with with respect respect to to them, them, this this would would involve involve vast vast simplifications simplifications and and would would be .be a a disservice disservice to to systems systems targeting targeting different different users. users. It It is is safe safe to to say, say, however, however, that that given given the the tradeoffs tradeoffs that that must must be be


made when developing a system, there is no approach that is obviously better than all the others. Instead, each user group could carefully analyze the specific requirements corresponding to their needs and use the approach presented in this chapter to select the approach and system that best meets them. When the requirements are identified, contacting the systems' designers and asking them how their approach performs in such a context will allow each user team to compile their own comparison matrix and select an appropriate approach and system. When performing this evaluation, it is important to consider all of the users' requirements because focusing on only a few could lead to the selection of a less desirable approach. For that purpose, the contact information for each of the presented systems is provided in the System Information section.
While evaluating systems using the metrics proposed in Section 13.2 is somewhat subjective, when applied consistently, they form a reasonable basis for identifying the strengths and weaknesses of disparate systems. In addition, their flexible definitions allow them to be refined as needed to obtain the proper level of detail with respect to a particular evaluation's requirements. For example, if efficiency is an important consideration, a more detailed evaluation could be performed, resulting in specific information about query efficiency, data storage size, communication overhead, and data integration overhead. Similarly, metrics can be combined if only a high-level overview of a system is desired. Finally, readers should feel free to introduce new metrics to capture other properties of systems if they determine them to be important to an evaluation. There is nothing sacred about this evaluation matrix; it can be refined and extended to meet one's needs.
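One way a user team might compile and combine such a comparison matrix is sketched below. Everything in it is assumed for illustration: the system names, the three metric names, the 0 to 5 scores, and the weights are invented, and a weighted average is only one of many possible ways to combine metrics.

```python
def rank_systems(scores, weights):
    """Rank systems by the weighted average of their per-metric scores."""
    total_weight = sum(weights.values())
    overall = {
        system: sum(weights[m] * metric_scores[m] for m in weights) / total_weight
        for system, metric_scores in scores.items()
    }
    # Highest combined score first.
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


# Hypothetical comparison matrix: scores from 0 (poor) to 5 (excellent).
scores = {
    "System A": {"efficiency": 4, "extensibility": 2, "usability": 3},
    "System B": {"efficiency": 2, "extensibility": 5, "usability": 4},
}

# Two hypothetical user groups that value the same metrics differently.
weights_speed = {"efficiency": 3, "extensibility": 1, "usability": 1}
weights_growth = {"efficiency": 1, "extensibility": 3, "usability": 1}

best_for_speed = rank_systems(scores, weights_speed)[0][0]
best_for_growth = rank_systems(scores, weights_growth)[0][0]
print(best_for_speed, best_for_growth)
```

With these invented numbers the two user groups arrive at different first choices from the same matrix, which illustrates why a single summary table could not serve every reader.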



Concluding Remarks


As the first book focusing on management systems for biological data, this material is a detailed introduction to the variety of problems and issues facing data integration and a presentation of numerous systems. The major issue these systems are trying to address is the large number of distributed, semantically disparate data sources that need to be combined into a useful and usable system for geneticists and biologists to perform their research. This issue is complicated by the variety of data formats, inconsistent semantics, and custom interfaces supported by these sources, as well as the highly dynamic nature of these characteristics and the data themselves. Ideally, a data integration system would provide consistent access to all of the data and tools needed by scientists. However, no single system meets this ideal for all users. This final section provides a brief summary and a peek into the future of bioinformatics.

SUMMARY

The introductory chapters establish a terminology shared by computer scientists and life scientists. They focus on the different steps in the design of a system and highlight the differences between the problems faced by those in bioinformatics and other facets of these respective disciplines. Upon first glance, these differences may seem insignificant, but understanding them is the first step in understanding the realities of the environment in which bioinformatics solutions must work. The desire to simplify this environment is common in people starting out in this domain, but overcoming it is critical to successfully addressing the problems being faced. Many of the challenges in bioinformatics are derived from the inherent complexity of the domain, and failure to embrace this results in approaches that, while acceptable in theory, are not workable in the real, complex world in which bioinformatics solutions must be applied.
Once a common background has been established, the following chapters present several bioinformatics systems that are currently in use. The wide variety of systems described in this book provides significant insight into the complexity of performing data integration in the rapidly changing domain of genomics. The fact that these systems are still evolving indicates that none of these approaches has


yet led to an ideal solution for all applications. This is a testament to the difficulty of creating a bioinformatics solution that addresses the needs of all users. Most of these systems evolved independently, and many began as attempts at addressing specific challenges facing scientists in a particular organization. The challenges a given solution focuses on are generally the most important problems facing the associated organization or its customers. While each system presented here has met its original goals, it has encountered new challenges as the scope of its usage evolved. As discussed in Chapter 13, evaluating a system requires detailed knowledge of the environment in which the system will be deployed. Part of the reason no single approach is clearly better than another is that the bioinformatics community places conflicting goals on systems.
As a simple example, notice that although providing a semantically consistent view of the data greatly improves the usability of the system, it also places practical limits on the number of data sources to which the system can provide access. This is because each data source provides its own unique semantics for the data it contains, and an expert is required to perform the mapping from these semantics to the global ones. However, the more sources to which a system provides access, the more valuable it is in general. As scalability and semantic consistency are mutually exclusive goals, a system can excel in only one of them, providing at best marginal performance in the other. Whether such a system is better than another depends on the users' values. This example illustrates only one of the tradeoffs bioinformatics systems strive to meet. Because of these conflicting constraints, it is currently impossible for a single system to provide the bioinformatics solution that meets every scientist's needs.
Although this is a discouraging realization, it is not a situation unique to bioinformatics. Indeed, it appears to be a characteristic of any rapidly evolving scientific domain, and as such, the techniques used by bioinformaticians are more generally applicable than typically thought.
LOOKING TOWARD THE FUTURE

As one becomes familiar with the problems facing bioinformatics and the approaches being pursued to address them, it is easy to become disenchanted. The problems are daunting, and there is no clear path that will lead to a unifying solution. Some issues, such as query optimization and data caching, are just now being investigated seriously in this context. Other issues appear as the result of applying existing technology in new ways and the development of new technology. Indeed, sometimes it feels as if we are moving in the wrong direction: As it becomes increasingly easy to distribute data via the Web, the number and heterogeneity of


data sources containing information relevant to scientists keep increasing. Unfortunately, a lack of community standards results in each source publishing its own distinct semantics and interfaces. The number of tools available to researchers, and their complexity, continue to increase without significant progress at making them interoperable. Multimedia data is becoming more common as genomics research continues to move onto computers and out of the wet lab, which causes problems for data integration systems that are expecting textual data. Large-scale data are also becoming more common as access to powerful computers and related infrastructures increases. This changes the value of bandwidth and requires rethinking many assumptions about the underlying data. Grid technology is emerging and will likely soon allow data and computation to be spread transparently among a large number of machines.
How this technology will be used is not entirely clear, but it will likely have a significant impact on computational biology. While each of these issues raises significant data integration and access challenges, they also provide new opportunities to solve existing bioinformatics problems and, in turn, to advance the state of genomics research. For example, grid technology may be able to minimize the impact of large data sets by moving the computation to the place where the data resides. Thus, there is still hope that we will achieve the goal of providing scientists with intuitive access to all the relevant data they need. One of the more promising emerging trends is an effort to define data semantics precisely through ontologies. A possible, although not necessarily probable, result of this effort is a single unifying ontology that is able to identify accurately the information contained in all data sources.
Having this global ontology would allow mappings between related concepts to be easily identified, and thus would greatly reduce the burden placed on integration systems. Unfortunately, this vision may take decades to be realized, if it happens at all. The major reason for this is that life science is an inherently complex domain, and there is a lot of information that is not yet understood. Thus, the ability to correctly define the semantics between these complex concepts is severely limited by this lack of comprehension. Because of this difficulty, the ontologies currently being developed are generally small and define semantic concepts only for a specific sub-community of life science. The creation and adoption of these smaller ontologies are likely to occur over the next few years. Although a less than ideal solution, these ontologies could be extremely useful to bioinformatics by reducing the number of semantic definitions that need to be integrated.
Integrating data from multiple resources also raises challenging issues related to data provenance, data ownership, data quality, privacy, and security, which will need to be addressed in the near future. Indeed, integrated data is often composed of several data items, each coming from a different resource. Tracking data


provenance is critical to scientific applications as it enables users to know where each data item comes from. This knowledge is relevant to data ownership and quality. For example, when exploiting data, it is important to give credit to the researcher who has generated or annotated the data. In addition, data provenance may affect the expected quality of the data (e.g., when the data are not curated or validated) and thus the way the data should be exploited. But if scientific integration systems evolve to track data provenance, they might also make it possible to reconstitute the original datasets, which raises privacy and security issues as scientific discovery will need to integrate more and more clinical data. Biological integration systems may have to comply with regulations such as the privacy provisions and the standards for the security of electronic health information of the U.S. federal law, the Health Insurance Portability and Accountability Act of 1996 (HIPAA).
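A minimal sketch of item-level provenance tracking, assuming a very simple record model, is given below. The class `DataItem`, its fields, the helper functions, and the source names are all hypothetical and are not drawn from any system presented in this book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """One value in an integrated record, tagged with where it came from."""
    name: str      # attribute name in the integrated record
    value: str     # the value itself
    source: str    # resource that supplied the value (hypothetical names below)
    curated: bool  # whether that resource curates or validates its entries


def sources_to_credit(record):
    """List every resource that contributed to the record."""
    return sorted({item.source for item in record.values()})


def uncurated_fields(record):
    """List attributes whose values come from non-curated resources."""
    return sorted(name for name, item in record.items() if not item.curated)


# Hypothetical integrated record assembled from two resources.
items = [
    DataItem("sequence", "MKT...", source="SourceA", curated=True),
    DataItem("function", "putative kinase", source="SourceB", curated=False),
]
record = {item.name: item for item in items}

print(sources_to_credit(record))
print(uncurated_fields(record))
```

Keeping the source with each item supports both attribution and a quality check, but, as noted above, the same tags could help reconstitute the original datasets, so access to them may itself need to be controlled.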
Which trends will continue and impact the bioinformatics community as a whole has yet to be seen. The only thing certain is that bioinformatics will continue to be an exciting and evolving discipline for years to come. As comprehensive as we have tried to be, this book provides only an introduction to the fascinating world of bioinformatics data integration. Furthermore, while the challenges outlined herein are daunting, addressing them is only the first step the evolving, multidisciplinary field of bioinformatics must take. Once these challenges have been overcome, there is still a huge amount of work to be done to use that information effectively to understand the mechanics of life. Despite the tremendous amount of work still to do, the path is fascinating and the rewards for successfully unraveling the mysteries of the genome are unparalleled.
We hope that this book has provided not just insight into the challenges currently being addressed in bioinformatics, but also inspiration to help overcome them.


Appendix: Biological Resources

Useful biological resources, databases, organizations, and applications are listed in three tables. The tables include resources cited in the book. The acronyms commonly used to refer to the resources are spelled out, and current URLs are provided. Additional resources are the Public Catalog of Databases available at INFOBIOGEN (http://www.infobiogen.fr/services/dbcat) and the Biocatalog available at the European Bioinformatics Institute (http://www.ebi.ac.uk/biocat).
Table A.1. Biological databases.

Category: Comprehensive Data Center (broad content including sequence, structure, function, etc.)
Databases and URLs:
EBI (European Bioinformatics Institute): http://www.ebi.ac.uk/
EMBL (European Molecular Biology Laboratory): http://www.embl-heidelberg.de/
ExPASy (Expert Protein Analysis System, Swiss Institute of Bioinformatics): http://us.expasy.org
The INFOBIOGEN Deambulum: http://www.infobiogen.fr/services/deambulum/english/menu.html
Institut Pasteur: http://bioweb.pasteur.fr/docs/gendocdb/banques.html
NCBI (National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
TIGR (The Institute for Genomic Research): http://www.tigr.org/
Whitehead/MIT (Massachusetts Institute of Technology) Genome Center: http://www-genome.wi.mit.edu/


Category: DNA or Protein Sequence
Databases and URLs:
DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/
dbEST (Expressed Sequence Tags Database): http://www.ncbi.nih.gov/dbEST
EMBL (Nucleotide Sequence Database): http://www.ebi.ac.uk/embl/index.html
GenBank and the NCBI Nucleotide Database: http://www.ncbi.nlm.nih.gov/Genbank
GenPept (protein database translated from the last release of GenBank): ftp://www.infobiogen.fr/pub/db/genpept/ or ftp://ftp.ncbi.nih.gov/genbank/
GSDB (Genome Sequence DataBase): http://wehih.wehi.edu.au/gsdb/gsdb.html
PIR (Protein Information Resource): http://pir.georgetown.edu/
RefSeq (comprehensive integrated non-redundant set of sequences): http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html
Swiss-Prot (protein knowledgebase): http://us.expasy.org/sprot/

Category: Genomes (complete genome sequences and related information for specific organisms)
Databases and URLs:
EBI complete genomes: http://www.ebi.ac.uk/genomes/
EcoCyc (Genome of Escherichia coli): http://biocyc.org/ecocyc
FlyBase (Database of Drosophila Genome): http://flybase.bio.indiana.edu/
GDB (Genome database): http://www.gdb.org
Institut Pasteur complete genomes: http://www.pasteur.fr/actu/presse/com/dossiers/GBgenomics/GBintro.html
MGD (Mouse Genome Database): http://www.informatics.jax.org/
NCBI complete genomes: http://www.ncbi.nlm.nih.gov/Genomes/index.html
RatMap (Rat Genome Database): http://ratmap.gen.gu.se
SGD (Saccharomyces Genome Database): http://genome-www.stanford.edu/Saccharomyces/
UCSC Genome Bioinformatics: http://genome.ucsc.edu/
WormBase (Genome and Biology of C. elegans): http://www.wormbase.org/


Category: Genetics (gene mapping, mutations, and diseases)
Databases and URLs:
AllGenes (predicted human and mouse genes): http://www.allgenes.org
GeneCards (human genes): http://bioinfo.weizmann.ac.il/cards
GeneLynx (human genes): http://www.genelynx.org
Genew (database of approved HUGO symbols): http://www.gene.ucl.ac.uk/nomenclature/
GDB (Genome Database): http://gdbwww.gdb.org/gdb/
HGMD (Human Gene Mutation Database): http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html
OMIM (Online Mendelian Inheritance in Man): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM

Category: Gene Expression (microarray and cDNA gene expression)
Databases and URLs:
ArrayExpress (microarray data): http://www.ebi.ac.uk/arrayexpress
BodyMap (expression information about human and mouse genes): http://bodymap.ims.u-tokyo.ac.jp/
dbEST (Expressed Sequence Tag Database): http://www.ncbi.nlm.nih.gov/dbEST/index.html
GeneX (gene expression database): http://www.ncgr.org/genex
GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/
MGED (Microarray Gene Expression Database): http://www.mged.org
UniGene (partition of GenBank into clusters that contain the sequences that represent a unique gene): http://www.ncbi.nlm.nih.gov/UniGene/

Category: Structure (three-dimensional structures of small molecules, proteins, DNA)
Databases and URLs:
CSD (Cambridge Structural Database): http://www.ccdc.cam.ac.uk/prods/csd/csd.html
HSSP (database of Homology-derived Secondary Structure of Proteins): http://www.hgmp.mrc.ac.uk/Bioinformatics/Databases/hssp-help.html
NDB (Nucleic Acid Database): http://ndbserver.rutgers.edu/NDB/ndb.html
PDB (Protein Data Bank): http://www.rcsb.org/pdb/index.html


Category: Classification of Protein Family and Protein Domains
Databases and URLs:
Blocks database (protein blocks): http://www.blocks.fhcrc.org/
CATH (Protein Structure Classification Database): http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html
InterPro (resource for whole genome analysis): http://www.ebi.ac.uk/interpro/index.html
Pfam (database of protein families): http://pfam.wustl.edu/
PRINTS (Protein Fingerprint Database): http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
ProDom (protein domain families): http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
PROSITE (database of protein families and domains): http://www.expasy.ch/prosite/
SCOP (Structure Classification of Proteins): http://scop.mrc-lmb.cam.ac.uk/scop/

Category: Protein Pathway (protein-protein interactions and metabolic pathway)
Databases and URLs:
BIND (Biomolecular Interaction Network Database): http://www.binddb.org/
DIP (Database of Interacting Proteins): http://dip.doe-mbi.ucla.edu/
EcoCyc (Encyclopedia of E. coli Genes and Metabolism): http://biocyc.org/ecocyc
KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.ad.jp/kegg/kegg2.html#pathway
WIT (Metabolic Pathway): http://wit.mcs.anl.gov/WIT2/

Category: Proteomics (proteins, protein family)
Databases and URLs:
AfCS (Alliance for Cellular Signaling): http://cellularsignaling.org/
JCSG (Joint Center for Structural Genomics): http://www.jcsg.org/scripts/prod/home.html
PKR (Protein Kinase Resource): http://pkr.sdsc.edu/html/index.shtml

Category: Pharmacogenomics, Pharmacogenetics, Single Nucleotide Polymorphism (SNP), Genotyping
Databases and URLs:
ALFRED (Allele Frequency Database): http://alfred.med.yale.edu/alfred/index.asp
CEPH (Centre d'Etude du Polymorphisme Humain genotype database): http://www.cephb.fr/cephdb/


dbSNP (Single Nucleotide Polymorphism Database): http://www.ncbi.nlm.nih.gov/SNP/
LocusLink: http://www.ncbi.nlm.nih.gov/LocusLink
PharmGKB (Pharmacogenetics Knowledge Base): http://pharmgkb.org
SNP consortium: http://snp.cshl.org

Category: Tissues, Organs, and Organisms
Databases and URLs:
BRAID (Brain Image Database): http://braid.rad.jhu.edu/interface.html
NeuroDB (Neuroscience Federated Database): http://www.npaci.edu/DICE/Neuro/
Visible Human Project: http://www.nlm.nih.gov/research/visible/visible_human.html
Whole Brain Atlas: http://www.med.harvard.edu/AANLIB/home.html

Category: Literature Reference
Databases and URLs:
PubMed (MEDLINE bibliographic database): http://www.ncbi.nlm.nih.gov/entrez/
USPTO (US Patent and Trademark Office): http://www.uspto.gov/


Table A.2. Biological ontology resources.

Organization: Human Genome Organization (HUGO) Gene Nomenclature Committee (HGNC)
URL: http://www.gene.ucl.ac.uk/nomenclature/
Description: HGNC is responsible for the approval of a unique symbol for each gene and designates descriptions of genes. Aliases for genes are also listed in the database.

Organization: Gene Ontology Consortium (GO)
URL: http://www.geneontology.org
Description: GO aims to develop ontologies describing the molecular function, biological process, and cellular component of genes and gene products for eukaryotes. Members include genome databases of fly, yeast, mouse, worm, and Arabidopsis.

Organization: Plant Ontology Consortium
URL: http://plantontology.org
Description: Produces structured, controlled vocabularies applied to plant-based database information.

Organization: Microarray Gene Expression Data Society (MGED)
URL: http://www.mged.org/
Description: The MGED group facilitates the adoption of standards for DNA-microarray experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalization methods.

Organization: NBII (National Biological Information Infrastructure)
URL: http://www.nbii.gov/disciplines/systematics.html
Description: NBII provides links to taxonomy sites for all biological disciplines.

Organization: ITIS (Integrated Taxonomic Information System)
URL: http://www.itis.usda.gov/
Description: ITIS provides taxonomic information on plants, animals, and microbes of North America and the world.

Organization: MeSH (Medical Subject Headings)
URL: http://www.nlm.nih.gov/mesh/meshhome.html
Description: National Library of Medicine (NLM) controlled vocabulary used for indexing articles, cataloging books and other holdings, and searching MeSH-indexed databases, including MEDLINE.

Organization: SNOMED (Systematized Nomenclature of Medicine)
URL: http://www.snomed.org/
Description: SNOMED is recognized globally as a comprehensive, precise controlled terminology created for the indexing of the entire medical record.

Organization: International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM)
URL: http://www.cdc.gov/nchs/about/otheract/icd9/abticd9.htm
Description: ICD-9-CM is the official system of assigning codes to diagnoses and procedures associated with hospital utilization in the United States. It is published by the U.S. National Center for Health Statistics.


Organization: International Union of Pure and Applied Chemistry (IUPAC); International Union of Biochemistry and Molecular Biology (IUBMB) Nomenclature Committee
URL: http://www.chem.qmul.ac.uk/iubmb/
Description: IUPAC and IUBMB make recommendations on organic, biochemical, and molecular biology nomenclature, symbols, and terminology.

Organization: PharmGKB (Pharmacogenetics Knowledge Base)
URL: http://pharmgkb.org/
Description: PharmGKB develops an ontology for pharmacogenetics and pharmacogenomics.

Organization: mmCIF (The macromolecular Crystallographic Information File)
URL: http://pdb.rutgers.edu/mmcif/ or http://www.iucr.ac.uk/iucr-top/cif/index.html
Description: mmCIF is sponsored by IUCr (International Union of Crystallography) to provide a dictionary for data items relevant to macromolecular crystallographic experiments.

Organization: LocusLink
URL: http://www.ncbi.nlm.nih.gov/LocusLink/
Description: LocusLink contains gene-centered resources including nomenclature and aliases for genes.

Organization: RiboWeb
URL: http://riboweb.stanford.edu/riboweb/login-frozen.html
Description: RiboWeb provides access to a knowledge base containing a standardized representation of ribosomal structural information.

Organization: ENZYME
URL: http://us.expasy.org/enzyme/
Description: ENZYME is a repository of information relative to the nomenclature of enzymes.

Organization: IMGT (ImMunoGeneTics information system)
URL: http://imgt.cines.fr/
Description: IMGT is a high-quality integrated information system specializing in immunoglobulins (IG), T-cell receptors (TR), major histocompatibility complex (MHC), and related proteins of the immune system of human and other vertebrate species.


Table A.3. Biological tools and systems.

Category: Microarray analysis
Application names and URLs:
MAS (Affymetrix MicroArray Suite): http://www.affymetrix.com/products/software/specific/mas.affx
ImaGene (BioDiscovery): http://www.biodiscovery.com/

Category: Sequence similarity search
Application names and URLs:
BLAST (Basic Local Alignment Search Tool): http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html
FASTA (Sequence similarity and homology search): http://www.ebi.ac.uk/fasta33/index.html and http://www.ebi.ac.uk/fasta33/genomes.html
SMART (Simple Modular Architecture Research Tool): http://smart.embl-heidelberg.de/
WU-BLAST (Washington University BLAST): http://blast.wustl.edu/blast/README.html

Category: Multiple sequence alignment
Application names and URLs:
CAP (Contig Assembly Program): http://fenice.tigem.it/bioprg/interfaces/cap3.html
ClustalW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalW
ClustalX: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX
LASSAP (LArge Scale Sequence compArison Package), also known as BioFacet: http://www.gene-it.com/index.html
MEGA: http://www.megasoftware.net/
MultAlin: http://prodes.toulouse.inra.fr/multalin/multalin.html
PAUP (Phylogenetic Analysis Using Parsimony): http://paup.csit.fsu.edu/paupfaq/faq.html
Phylip: http://evolution.genetics.washington.edu/phylip.html
TMAP: http://www.mbb.ki.se/tmap/

Category: Analysis
Application names and URLs:
EMBOSS (European Molecular Biology Open Software Suite): http://www.hgmp.mrc.ac.uk/Software/EMBOSS/
HMMER (Profile hidden Markov models for biological sequence analysis): http://hmmer.wustl.edu/
GeneSpring (Silicon Genetics): http://www.silicongenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf
PSORT: http://psort.nibb.ac.jp/
Spotfire: http://spotfire.com
StackPACK: http://www.sanbi.ac.za/Dbases.html
Wise (Genewise): http://www.ebi.ac.uk/Wise2/index.html

Category: Sequence folding, Structure prediction
Application names and URLs:
Mfold: http://www.bioinfo.rpi.edu/applications/mfold/
NNPREDICT (Protein Secondary Structure Prediction): http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html


Category: Pattern recognition
Application names and URLs:
Partek: http://www.partek.com

Category: Retrieval systems
Application names and URLs:
BioRS (Biomax): http://www.biomax.de/index.html
DBGET/LinkDB: http://www-genome.ad.jp/dbget

Category: Database systems
Application names and URLs:
AceDB: http://www.acedb.org/
GUS (Genomics Unified Schema platform): http://www.gusdb.org
GIMS (Genome Information Management System): http://www.cs.man.uk/img/gims/
Informax: http://www.informaxinc.com
MySQL (open source DBMS): http://www.mysql.com/
SeqStore: http://www.accelrys.com/dstudio/ds_seqstore/
Tripos: http://www.tripos.com/


Glossary

AADM The Affymetrix Analysis Data Model. The relational database schema the Affymetrix LIMS and MicroDB systems use to store GeneChip expression results.

aggregation A computation whose result value depends on a stream of input values, such as an average, sum, or standard deviation.
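The aggregation entry can be illustrated with a minimal sketch that computes a sum, mean, and standard deviation over a stream without storing the values. It uses Welford's online method, which is a choice made for this example; the class name `RunningStats` and the sample values are hypothetical and do not come from any system described in the book.

```python
import math


class RunningStats:
    """Aggregate a stream of numbers without storing them (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, value):
        """Fold one more input value into the aggregate."""
        self.count += 1
        self.total += value
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def stddev(self):
        """Sample standard deviation; 0.0 when fewer than two values were seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


# Hypothetical expression levels arriving one at a time.
stats = RunningStats()
for expression_level in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(expression_level)
```

Each call to `add` updates the result incrementally, which is what lets a query processor evaluate an aggregate in a single pass over its input.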
API (application programming interface) A set of routines generally available for use by programmers to provide portable code. The programmer only has to worry about the call and its parameters and not the details of implementation, which may vary from system to system.

ASN.1 Abstract Syntax Notation One. This is an ISO standard for open systems interconnection.

automatic summary table A special table created to cache the results of a specific query against other tables. Subsequently, when another query is submitted, the query processor may be able to deduce that the new query can be rewritten as a query against the cached result. Using the pre-computed result in this way can have a large performance benefit. The downside is that the automatic summary table must be maintained as the underlying tables are updated.
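The trade-off described in the automatic summary table entry can be sketched with Python's built-in `sqlite3` module. SQLite has no automatic summary tables, so this example emulates one by hand: the table names, the columns, and the `refresh_summary` helper are all assumptions made for illustration, and a real query processor would do the rewriting and maintenance itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expression (gene TEXT, tissue TEXT, level REAL)")
conn.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [("g1", "liver", 2.0), ("g1", "brain", 4.0), ("g2", "liver", 10.0)],
)


def refresh_summary(conn):
    """Recompute the cached per-gene averages from the base table."""
    conn.execute("DROP TABLE IF EXISTS expression_summary")
    conn.execute(
        "CREATE TABLE expression_summary AS "
        "SELECT gene, AVG(level) AS avg_level, COUNT(*) AS n "
        "FROM expression GROUP BY gene"
    )


refresh_summary(conn)

# A query for per-gene averages can now be answered from the cached result.
cached = dict(conn.execute("SELECT gene, avg_level FROM expression_summary"))

# The cost: once the base table changes, the summary is stale until refreshed.
conn.execute("INSERT INTO expression VALUES ('g2', 'brain', 20.0)")
stale = dict(conn.execute("SELECT gene, avg_level FROM expression_summary"))
refresh_summary(conn)
fresh = dict(conn.execute("SELECT gene, avg_level FROM expression_summary"))
```

The `stale` result shows why maintenance matters: reading the summary between the insert and the refresh returns an average that no longer matches the underlying table.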
autonomy of databases Degree of control a database retains within an integration architecture, including which transactions are permissible, how it executes transactions, and so on. In the context of integration of distributed resources, examples of integration affecting the autonomy of each resource are tight integration, semi-autonomous integration, and total isolation. They respectively characterize a lack of control, moderate control, and total control. Autonomy is the second characterization of integration with distribution.


bag A data type that represents a homogeneous collection of objects such that the order of appearance of these objects in the collection is unimportant, but the number of occurrences is important. Unlike a set, a value may occur multiple times in a bag.
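The difference between a bag and a set can be shown with Python's standard `collections.Counter`, used here as a stand-in for a bag type. The nucleotide list is a made-up example.

```python
from collections import Counter

bases = ["A", "C", "G", "A", "T", "A"]

as_bag = Counter(bases)  # bag: keeps the number of occurrences, ignores order
as_set = set(bases)      # set: ignores both order and number of occurrences

# The same elements in a different order form an equal bag.
same_bag = Counter(["T", "A", "A", "G", "C", "A"])

# Dropping one occurrence of "A" gives a different bag but the same set.
fewer_a = Counter(["A", "C", "G", "A", "T"])
```

Here `as_bag == same_bag` holds because order is irrelevant, while `as_bag != fewer_a` because the count of "A" differs, even though both reduce to the same set.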
bindjoin A federated join algorithm in which the federated server ships values of the join column(s) from one of the tables to the remote data source that stores the other table. The remote source searches its table for rows with matching values, and returns these to the federated server.
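The steps in the bindjoin entry can be sketched in a few lines. This is an in-memory illustration only: `remote_lookup` is a hypothetical stand-in for the remote data source, and the row dictionaries and the `gene_id` column are invented for the example. A real federated server would send the shipped values over the network.

```python
def remote_lookup(remote_rows, column, values):
    """Stand-in for the remote source: return rows whose join column matches."""
    wanted = set(values)
    return [row for row in remote_rows if row[column] in wanted]


def bind_join(local_rows, remote_rows, column):
    """Join local and remote rows by shipping only the join-column values."""
    # 1. The federated server collects the join values from its local table.
    shipped = {row[column] for row in local_rows}
    # 2. The remote source searches its table and returns matching rows.
    matches = remote_lookup(remote_rows, column, shipped)
    by_key = {}
    for row in matches:
        by_key.setdefault(row[column], []).append(row)
    # 3. The federated server combines each local row with its matches.
    return [
        {**local, **remote}
        for local in local_rows
        for remote in by_key.get(local[column], [])
    ]


genes = [{"gene_id": "g1", "symbol": "TP53"}, {"gene_id": "g2", "symbol": "BRCA1"}]
proteins = [
    {"gene_id": "g1", "protein": "p53"},
    {"gene_id": "g3", "protein": "unrelated"},
]
joined = bind_join(genes, proteins, "gene_id")
```

Only the values `g1` and `g2` travel to the remote side, and only the one matching protein row travels back, which is the point of the algorithm when the remote table is large.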
BLAST The Basic Local Alignment Search Tool. Used to compare a gene or protein sequence against other sequences.

blastn An implementation of BLAST used for nucleotide-nucleotide comparisons.

blastp An implementation of BLAST used for protein-protein comparisons.

BLOB Binary Large Object. A data type for representing a long string of binary data (e.g., an image or a video) whose internal structure is unknown to the database management system. Due to their potential for great size, database systems typically manage BLOB data with special techniques to eliminate unnecessary copying and allow random access to sub-pieces. Unlike a CLOB, a BLOB is not associated with a particular character set or encoding.

Boolean circuit A family of Boolean circuits is an infinite collection of acyclic Boolean circuits made up of AND, OR, and NOT gates.

box plots An excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.

browsing The act of accessing information available on the World Wide Web. This is typically an interactive process, with a person examining Web pages and following links.

bulk data type Refers to data types that are collections of objects. Examples of bulk data types are sets, bags, lists, and arrays.

CDATA Textual portion of an XML document that is ignored by the parser.
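As a small illustration of the CDATA entry, the sketch below parses a made-up XML fragment with Python's standard `xml.etree.ElementTree`. The parser does not interpret the characters inside the CDATA section as markup; it passes them through as plain text. The element names and content are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# The CDATA section contains "<", "&", and something that looks like a tag.
document = "<entry><note><![CDATA[score < 5 & <b>not a tag</b>]]></note></entry>"

root = ET.fromstring(document)
note = root.find("note")

text = note.text          # the CDATA content, delivered as character data
child_count = len(note)   # no <b> child element was created
```

Without the CDATA wrapper the same content would have to be escaped (`&lt;`, `&amp;`), or the `<b>` element would be parsed as a child of `note`.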


cDNA Complementary DNA. DNA copies of the mRNA expressed in a specified tissue.

CDS Coding sequences.

CGI Common Gateway Interface.

CLI Call-Level Interface. A general-purpose interface to IBM DB2 that conforms to ODBC 2.0 level 2 and ODBC 3.0 level 1, but can be used without an ODBC driver. It also supports some ODBC 3.0 level 2 functions, as well as some DB2-specific functions.

CLOB Character Large Object. A data type for representing a long string of character data (e.g., a text document or genomic sequence) whose internal structure is unknown to the database management system. Due to their potential for great size, database systems typically manage CLOB data with special techniques to eliminate unnecessary copying and allow random access to sub-pieces. A CLOB is associated with a specific character set, and character codes will be translated appropriately when it is retrieved.

CNS tissue Central nervous system tissue.

co-clustered fragment Gene fragments derived from the same UniGene cluster or consensus sequence cluster.


CPL (Collection Programming Language) A high-level query language based on the comprehension syntax and supported by Kleisli.[1]

comparative genomics The study of human genetics by comparisons with model organisms such as mice, fruit flies, and the bacterium E. coli.

complex value data Data whose type system includes not only simple types such as strings and numbers, but also arbitrarily nested sets, lists, bags, records, and variants.

1. L. Wong. "Kleisli: A Functional Query System." Journal of Functional Programming 10, no. 1 (2000): 19-56.


Conceptual Model (CM) An abstraction of the objects represented in an application, as well as their properties and their relationships, that provides a conceptual representation of the application. CMs typically capture the class and object structure as well as domain-specific relationships of the modeled world. CMs can be expressed in a variety of ways such as through entity-relationship diagrams (ER), class diagrams in the Unified Modeling Language (UML), or by using formal approaches based on first-order predicate logic.

CORBA Common Object Request Broker Architecture. An OMG standard for an architecture and infrastructure that allows computer applications to work together over networks.

CPU Central Processing Unit.

database (DB) A collection of information organized in such a way that a computer program can quickly select desired pieces of data (see database management system).

database management system A collection of programs that enables storing, accessing, modifying, and extracting information in a database.
data cleansing (Also called data scrubbing) This is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated. See also data curation.
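
The four kinds of problem named in this entry can be sketched in a few lines. The record layout (`accession` and `sequence` keys), the rule that a valid sequence contains only A, C, G, and T, and the `cleanse` helper itself are all assumptions made for this example; real cleansing rules depend on the data source.

```python
def cleanse(records):
    """Amend formatting, then drop incomplete, incorrect, or duplicated records."""
    valid = set("ACGT")
    seen = set()
    cleaned = []
    for rec in records:
        # Improperly formatted: trim whitespace and normalize case.
        acc = (rec.get("accession") or "").strip().upper()
        seq = (rec.get("sequence") or "").strip().upper()
        if not acc or not seq:
            continue  # incomplete: a required field is missing
        if not set(seq) <= valid:
            continue  # incorrect: contains a character outside A, C, G, T
        if acc in seen:
            continue  # duplicated: accession already kept
        seen.add(acc)
        cleaned.append({"accession": acc, "sequence": seq})
    return cleaned
```
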
data curation The process of storing and checking the accuracy of data so they remain accessible indefinitely. When applied in the context of multiple data sources, this also implies the reconciliation of semantic conflicts that may arise from conflicting information.

data fusion The process of deriving insight from information acquired from multiple sources (sensors, databases, information gathered by humans, etc.), of which data integration is a key step. The term was first used by the military to correlate and analyze information in time and space, to identify and track individual objects (equipment and units), to assess the situation, to determine threats, and to detect patterns in activity.

data integration A process that combines data from multiple, possibly heterogeneous and inconsistent, data sources into a single, consistent source.

data mining Analyzing, exploring, or clustering a data set with statistical techniques.

data model Provides the means for specifying particular data structures, for constraining the data associated with these structures, and for manipulating the data within a database system. To handle data outside the database system, this traditional definition is extended to include a data exchange format, which is a means for bringing data outside the database system into it and also for moving data inside the database system to the outside.
data-shipping Within the client/server context, data-shipping consists of transferring the data from the client to the server and performing the execution of the query at the server. (See query-shipping for an alternate approach.)
data source Any data repository (e.g., database, flat files).

data type Classifies a particular type of information. Examples of data types are: integer, floating point, number, character, and string. (See bulk data type.)
data warehouse A collection of data integrated from multiple sources and contained within a unique system, usually a database. Data needs to be translated to a common format, cleansed, and reconciled before being integrated into the data warehouse. It constitutes a subject-oriented, integrated, time variant, and nonvolatile data repository.

Datalog A query language that allows users to access and manipulate data contained in predicates through if-then-else rules.
DBMS (See database management system.)

description logics Knowledge representation languages tailored for expressing knowledge about concepts and concept hierarchies.

distributed database systems A collection of logically interrelated databases, distributed at multiple sites and connected by a computer network such that each database has autonomous processing capability and participates in the execution of queries that are split across multiple sites. Distribution of databases characterizes the fact that the data are split over several databases. Distribution is the second characterization of databases with autonomy.


DNA Deoxyribonucleic acid. A linear nucleic acid polymer composed of four kinds of nucleotides: Adenine, Thymine, Guanine, Cytosine. In native form inside the nucleus, it is a double-helix of two anti-parallel strands held together by hydrogen bonds. DNA is the carrier of genetic information for many species.

DNA microarray A mechanism for massively parallel gene expression and gene discovery studies in which probes (or oligonucleotide sequences) with known identity are placed on glass or nylon substrates and used to determine complementary binding through hybridization. A synonym for this is probe array.

DNA sequencing The experimental process of determining the nucleotide sequence of a region of DNA. This is done by labelling each nucleotide (A, C, G, or T) with either a radioactive or fluorescent marker that identifies it. There are several methods of applying this technology, each with its advantages and disadvantages. For more information, refer to a current textbook. High throughput laboratories frequently use automated sequencers, which are capable of rapidly reading large numbers of templates. Sometimes the sequences may be generated more quickly than they can be characterized.

domain map (DM) A kind of ontology to denote semantic networks of terms and their relationships. A precise meaning can be associated to DMs via a logic formalization. DMs are used to express terminological knowledge.

EDB Extensional database.

ER model Entity-relationship model. A data model consisting of entity classes and relationships traditionally used to describe relational database schema.
enzyme A biological macromolecule, usually a protein, that acts as a catalyst. The Enzyme Nomenclature Committee classifies these molecular activities by assigning a unique Enzyme Catalogue (EC) number.

EST sequence Expressed Sequence Tags. Short sequence fragments (<200 base pairs) that are known to express collectively in a given tissue or a pool of tissue. Clusters of these sub-fragments assembled into consensus sequences act as identifiers of genes or transcripts expressed in that tissue.

extensional database (EDB) Refers to (i) the set of tuples (i.e., "facts") stored in a database and/or (ii) the relational schema of the tuples/facts which are stored directly in a database.

FDM Functional Data Model.

federation A collection of semi-autonomous, distributed databases in which each database has significant autonomy while still providing the capability to access integrated resources in a unified manner.

First Order logic (FO) is the logic that can only quantify over sets of values. Second-order logic can quantify over functions, and higher-order logic can quantify over any type of entity.
foreign key This is a field in a relational table that matches the primary key column of another table. The foreign key can be used to cross-reference tables.
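
The cross-referencing role of a foreign key can be sketched with Python's built-in `sqlite3` module. The `gene` and `transcript` tables, their columns, and the sample rows are invented for this example, and the `PRAGMA foreign_keys` line is specific to SQLite, which does not enforce foreign keys unless asked to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: turn on enforcement
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
conn.execute(
    """CREATE TABLE transcript (
           transcript_id INTEGER PRIMARY KEY,
           gene_id INTEGER REFERENCES gene(gene_id),  -- the foreign key
           length INTEGER)"""
)
conn.execute("INSERT INTO gene VALUES (1, 'TP53')")
conn.execute("INSERT INTO transcript VALUES (10, 1, 2500)")  # illustrative values

# Cross-reference the two tables through the foreign key.
rows = conn.execute(
    """SELECT g.symbol, t.length
       FROM transcript AS t JOIN gene AS g ON g.gene_id = t.gene_id"""
).fetchall()

def insert_orphan():
    """Try to insert a transcript whose gene_id matches no gene; report success."""
    try:
        conn.execute("INSERT INTO transcript VALUES (11, 99, 100)")
        return True
    except sqlite3.IntegrityError:
        return False
```

With enforcement on, the database rejects a row whose foreign key value has no matching primary key in the referenced table.
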

FTP File Transfer Protocol.

functional genomics The study of genes, their resulting proteins, and the role played by the proteins in the body's biochemical processes.

functional programming languages Programming languages that emphasize a particular paradigm of programming technique known as "functional programming." In this paradigm, all programs are expressed as mathematical functions and are generally free from side effects. Examples of functional programming languages are LISP, Haskell, and SML.

gene An abstract entity that is the fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e., a protein or RNA molecule).

GeneChip Affymetrix whole genome arrays or dynamic custom arrays.

gene expression The process by which a gene's coded information is converted into the structures present and operating in the cell. Expressed genes include those that are transcribed into mRNA and then translated into protein and those that are transcribed into RNA but not translated into protein (e.g., transfer and ribosomal RNAs).
gene fragment An abstract sub-sequence fragment of a representative target transcript (mRNA) from which the individual probes or oligo sequences are derived. Synonyms include composite target sequence, probe set, sequence fragment, and target sequence.
gene product The biochemical material, either RNA or protein, resulting from expression of a gene. The amount of gene product is used to measure how active a gene is; abnormal amounts can be correlated with disease-causing alleles.

genome All the genetic material in the chromosomes of a particular organism; its size is generally given as its total number of base pairs.

genome project See Human Genome Initiative.

genomics The study of genes and their function.

gene chip microarray technology Development of cDNA microarrays from a large number of genes. Used to monitor and measure changes in gene expression for each gene represented on the chip.

global schema A single, unifying, semantically consistent view of data contained in multiple, distributed, heterogeneous data sources.

global-as-view (GAV) An integration approach is global-as-view (GAV) when the global schema is expressed with respect to the source schemas (the schemas of the integrated sources). Queries asked against the global schema are easily translated into source queries by replacing the meaning of each relation and attribute of the global schema with its definition in terms of the source schemas. GAV is an alternative to LAV.

grid Grid computing is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations.

GUI Graphical User Interface.
heterogeneous databases Used in the context of distributed databases when systems differ in some way, such as data representation, query language, or semantics.

host variable The SQL representation for an application program variable. A host variable can be the container for data inserted into or retrieved from the database, or it can represent a query parameter whose value will be supplied by the application just prior to execution of the query.
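
Embedded SQL host variables have a close analogue in the parameter placeholders of Python's `sqlite3` module, which is used here only as a stand-in. The `sample` table, its rows, and the variable names are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER, tissue TEXT)")

# Application variables acting as containers for data to be inserted.
new_rows = [(1, "liver"), (2, "brain"), (3, "liver")]
conn.executemany("INSERT INTO sample VALUES (?, ?)", new_rows)

# An application variable acting as a query parameter: its value is
# supplied just before the query runs, via the "?" placeholder.
wanted_tissue = "liver"
ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM sample WHERE tissue = ? ORDER BY id", (wanted_tissue,)
    )
]
```
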
HTTP Hypertext Transfer Protocol.

Human Genome Initiative Collective name for several projects begun in 1986 by the U.S. Department of Energy to create an ordered set of DNA segments from known chromosomal locations, to develop new computational methods for analyzing genetic map and DNA sequence data, and to develop new techniques and instruments for detecting and analyzing DNA. This DOE initiative is now known as the Human Genome Program. The joint national effort, led by DOE and National Institutes of Health, is now known as the Human Genome Project.

Human Genome Project (HGP) Formerly titled Human Genome Initiative.

hybridization The biochemical process by which two complementary, single-stranded nucleic acid chains form a stable, double-stranded helix chain. DNA microarrays use hybridization reactions to assay target transcripts extracted from the samples.
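
Complementarity between two single strands can be checked computationally. The sketch below is a deliberate idealization that assumes DNA sequences over A, C, G, and T and accepts only perfect, full-length pairing; the function names are invented for this example, and real hybridization also depends on conditions such as temperature and tolerated mismatches.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the strand that pairs with `seq`, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def can_hybridize(probe, target):
    """True only when `probe` is the exact reverse complement of `target`."""
    return probe == reverse_complement(target)
```

The reversal reflects the anti-parallel orientation of the two strands in the double helix.
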
IDB Intensional database.

intensional database (IDB) Refers to (i) the set of tuples (i.e., virtual relations) in a database, which are defined by means of logic rules (e.g., Datalog formulas or SQL "create view" statements) and/or (ii) the relational schema of the virtual relations defined by those rules.
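
The split between stored facts and rule-defined relations can be sketched in a few lines. Here the `parent` facts play the role of the extensional database, and the `ancestor` relation is derived from two Datalog-style rules by repeated rule application until nothing new appears. The relation names, the sample facts, and this naive evaluation strategy are assumptions of the example, not a description of any particular Datalog engine.

```python
# Extensional part: facts stored directly.
parent = {("ann", "bob"), ("bob", "cat"), ("cat", "dan")}

def ancestors(parent_facts):
    """Derive the intensional relation `ancestor` from the rules
         ancestor(X, Y) :- parent(X, Y).
         ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
    """
    derived = set(parent_facts)            # first rule
    while True:
        # Second rule: join parent with the ancestor tuples found so far.
        new = {
            (x, z)
            for (x, y) in parent_facts
            for (y2, z) in derived
            if y == y2
        }
        if new <= derived:                 # fixpoint reached
            return derived
        derived |= new
```

The derived tuples are never stored as facts; they exist only as a consequence of the rules.
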
ISA relationship When between two entities, captures the notion of generalization. The opposite or inverse of generalization is called specialization. If "A" ISA "B", then "B" is the more generic concept and "A" is the specific concept. The most significant property of an ISA relationship is that of inheritance. All that is specified to be true about the generic concept is also true for the specific concept. That means that all attributes, their values, and constraints (rules) are inherited from the more generic level concept down to the more specific level concept, as are all relationships in which the more generic level concept participates.

ISDK InSilico Discovery Kit describes experimental steps carried out in computers the same way an experimental protocol describes the steps carried out in a wet laboratory.


ISO International Organization for Standardization. An international standards-making body.

JDBC Java DataBase Connectivity. JDBC technology is an application programming interface (API) that provides cross-database connectivity to a wide range of relational database systems from the Java programming language. It also provides access to other tabular data sources, such as spreadsheets or flat files.

K2MDL The K2 mediator definition language, a high-level language that extends ODMG's ODL with OQL definitions and variants.

KEGG Kyoto Encyclopedia of Genes and Genomes. An effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects.

known gene Refers to genes officially approved by the model organism nomenclature committee. For example, HUGO is the committee for humans.

Kripke structure Modal logics provide a general framework for reasoning about what is necessarily or possibly true, in particular when dealing with several "possible worlds" that are reachable from one another by a temporal accessibility relation. Kripke structures are families of conventional first-order logic structures (one for each "possible world"), which may be reachable from one another as described by the accessibility relation R.

LAV See local-as-view.

LIMS Laboratory Information Management System. Software that helps manage the workflow and data associated with a laboratory.

link-driven federation of databases This type of federation allows a set of data sources to be browsed by a user who asks a single retrieval query and then explores the output by browsing from one source to the other via hyperlinks. An example of a link-driven federation is SRS.

LISP This is a programming language invented by John McCarthy in the late 1950s as a formalism for reasoning about the use of recursion equations as a model for computation.


list A data type that represents a homogeneous collection of objects such that both the order of appearance and the number of occurrences of objects in the collection are important.

local-as-view (LAV) An integration approach is local-as-view when the source schemas are expressed by means of the global schema. Queries asked against the global schema are translated into source queries by replacing the sub-query defining schema components of source schemas. An alternative approach to LAV is global-as-view.

LOGSPACE The class of problems solvable in deterministic logarithmic space.

materialized query table See automatic summary table.

materialized view The cached result of a query against a database. The query can restructure the data or be intended to load data into a data warehouse. (See also automatic summary table.)

mediator A middleware component of a database integration infrastructure that translates data from fully autonomous distributed heterogeneous data sources to a semantically consistent representation. Mediators do not assume that integrated sources will all be relational databases; instead, they can be various database systems (relational, object-relational, object, XML, etc.), flat files, and so on.

MGED The Microarray Gene Expression Data society. An international organization for facilitating the sharing of microarray data from functional genomics and proteomics experiments.

MIAME Minimum information about a microarray experiment. This is a set of guidelines developed by the MGED Society to outline the minimum information required to unambiguously interpret microarray data and subsequently to allow independent verification of this data at a later stage if required.

microarray See DNA microarray.

middleware Connectivity software that consists of a set of enabling services that allow multiple processes running on one or more machines to interact across a network.


ML (meta language) A functional programming language.

model-based mediation (MBM) A wrapper/mediator approach and architecture for information integration in which representations of domain semantics (domain maps, process maps, and semantic integrity constraints) are used to facilitate queries across sources.

mRNA A single-stranded ribonucleic acid molecule derived from the DNA template of a gene when the gene is transcribed during the gene expression process, which takes place in the nucleus of the cell. mRNA specifies the order of the amino acids to be coded in a protein by the translation process, which takes place inside the cytoplasm of the cell. Its role is to transmit instructions from DNA sequences in the nucleus to the protein-making machinery in the cytoplasm of the cell.

multi-database A system consisting of fully autonomous distributed databases. The integration component is in charge of providing the user with a query language to query integrated resources, executing the query by collecting needed data from each integrated resource, and returning the result to the user.

non-materialized view The result of a query that restructures a database and is not cached. The query that defines the non-materialized view is usually stored as a functional definition of the data contained within it, and it is this function that is used to recreate the view dynamically on demand.
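The contrast between a non-materialized and a materialized view can be sketched with Python's built-in sqlite3 module. SQLite has no native materialized views, so this sketch emulates one by caching the query result in an ordinary table. The table and view names (genes, gene_counts_view, gene_counts_cached) and the sample rows are hypothetical.

```python
import sqlite3

def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE genes (name TEXT, organism TEXT)")
    conn.executemany(
        "INSERT INTO genes VALUES (?, ?)",
        [("BRCA1", "human"), ("TP53", "human"), ("lacZ", "E. coli")],
    )
    # Non-materialized view: only the defining query is stored;
    # the result is recomputed each time the view is read.
    conn.execute(
        "CREATE VIEW gene_counts_view AS "
        "SELECT organism, COUNT(*) AS n FROM genes GROUP BY organism"
    )
    # Emulated materialized view: the query result is cached in a table.
    conn.execute(
        "CREATE TABLE gene_counts_cached AS "
        "SELECT organism, COUNT(*) AS n FROM genes GROUP BY organism"
    )
    return conn

def human_count(conn, relation):
    # relation is one of the two fixed names above, never user input
    row = conn.execute(
        f"SELECT n FROM {relation} WHERE organism = 'human'"
    ).fetchone()
    return row[0]

conn = build_demo()
conn.execute("INSERT INTO genes VALUES ('MYC', 'human')")
print(human_count(conn, "gene_counts_view"))    # recomputed on demand
print(human_count(conn, "gene_counts_cached"))  # stale until refreshed
```

After the extra insert, the view reflects the new row while the cached table does not, which is the trade-off between the two: freshness against the cost of recomputation.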
NP (NPTIME) The class of problems solvable in non-deterministic polynomial time. Whether every such problem can also be solved deterministically in polynomial time is an open question.

NP-complete The set of problems in NP to which every problem in NP reduces in polynomial time.

NRC Nested Relational Calculus.

object-oriented model An approach to programming and data storage in which objects are the primary concepts. In this approach, data and functionality are tightly coupled. Methods are associated with an object and are the only way to manipulate or access the data contained within that object. This approach also makes use of concepts such as object inheritance, which may not be available in other models.

Object Query Language (OQL) A query language that allows users to access and manipulate data contained in object-oriented databases such as those formalized by the ODMG.


ODBC Open DataBase Connectivity. A widely accepted application programming interface for database access. It is based on the call-level interface specifications from X/Open and ISO/IEC for database APIs, and SQL is its database access language.

ODL Object Definition Language. A standard for object definition specified by the ODMG.

ODMG Object Data Management Group. A standard-making body for object-oriented databases.

OIL (Ontology Inference Layer) A proposal for a Web-based representation and inference layer for ontologies, which combines the widely used modeling primitives from frame-based languages with the formal semantics and reasoning services provided by description logics.

OLAP (on-line analytical processing) OLAP transforms raw data so that it reflects the real dimensionality of the enterprise as understood by the user.

OMG Object Management Group. A standard-setting body focused on developing standards for interoperable enterprise applications.

one-world/multiple-world scenarios Here "world" means a coherent fragment of an application domain (i.e., classes of objects and their relationships that naturally belong together and form a coherent domain, and where the relationships among the objects and classes are evident). Thus, a one-world mediation scenario can be solved without additional cross-world knowledge, while a multiple-world scenario often requires specialized knowledge to bridge semantic gaps.
ontology A description of concepts and relationships that exist among the concepts for a particular domain of knowledge. In the world of structured information and databases, ontologies in life science provide controlled vocabularies for terminology as well as specifying object classes, relations, and functions.

OQL Object Query Language. A standard for querying object-oriented databases specified by the ODMG.

primary key Primary key of a relational table uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique (such as a Social Security number in a table with no more than one record per person), or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server).
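A minimal sketch of a primary key in practice, using Python's built-in sqlite3 module. The person table, its columns, and the sample values are hypothetical, and SQLite's auto-assigned INTEGER PRIMARY KEY stands in here for a DBMS-generated identifier such as a GUID. The ssn column plays the role of a normal attribute that is guaranteed to be unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " person_id INTEGER PRIMARY KEY,"  # value assigned by SQLite when omitted
    " ssn TEXT UNIQUE NOT NULL,"       # natural attribute that must be unique
    " name TEXT)"
)
conn.execute("INSERT INTO person (ssn, name) VALUES ('000-00-0001', 'Ada')")
conn.execute("INSERT INTO person (ssn, name) VALUES ('000-00-0002', 'Bob')")

def insert_duplicate_key(conn):
    """Try to reuse an existing primary key value; return True if rejected."""
    try:
        conn.execute(
            "INSERT INTO person (person_id, ssn, name) "
            "VALUES (1, '000-00-0003', 'Eve')"
        )
    except sqlite3.IntegrityError:
        return True
    return False

ids = [row[0] for row in
       conn.execute("SELECT person_id FROM person ORDER BY person_id")]
print(ids, insert_duplicate_key(conn))
```

The database rejects the third insert because its key value is already taken, which is what "uniquely identifies each record" means operationally.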
probe The individual 25mer sub-sequences that are tiled on a microarray. These are derived from the gene fragments that collectively detect the target transcript. Synonyms used are oligonucleotide sequence and target sequence.

process map (PM) A kind of ontology used in model-based mediation to describe semantic networks of procedural knowledge (i.e., the processes of a domain and how they influence and depend on each other). (See also domain maps.)

proteome Proteins expressed by a cell or organ at a particular time and under specific conditions.

proteomics The study of the full set of proteins encoded by a genome.

pharmacogenomics The study of the interaction of an individual's genetic makeup and response to a drug.

P (PTIME) The class of problems solvable in deterministic polynomial time.

query A program written in a database query language for retrieving and transforming information in a database.

querying Accessing and manipulating a data source using a query language.

query language A language that enables users to access and manipulate data, usually stored within a database management system. Examples of query languages are the relational algebra, the Structured Query Language (SQL), the database logic (Datalog), the Object Query Language (OQL), and XQuery.

query-shipping In the client/server context, query-shipping consists of partially or completely performing a query at the server site and sending only the results to the client. (See data-shipping for an alternate approach.)

RDF The Resource Description Framework (RDF) integrates a variety of applications including library catalogs and World Wide Web directories; syndication and aggregation of news, software, and content; and personal collections of music, photos, and events using XML as an interchange syntax. The RDF specifications provide a lightweight ontology system to support the exchange of knowledge on the Web.
record A data type that represents an object comprising several data fields. Each data field has a name and a value.

relational algebra A query language that allows users to access and manipulate data contained in relations with algebraic operators: union, intersection, difference, selection, projection, Cartesian product, join, and renaming.
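Three of the relational algebra operators (selection, projection, and join) can be sketched in plain Python. This is an illustration under simplifying assumptions, not a database implementation: a relation is modeled as a list of dictionaries, the join is a natural join on shared attribute names, and the genes and proteins relations are hypothetical sample data.

```python
def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    result = []
    for row in relation:
        reduced = {a: row[a] for a in attributes}
        if reduced not in result:
            result.append(reduced)
    return result

def natural_join(left, right):
    """Natural join: combine rows that agree on all shared attributes."""
    result = []
    for l_row in left:
        for r_row in right:
            shared = set(l_row) & set(r_row)
            if all(l_row[a] == r_row[a] for a in shared):
                result.append({**l_row, **r_row})
    return result

genes = [
    {"gene": "BRCA1", "organism": "human"},
    {"gene": "TP53", "organism": "human"},
    {"gene": "lacZ", "organism": "E. coli"},
]
proteins = [
    {"gene": "TP53", "protein": "p53"},
    {"gene": "lacZ", "protein": "beta-galactosidase"},
]

# Compose the operators: join, then select, then project.
human_proteins = project(
    select(natural_join(genes, proteins),
           lambda row: row["organism"] == "human"),
    ["protein"],
)
print(human_proteins)
```

Because each operator takes relations and returns a relation, the operators compose freely, which is the property that query optimizers exploit when they reorder them.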
relational model The standard data model used in commercial database management systems. This data model is based on the relational algebra and presents data as a collection of tables. Each table represents a complex data type, and each column represents an attribute. Each row in a table contains an instance of that type.

RNA Ribonucleic acid.

schema The physical data representation in a database system. It characterizes the way the data is organized in the system (e.g., tables, relations, classes, entities, concepts, etc.).

schema integration The process of mapping source schemas to a global, integrated schema. It consists in (1) identifying the components of a database that are related to one another, (2) selecting the best representation for the global schema, and (3) mapping and integrating the components.

searching Accessing a data source through a phrase (string of characters, keyword, DNA sequence, etc.) or a Boolean expression of phrases. The output is the set of strings (documents, sequences, etc.) that are similar to the given phrase.

semantic mediation See model-based mediation.

Semantic Web This is a collaborative effort led by the W3C aiming to represent data on the World Wide Web. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.

set A data type that represents a homogeneous collection of objects such that the order of appearance and the number of occurrences of these objects in the collection are unimportant.
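The difference between the set and list data types can be seen directly in Python's built-in collections. The short sequence strings below are hypothetical sample values.

```python
reads = ["ACGT", "TTGA", "ACGT"]  # list: order and repetition are kept

as_list = list(reads)
as_set = set(reads)               # set: duplicates collapse, order is irrelevant

print(len(as_list), len(as_set))

# Lists compare position by position, sets compare by membership only.
lists_equal = ["ACGT", "TTGA"] == ["TTGA", "ACGT"]
sets_equal = {"ACGT", "TTGA"} == {"TTGA", "ACGT"}
print(lists_equal, sets_equal)
```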

SGML Standard Generalized Markup Language.

Skolem functions Or more precisely, Skolem function symbols (after Albert Thoralf Skolem, 1887-1963) are used to create symbolic names when eliminating existential quantifiers in first-order logic statements. For example the formula:

\[ \forall P\; \mathit{person}(P) \Rightarrow \exists M\; \mathit{person}(M) \wedge \mathit{mother}(P, M) \]

states that each person P has a mother M. One can obtain a formula that is equivalent with respect to satisfiability by replacing the existential quantifier by a new unary function symbol $f_m(X)$ denoting the mother of X:

\[ \forall P\; \mathit{person}(P) \Rightarrow \mathit{person}(f_m(P)) \wedge \mathit{mother}(P, f_m(P)) \]

In general, a Skolem function depends on those universal quantifiers in whose scope it occurs (here: $\forall P$).
SML Standard ML (Meta Language). A programming language based on the functional programming paradigm. In this paradigm, all programs are expressed as mathematical functions and are generally free from side effects.

SNP Single nucleotide polymorphism.

SOAP The Simple Object Access Protocol (SOAP) Version 1.2 provides the definition of the XML-based information which can be used for exchanging structured and typed information between peers in a decentralized, distributed environment.

sSQL Simplified SQL. An SQL-like query language supported by Kleisli. It extends SQL to the nested relational data model and to multiple, heterogeneous, distributed data sources.

stored procedure A piece of application code, typically including one or more database accesses, that is invoked by the client-side portion of a database application but is executed on the database server. Stored procedures typically are used to reduce client-server communication when multiple accesses to the database are required between interactions with the user.
Structured Query Language (SQL) The standard query language for expressing queries and transformation on relational database management systems. It allows access and manipulation of data with select-from-where statements.

systems biology A new field in biology that is attempting to develop a system-level understanding of biological systems. System-level understanding requires understanding the structures and behaviors of systems as a whole, as well as how to control and design them.

table expression A query language construct that represents data whose value is a table, rather than a scalar value or row. Examples include a reference to a database table, the result of a table function, and a subquery.

target sequence See gene fragment.

target transcript See mRNA.
TC0 This is the class of those languages recognized by polynomial-size, bounded-depth, unbounded fan-in (i.e., no bound on the number of inputs per gate) Boolean circuits augmented by threshold gates (i.e., unbounded fan-in gates that output 1 if and only if more than half of their inputs are non-zero).
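The threshold (majority) gate at the heart of this circuit class is simple to state in code. The following Python sketch models a single gate and a fixed depth-2 arrangement of such gates; the function names are hypothetical, and the sketch says nothing about the size bounds that define the class itself.

```python
def majority_gate(inputs):
    """Threshold gate: output 1 iff more than half of the inputs are non-zero."""
    nonzero = sum(1 for bit in inputs if bit != 0)
    return 1 if 2 * nonzero > len(inputs) else 0

def majority_of_majorities(blocks):
    """Depth-2 circuit: one majority gate per block, then one gate on top."""
    return majority_gate([majority_gate(block) for block in blocks])

print(majority_gate([1, 1, 0]))
print(majority_of_majorities([[1, 1, 0], [0, 0, 1], [1, 1, 1]]))
```

Each gate accepts any number of inputs, which is what unbounded fan-in means, while the nesting depth stays fixed at two regardless of input length.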
transcription The process of synthesizing mRNA from a sequence of DNA (a gene) template.

transcriptome The full complement of activated genes, mRNAs, or transcripts in a particular tissue at a particular time.

translation The process by which the genetic code carried by mRNA directs the synthesis of proteins from amino acids.
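Transcription and translation can be sketched together in a few lines of Python. This is a deliberately simplified model: the input is taken to be the DNA coding strand, so transcription reduces to replacing T with U, and the codon table is a small excerpt of the standard genetic code covering only the codons used in the example. The function names and the sample sequence are hypothetical.

```python
# Excerpt of the standard genetic code; only the codons used below.
CODON_TABLE = {
    "AUG": "Met", "GCU": "Ala", "UGG": "Trp", "AAA": "Lys",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}

def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (5'->3'): T becomes U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Read codons from the first AUG until a stop codon; return amino acids."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        # Raises KeyError for codons missing from the excerpted table.
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide

mrna = transcribe("ATGGCTTGGAAATAA")
print(mrna, translate(mrna))
```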
UML (Unified Modeling Language) The industry-standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems.

UMLS Unified Medical Language System.

URI Uniform Resource Identifier. URLs are the most common form of URI.

URL Uniform Resource Locators (a form of URI) are short strings that identify resources in the Web: documents, images, downloadable files, services, electronic mailboxes, and other resources. They make resources available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail addressable in the same simple way.
variant A data type representing an object that is one of several types. Variant types make it possible to treat data of different types within the same composed type.
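A variant type can be approximated in Python with a Union of record-like classes. The class names, the Sequence alias, and the sample values below are hypothetical and serve only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class DnaSequence:
    bases: str

@dataclass
class ProteinSequence:
    residues: str

# A variant type: a value is exactly one of the listed alternatives.
Sequence = Union[DnaSequence, ProteinSequence]

def length(seq: Sequence) -> int:
    """Dispatch on which alternative the variant currently holds."""
    if isinstance(seq, DnaSequence):
        return len(seq.bases)
    return len(seq.residues)

# One collection holding data of two different types.
mixed = [DnaSequence("ACGT"), ProteinSequence("MAW")]
lengths = [length(s) for s in mixed]
print(lengths)
```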


view A structured presentation of the data contained within a database. The default view of data is the view of the data as defined by the global schema. However, alternative views (e.g., data summaries) may be presented to provide additional insight into the data.
view integration A virtual integration of multiple data sources.

XA An industry-standard interface for transaction management that is based on the X/Open specification. XA allows multiple compliant data managers to cooperate in a single transaction and ensures that all updates in the transaction are either committed or rolled back as a group, regardless of which data manager made each change.

XML Extensible Markup Language. A simple, very flexible text format derived from SGML that is a standard format for structured documents and data on the World Wide Web.

XQuery A standard query language that allows users to access and manipulate data contained in XML documents.

W3C The World Wide Web (WWW) Consortium.

warehouse See data warehouse.

workflow Workflows are used in business applications to assess, analyze, model, define, and implement the core business processes of an organization (or other business entity). A workflow approach automates the business procedures where documents, information, or tasks are passed between participants according to a defined set of rules to achieve, or contribute to, an overall business goal. In the context of scientific applications, a workflow approach may address overall collaborative issues among scientists, as well as the physical integration of scientific data and tools.

wrapper A wrapper is generally used within a mediator-wrapper architecture for integrating multiple data sources. Each data source typically is accessed through an existing interface program. However, the mediator program is unable to communicate directly with this existing interface program, often because of some input-output format incompatibility. A wrapper is a program that handles this incompatibility so the mediator program can communicate with the interface program.
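The mediator-wrapper arrangement can be sketched as three small Python classes. Everything here is hypothetical: LegacySource stands in for an existing interface program that returns tab-separated text, and the field names and sample values are invented for illustration. No real system's API is implied.

```python
class LegacySource:
    """Stand-in for an existing interface program with its own output format."""
    def fetch(self, accession):
        return f"{accession}\tHomo sapiens\t1863"

class Wrapper:
    """Translates the source's output into the records the mediator expects."""
    FIELDS = ("accession", "organism", "length")

    def __init__(self, source):
        self.source = source

    def query(self, accession):
        raw = self.source.fetch(accession)
        record = dict(zip(self.FIELDS, raw.split("\t")))
        record["length"] = int(record["length"])  # text field to number
        return record

class Mediator:
    """Talks only to wrappers, never to the source interfaces directly."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def lookup(self, accession):
        return [w.query(accession) for w in self.wrappers]

mediator = Mediator([Wrapper(LegacySource())])
records = mediator.lookup("X00001")
print(records)
```

Adding a new source with a different output format then means writing one new wrapper, while the mediator stays unchanged.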

System Information

Chapter 5

Name and Version of System: SRS (Sequence Retrieval System), version 7.0. (Additional information is available at http://www.lionbioscience.com/solutions/products/srs.)

Status of Development and Maintenance: SRS is a commercial system and is being further developed by LION bioscience with 2 major releases/year and around four maintenance releases/year. SRS is available to academics free of charge.

Contact: srs@lionbioscience.com

Chapter 6

Name and Version of System: Kleisli discoveryHub (version 5). Available for Solaris, Linux, and Windows platforms.

Status of Development and Maintenance: Kleisli is being developed, maintained, and commercialized by geneticXchange Inc. Academic licenses are available (e.g., Stanford University is an academic customer).

Contact Person: Brian Donnelly
GeneticXchange Inc.
713 Santa Cruz Avenue
Menlo Park, California 94025-4519, USA
Tel: +1 (650) 321-9573
Email: info@geneticxchange.com
URL: http://www.geneticxchange.com
European Operations: Tel: +44 (0)1296 660348; Email: infoeurope@geneticxchange.com
Asia-Pacific Operations: Tel: +61 (0)2 6281 7655; Email: infoapac@geneticxchange.com

Chapter 7: TAMBIS

Name and Version of System: TAMBIS 0.96. A demonstrational Java applet and video examples are publicly accessible. (Additional information is available at http://imgproj.cs.man.ac.uk/tambis/index.html.)

Status of Development and Maintenance: TAMBIS is a public system developed at the University of Manchester in the UK with the support of the Bioinformatics programme of the British Biotechnology and Biological Sciences Research Council (BBSRC) in partnership with the Engineering and Physical Sciences Research Council (EPSRC) and Zeneca Pharmaceuticals. TAMBIS is no longer maintained. An academic license may be obtained.

Contact: tambis-help@cs.man.ac.uk

Chapter 8: K2

Name and Version of System: K2 0.5 alpha. K2 is implemented in pure Java, under JDK 1.2. It is provided as a .jar file, about 850 K, and requires ORO's Perl module for doing regular expression matching and JGL for handling collections. Its OQL/ODL implementation is based on the ODMG 2.0 specification, with some additions, and a few portions that are not yet implemented. (Additional information is available at http://db.cis.upenn.edu/K2/.)

Status of Development and Maintenance: K2/KLEISLI was developed at the University of Pennsylvania and is currently maintained by Scott Harker. Academic licenses are available.

Contact Person:
Dr. Val Tannen
Department of Computer and Information Science
University of Pennsylvania
200 South 33rd Street
Philadelphia, Pennsylvania 19104-6389, USA
Tel: +1 (215) 898-2665
FAX: +1 (215) 898-0587
Email: val@cis.upenn.edu

Chapter 9: P/FDM Mediator

Name and Version of System: P/FDM Mediator. (Additional information is available at http://www.csd.abdn.ac.uk/~gjlk/mediator/.)

Status of Development and Maintenance: The P/FDM Mediator, described in Chapter 9, is a research prototype developed with support from the British Biotechnology and Biological Sciences Research Council (BBSRC) in partnership with the Engineering and Physical Sciences Research Council (EPSRC). This system is not currently developed or maintained.

Contact Person:
Dr. Graham J. L. Kemp
Department of Computing Science, Chalmers University of Technology
SE-412 96, Göteborg, Sweden
Tel: (+46) 31-772-5411
FAX: (+46) 31-165655
Email: kemp@cs.chalmers.se
URL: http://www.cs.chalmers.se/~kemp/

Chapter 10: GeneExpress

Name and Version of System: GX Software System 1.4.2, Genesis 1.1 (both as of November 2002). (Additional information is available at http://www.genelogic.com/products.cfm.)

Status of Development and Maintenance: GX is a commercial system developed at Gene Logic Inc. Continuous maintenance and software upgrades are provided. The next major release, GX 2.0/Genesis 2.0, is planned for summer 2003. Academic licenses are available.

Contact Person:
Dr. Victor M. Markowitz
Gene Logic, Inc.
2001 Center Street
Berkeley, California 94704, USA
Tel: +(510) 981-3141
URL: http://www.genelogic.com/products.cfm

Chapter 11: DiscoveryLink

Name and Version of System: DiscoveryLink is an IBM services offering based on DB2 UDB V7.2 and higher version numbers. (Additional information is available at http://www.ibm.com/discoverylink.)

Status of Development and Maintenance: DB2 UDB is supported via IBM's normal customer support channels. Additionally, customers may contract for services to install, set up, configure, tune, and/or maintain the system, as well as to write new wrappers. These services are optional. DiscoveryLink is available through IBM's scholars program, which offers free licenses for qualifying academic purposes. See http://www-3.ibm.com/sofware.info/university/ for more information.

Contact: ls@us.ibm.com, or visit the Web site at http://www.ibm.com/discoverylink

Chapter 12: KIND

Name and Version of System: KIND Mediator (Knowledge-Based Integration of Neuroscience Data) version 1.01. (Additional information is available at http://www.nbirn.net.)

Status of Development and Maintenance: KIND is under development at the University of California, San Diego, for the Biomedical Informatics Research Network (BIRN), an initiative of the National Center for Research Resources (NCRR), a component of the National Institutes of Health (NIH). Earlier prototypes (KIND versions 0.x) have been demonstrated at various conferences, including the Human Brain Project meetings 2000 and 2001. A demonstration is available at http://www.npaci.edu/DICE/Neuro/ but is no longer actively maintained. Currently the system is completely redesigned and maintained. The system will be public but access is currently limited to the BIRN research group. In the future, a license may be available.

Contact People:
Dr. Bertram Ludäscher
San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive, MC 0505
La Jolla, California 92093-0505, USA
Tel: +1 (858) 822-0864
FAX: +1 (858) 534-5113
Email: ludaesch@sdsc.edu

Dr. Amarnath Gupta
San Diego Supercomputer Center
University of California, San Diego
9500 Gilman Drive
La Jolla, California 92093, USA
Tel: +1 (858) 822-0994
FAX: +1 (858) 534-5113
Email: gupta@sdsc.edu


Index

A
AADM format, 281, 292
Abstract Syntax Notation One (ASN1), 407
abstraction, process, 358-359
access, restriction, in Sequence Retrieval System, 128
accession number, 44
accuracy, 25-27
AADM. See Affymetrix Analysis Data Model
Affymetrix Analysis Data Model, 407
Affymetrix GeneChip microarray, 280
    GeneExpress system and, 282
aggregation, 407
algebra, relational, 42-43, 420
algorithm
    cell averaging, 280
    gene expression data, 286-287
AllGenes project, 53-54
AllGenes query, 57, 58
ampersand, 120
analysis, 404
    complexity of, 6-7
analysis package, Kleisli query system and, 165
analysis program, sensitivity of, 26
analysis software, 19
analysis tool in Sequence Retrieval System, 137-139
annotation, gene
    as integration challenge, 289-290
    standardization involving, 282
annotation data mapping, gene, 295-296
annotation data space, gene, 279
annotation pipeline, genome, 26
anomaly, update, 40
ANSI-SPARC three-schema architecture, 254-257
API. See application programming interface
application programming interface (API), 381, 407
application semantics, 19
architecture
    DiscoveryLink, 309-312
    federated. See federation
    grid, 91-92
    K2, 231-232
    of KIND model-based mediator, 361
    mediator, 256-261
    of Sequence Retrieval System, 111
    three-schema, 254-257
array
    different versions of, 285-286
    probe, 280
ASN1. See Abstract Syntax Notation One
Atlas, SMART, 362-364
automated server maintenance in Sequence Retrieval System, 141-143
automatic summary table, 407
autonomous data source, 18
autonomy of databases, 407

B
bag, 408
Basic Local Alignment Search Tool (BLAST), 25, 137-138, 408
    DiscoveryLink and, 311, 313-316
    FASTA and, 146
    functionality of, 383
    integration of, 45-46
batch queue, 139
benchmarks in performance evaluation, 374-375
bi-valued semantics, 90
Binary Large Object (BLOB), 408
bindjoin, 408
bioinformatics
    biological data integration, 4-7
    definition of, 3
    design of system, 75-101. See also design of biological information system
    future of, 394-396
    problem and scope of, 2-4
    querying vs. browsing, 47
    system development, 7-10
biological data, nature of, 15-17
biological data integration, 7-10
biological database, Kleisli query system and, 165-166
biological ontology, 216-217
biological resource, 397-405
    query processing and, 92-93
biological sample data space, 278-279
biological tool, legacy, 79-80
biology fusion with information science, 2-3
BLAST. See Basic Local Alignment Search Tool
blastn, 408
blastp, 408
BLOB. See Binary Large Object
Boolean circuit, 408
Boolean query, 24
BottomUpOnce strategy, 172
box plot, 408
browsing
    definition of, 408
    design of, 89-90
    example of, 50-52
    querying vs., 46-48
    scientific objects, 100-101
    semantic, in model-based mediation, 344
    strengths and weaknesses of, 61-62
bulk data type, 408
systems, 421

C
calcium channel protein, example using, 319-322
Call-Level Interface (CLI), 409
canned query, 139
capability, source, 93
capturing, relational schema, 125-126
capturing process knowledge, 340-341
CDATA, 408
cDNA, 409
CDS, 409

cell averaging algorithm, 280
Cell-Centered Database, 345-347, 362-364
CGI, 409
challenges of information integration, 11-31
    data integration, 21-24
    meta-data specification, 24-25
    ontology, 27-30
    provenance and accuracy, 25-27
    Web presentations, 30-31
Character Large Object (CLOB), 409
CLI. See Call-Level Interface
CLOB. See Character Large Object
CLUSTAL, 137-138
clustering technique, 13
CM. See Conceptual Model
CNS tissue, 409
co-clustered fragment, 409
code
    Icarus, 113-115
    Perl, 167, 168
code generator, 260
Collection Programming Language (CPL)
    definition of, 409
    DiscoveryLink and, 308
    K2 system and, 228
    P/FDM mediator and, 267
    query processor and, 205
combining old and new data, 68
Common Object Request Broker Architecture (CORBA), 22, 91, 140, 141
    definition of, 410
    TAMBIS and, 214-215
comparative genomics, 409
compensation in query optimization, 317-318
compilation of domain maps, 354-355
compiler
    condition, 260
    execution plan, in KIND model-based mediator, 362
complex DTDs, 121
complex multiple-world scenario, 336-337
complex objects in Sequence Retrieval System, 134
complex value data, 233, 409
composite structure, links to create, 136
composition, view, 68
comprehension syntax-based language, 151
Comprehensive Data Center, 397
computational analysis tool, 19
concept
    definition of, 190
    parameterized, 356-357
    recursive, 356
    restricting of, 200-201
    role as, 353
    in system design, 85-86
concept description, query as, 197-202
concept integration, 4-5
concept overloading, 5
Conceptual Model (CM), 410
conceptual schema, 44, 255
condition compiler, 260
consortium, Gene Ontology, 29
construction, of links, 131-132
context-sensitive optimizations, 171-174
contextual references, in model-based mediation, 349
contextualization, in model-based mediation, 344, 350-351
controlled vocabulary, 40
CORBA. See Common Object Request Broker Architecture
cost model in performance evaluation, 372-374
cost of query processing, 96-97
    DiscoveryLink and, 318, 322-326
coverage of information sources, 92
CPL. See Collection Programming Language
CPL2Perl, 176-179
CPU, 410
creating wrapper in DiscoveryLink registration, 313
criterion, 193
curated database, 26
curated gene data source, simple, 37-38
curation, data, definition of, 410


D
Daplex query language
    capabilities of, 264-265
    example using, 261, 262, 264
    functional data model and, 252, 253
data
    model, in K2 information integration system, 232-235
    multimedia, 99-100
    standardization involving, 282
data cleansing, 410
data curation, 410
data dictionary, 22
data distribution, in system evaluation, 386-387
data-driven integration, 91-92
data driver
    decoupled, 242
    integrated, 241
data exchange
    for integration of third-party gene expression data, 291-293
    standards for, 282
data federation, use case, 68-69
data format, updating of, 6
data fusion, 82, 410
data integration. See Integration, data
data loading, 296-297
data management, 35-69
    basics, 36-39
    gene expression, 277-299. See also gene expression data management
    relational model, 41-44
    retrieving genes, 38-39
    semi-structured text files, 40-41
    simple curated gene data source, 37-38
    spreadsheets, 39-40
    traditional, 41-44
    transforming of database structure, 44
data mapping, semantic, 293-296
data mining, 87-89, 411
data model, 411
    in K2 information integration system, 232-235
    non-relational, 64
    relational, 41-44
    strengths and weaknesses of, 64
data organization, traditional, 81
data provenance, 25-27
data provider in model-based mediation, 343
data replication approach, 250-251
data repository, 4
data-shipping, 411
data source
    characteristics of, 17-19
    definition of, 147, 411
    DiscoveryLink registration and, 314
    gene expression data management and, 290
    in K2 information integration system, 240-242
    Kleisli query system and, 165-167
    mediator and, 349-351
    P/FDM mediator and, 265-266
    simple curated gene, 37-38
    Web, 65-67
data space, gene expression, 278-281
    biological sample, 278-279
    gene annotation, 279
    gene expression measurement, 279-281
data transformation, 5
data type, 411
data warehouse. See Warehousing
databank
    definition of, 112
    relational, viewing entry from, 128-129
    XML, loading from, 135
Databank, in Sequence Retrieval System, 109-116
database
    autonomy of, 407
    biologic, Kleisli query system and, 165-166
    cell-centered, 362-364
    definition of, 36, 112, 410
    Expressed Sequence Tag, 319
    flat files vs., 78
    heterogeneous, definition of, 414
    link-driven federation of, 415-416
    number of, 4
    patent, Kleisli query system and, 164
    relational, query performance to, 128
    virtual, in DiscoveryLink, 305

database database management, traditional, traditional, 80-81 database database management management system system (DBMS), 36 definition of, 4 11 411 relational, 21-22 21-22 database structure, structure, transforming, transforming, 44 database system, 405 Datalog, 4 11 411 DCOM. DCOM. See See Microsoft Microsoft Distributed DBMS. See See database database management system 1 3-314 DDL statement, 3 313-314 declarative access, procedural access vs., vs., 49 declarative query query language, 63 decomposition, query, query, 68 decoupled data driver, 242 definition integrated integrated view, view, 345 intensional, 348-349 348-349 delivery pattern in query query processing, 93 Department of Energy Energy unanswerable query challenge, 226, 226, 228, 229, 375-376 375-376 deployment issues in GeneExpress system, system, 283-284 283-284 description, concept, query 97-202 query as, 1 197-202 description logic ontology, 194 description logics, 4 11 411 design o f biological information system, of 75-101 75-101 browsing, 89-90 89-90 concepts concepts and ontologies, ontologies, 85-86 85-86 data fusion, 82 engineering vs. vs. experimental science, 76-77 76-77 Component Component Object Model Model DB2 DataJoiner, 306 materialized vs. vs. 
non-materialized approach and, 386 query processing in, 1 6-326 in, 3 316-326 determining costs, 322-326 322-326 1 9-322 example of, 3 319-322 optimization and, 3 1 7-319 317-319 system information for, for, 428 428 distributed data, 45 distributed database systems, 4 11 411 distributed integration approach, 22 distributed object technology, 91 distribution, data, in system evaluation, 386-387 386-387 diversiry, diversity, 15-16, 15-16, 19-20 19-20 DNA, DNA, definition of, 412 DNA microarray, 412 DNA sequence, resources for, 397-398 397-398 DNA sequencing, 412 DNA domain, constantly changing, 80 80 domain domain map, 335 335 definition of, 412 412 for model-based mediator system, 352-357 352-357 compilation of, of, 354-355 354-355 definition of, of, 352-353 352-353 deriving deriving role hierarchy, 355-356 355-356 as logic logic rules, 354-355 354-355 parameterized parameterized role and concepts, 356-357 356-357 recursive concepts, 356 reified roles roles as concepts, 353 remarks, 355 role hierarchy, 354 domain domain semantics, 337 domain-specific domain-specific benchmark, 374 driver decoupled data, 242 integrated data, 241 DTD 21 DTD file, file, complex, 1 121 DTDGenerator, 120-121 120-121 ER model, 412 412 error propagation of, 26 in spreadsheet, spreadsheet, 40 EST sequence, sequence, definition of, 412 European Bioinformatics Institute (EBI), (EBI),

91
evaluation, query, 95, 96 evaluation matrix, 372 372 evaluation of data management system, system, 9-10, 9-10, 371-390 371-390 implementation criteria for, 376-3 81 376-381 efficiency, efficiency, 377-378 377-378 extensibility, extensibility, 378-379 378-379 functionality, 379 scalability, 379-380 379-380 understandability, 380 usability, 81 usability, 3 381 performance model for, for, 371-376 371-376 benchmarks, 374-375 374-375 cost model, 372-374 372-374 evaluation matrix, 372 372 tradeoffs in, 385-389 385-389 data data distribution and heterogeneity, 386-387 386-387 integrating applications, 389 materialized vs. vs. non-materialized approach, 3 85-386 385-386 semi-structured vs. vs. fully fully structured data, 387-388 387-388 user criteria for, for, 382-385 382-385 efficiency, efficiency, 382 extensibility, extensibility, 382-383 382-383 functionality, 383 scalability, 383 understandability, 384 usability, 84-385 usability, 3 384-385 evolution biology, 1 2 12 Excel, 3 3 9-40 9 -4 0 exchange format format Kleisli, Kleisli, 156, 157 self-describing, self-describing, 156 standards for, for, 282 for for third-party gene gene expression data integration, 291-293 291-293 execution plan compiler in KIND model-based mediator, 362 experimental science, engineering vs., vs., 76-77 76-77

fully structured vs. vs. semi-structured, 82-84 82-84 generic system vs. vs. query-driven, 77-78 77-78 legacy data and tools, 78-80 78-80 queries, 86-98. 86-98. See See also also Query scientific object identity, 84-85 identity, 84-85 searching, 87-89 87-89 tool-driven vs. vs. data-driven, 91-92 91-92 traditional database database management, 80-81 80-81 visualization, 98-101 development process, 9 dictionary data, 22 in K2 system, system, 233 difference difference operation, operation, 42 discovery discovery process, life sciences, sciences, 12-14 12-14 discoveryHub, efficiency of, of, 377 377 DiscoveryLink, 24, 24, 55-58, 55-58, 303-331 303-331 approach, 6 approach, 306-31 306-316 architecture, 309-312 309-312 registration, 3 1 3-316 313-316 ease ease of use, scalability, scalability, and and performance performance of, of, 327-329 327-329 efficiency efficiency of, 377 377 functionality functionality of, 383 Kleisli query system and, 181-182 181-182

EBI. EBI. See See European Bioinformatics Institute EcoCyc, EcoCyc, 216 efficiency as implementation criterion, 377-378 377-378 as user criterion, 382 elaboration, elaboration, process, 358-359 358-359 elaboration identifier, 358 EMBOSS, 38 EMBOSS, 1 138 Empty syntax, XML and, 1 1 8- 1 1 9 118-119 end user in model-based mediation, 344 engineering experimental science vs., vs., 76-77 7 6 -7 7 knowledge, 353 entity, 1 9-120 entity, general, 1 119-120 Entrez interface, 88-89 88-89 entry entry ID, hub hub table as, 126 environment, for for life science discovery, 14-15 14-15 ENZYME, ENZYME, 403 enzyme, enzyme, definition definition of, 412

explorer window in TAMBlS, TAMBIS, 195-197 195-197 exporter in PIFDM P/FDM mediator, 251 exporting from SRS to XML, XML, 136-137 136-137 Expressed Sequence Tag database, 3 19 319 expression shorthand, 1 1 9-120 119-120 table, table, 421 expression profile, 13 extensibility as implementation criterion, 378-379 378-379 as user criterion, 382-383 382-383 extensible markup markup language (XML), 43-44 43-44 for biological biological Web Web services, 30-31 30-31 browsing and, 90

extensible markup markup language (XML), fragment, gene, 289 definition of, of, 413 frame-based system, 217 frame of reference, terminological, 347 FTP, 413 fully structured data, semi-structured data vs., vs., 387-388 387-388 82-84 functional data model, 252-254 252-254 functional genomics, 4 13 413 functional programming language, 4 13 413 functionality as implementation criterion, 379 as user criterion, criterion, 383 fuser, result, 261 fusion data, 82 definition of, 4 10 410 vertical loop, 170 future future of bioinformatics, 394-396 394-396


gene expression measurement data space, 279-281 279-281 gene fragment, definition of, 4 13 413 Gene Logic, DiscoveryLink and, 3 08 308 Gene Nomenclature Nomenclature Committee (HGNC), (HGNC), 28, 402 Gene Oncology (GO) Consortium, Consortium, 29, 217 217 description of, 402 gene 13 gene product, product, 4 413 GeneCards, GeneCards, search in, 66-67 66-67 GeneChip, 4 13 413 GeneChip microarray, 2 80 280 GeneExpress, system information information for, 427 GeneExpress GeneExpress Data Warehouse Warehouse (GXDW), 283-284 283-284 gene annotation annotation component of, 290 GeneExpress system, 282-284 282-284 algorithms in, 286-287 286-287 components components of, of, 283 deployment and update update issues issues in, 283-284 283-284 integrating integrating third-parry third-party expression data

(cont.) (cont.)
categories of, 83 database database integration into Sequence Retrieval System, 1 1 6-124 116-124 challenge of, of, 122-124 122-124 procedure procedure for, 120-121 support support features, features, 121-122 121-122 uniqueness of, 1 8-120 of, 1 118-120 definition of, of, 423 423 exporting objects from SRS, 7 SRS, 136-13 136-137 loading from, from, 135 semi-structured vs. vs. fully structured data and, and, 387-388 387-388 Sequence Retrieval System and, 1 1 0, 110, 1 1 6-124 116-124 TAMBIS and, and, 215 wrapper, 12 wrapper, 3 312 external external schema, 254 navigational capabilities of, 90

fully structured structured information sytem, system, fully

F
FASTA, 137-138, 146
Feature table of GenBank, 159
federation, 22
    definition of, 412
    DiscoveryLink based on, 306
    example of, 54-58
    link-driven, 415-416
    P/FDM mediator and, 249-272
        alternative architectures for integration, 250-252
        analysis, 266-272
        data sources, 265-266
        example of, 261-264
        functional data model, 252-254
        mediator architecture, 257-261
        query capabilities, 264-265
        schemas in federation, 254-257
    Sequence Retrieval System and, 143
    use case, 68-69
    warehousing vs., 49
fields, SRS, 130
file
    hypertext markup language, 147-148
    probe intensity, 281
    semi-structured text, 40-41
filler, 193
filter, 208
First Order logic, 413
flat file, database vs., 78
flat file databank integration, 112-116
foreign key, 413
format
    data
        semi-structured text, 40-41
        updating of, 6
    exchange
        Kleisli, 156, 157
        self-describing, 156
        standards for, 282
        for third-party gene expression data integration, 291-293
    self-describing exchange, 156


G

Garlic project, 306-307
GCG. See Genetics Computer Group
GDB. See Genome DataBase
GenAtlas, querying in, 85
GenBank
  accession number, 44
  feature table of, 159
  identifiers in, 100-101
  Kleisli query system and, 150
  materialized vs. non-materialized approach and, 385-386
  search in, 66-67
gene, definition of, 413
gene annotation
  as integration challenge, 289-290
  standardization involving, 282
gene annotation data mapping, 295-296
gene annotation data space, 279
gene chip microarray technology, 414
gene data source, simple curated, 37-38
gene discovery, 319
gene expression, 399, 413
Gene Expression Array (GXA), 283-284
gene expression data management, 277-299
  data spaces, 278-281
    biological sample, 278-279
    gene annotation, 279
    gene expression measurement, 279-281
  GeneExpress system for, 282-284
  integration in, 285-290
    algorithms and normalization and, 286-287
    array versions and, 285-286
    gene annotation and, 289-290
    sample data and, 288
    of third-party gene expression data, 291-298
    variability and, 287-288
  in, 291-298
  sample data in, 288
general entity, 119-120
generator
  code, 260
  logic plan, 360-361
generic approach, 49-50
  query-driven approach vs., 77-78
  strengths and weaknesses of, 63
generic benchmark, 374
generic query optimization, 267-268
genetics, 399
Genetics Computer Group (GCG), 307-308
genome
  definition of, 414
  resources of, 398
genome annotation pipeline, 26
Genome DataBase (GDB)
  Kleisli query system and, 150-151
  materialized vs. non-materialized approach and, 385-386
  object identity and, 84-85
genome project, 414
genomic data source as integration challenge, 289-290
Genomic Unified Schema, 385-386
genomics, 414
  functional, 413
  research needs of, 12-13
GenPept report, 153-154
  creating warehouse of, 164-165
Glimpse search engine, 88
global-as-view technique, 216
  definition of, 414
  in model-based mediation, 349, 350
global integration schema, 266
global schema, 45-46, 414
Globus Pallidus External, 351
GO databank in Sequence Retrieval System, 126-127

GRAIL, 202
  query, 205-206
  query planner, 208-211
graphical interface, 179
graphical user interface, for P/FDM, 269, 271
Grid, 414
grid architecture, 91-92
GUI, 414
GXA. See Gene Expression Array
GXDW. See GeneExpress Data Warehouse

H

hard-coding, 49-50
  legacy tools including, 80
  strengths and weaknesses of, 63
hardwired access to data sources, 304
hardwiring of mapping in GeneExpress system, 295
hash table, 321
heterogeneity
  in semantic data integration, 58-59
  syntactic and semantic, 212
heterogeneous data format, 18, 19-20
heterogeneous database, definition of, 414
HGNC. See Gene Nomenclature Committee
hierarchy, role, 355-356
hierarchy, in GeneExpress system, 293
host variable, 414
HTML. See hypertext markup language file
HTTP, 414
hub table, 126-127
HUGO. See Human Genome Organization
HUGO name, withdrawn or approved, 84-85
human computer interaction, 375
Human Genome Initiative, 415
Human Genome Organization (HUGO), 28, 402
Human Genome Project, 415
hybrid integration approach, 64-65
hybridization, 415
hypertext markup language file (HTML), 147-148
hypothesis as design step, 76

I

IBM DiscoveryLink middleware system, 24
Icarus code, 113-114
ICode, 257-258, 261, 262-263
ICode rewriter, 260
ID, entry, hub table as, 126
identifier, elaboration, 358
identity
  pre-defined, 81
  scientific object, 84-85
ImMunoGeneTics information system, 403
implementation, experiment as, 76
implementation criteria system evaluation, 376-381
  efficiency, 377-378
  extensibility, 378-379
  functionality, 379
  scalability, 379-380
  understandability, 380
  usability, 381
in silico discovery kit (ISDK), 160, 161, 415
indexing, SRS support for, 121-122
indexing tool output, 138
industrial merger, 303
information integration
  in bioinformatics, 213-215
  biologic ontologies, 216-217
  data challenges, 21-24
  data provenance and accuracy, 25-27
  knowledge based, 215-216
  meta-data specification, 24-25
  ontology, 27-30
  Web presentations, 30-31
information integration system, K2, 225-247. See also K2 information integration system
information science, fusion with biology, 2-3
Informax, 307
Infosleuth, 266
initial process semantics, 357
input, processing of, 138
input/output format, 19
integrated data driver, 241
Integrated Taxonomic Information System, 402
integrated view definition, 345
integrated view of biology, 12
integration
  schema, 421
  in system evaluation, 389
  view, 423
integration, data, 4-10, 60-69
  browsing vs. querying, 46-48, 61-62
  as challenges, 21-24
  challenges of, 11-31
  concept, 4-5
  declarative query language, 63
  definition, 410
  development process, 9
  evaluation of, 9-10
  of flat file databanks with SRS, 112-116
  of gene expression data, 285-290
    algorithms and normalization and, 286-287
    array versions and, 285-286
    gene annotation and, 289-290
    sample data and, 288
    variability and, 287-288
  generic approach to, 63
  hard-coded approach to, 63
  hybrid approach to, 64-65
  issues of, 4-7
  procedural code, 63
  relational vs. non-relational, 64
  semantic, 58-60
  semantic query planning, 65-67
  specifications for, 7-8
  syntactic vs. semantic, 48-49
  technical approach, 8-9
  of third-party gene expression data, 291-298
    data exchange formats for, 291-293
    data loading issues in, 296-297
    semantic data mapping issues in, 293-296
    structural data transformation issues in, 293
    update issues in, 297-298
  tool-driven vs. data-driven, 91-92
  use case for, 45-46
  Web data sources, 66
integration schema, global, 266
intensional definitions, 348-349
intensity file, probe, 281
interaction, human computer, 375
interface
  application programming, 407
  Entrez, 88-89
  graphical, 179
  in K2 information integration system, 243-244
  keyword-search querying, 24
  Kleisli query system and, 166
  for P/FDM, 268-271
  to Sequence Retrieval System, 139-141
  TAMBIS, 195-205
    constructing queries, 197-202
    exploring ontology, 195-197
    query processor, 205-212
    reasoning in query formulation, 202-205
intermediary, 8
internal language, of K2 information integration system, 239-240
internal schema, 254, 256
International Classification of Diseases, Ninth Revision, 402
International Organization for Standardization, 415
International Union of Biochemistry and Molecular Biology (IUBMB), 28, 403
International Union of Pure and Applied Chemistry (IUPAC), 28, 403
is a hierarchy, 192
ISA relationship, 415
ISDK. See in silico discovery kit
ISO. See International Organization for Standardization
iteration, 207
IUBMB. See International Union of Biochemistry and Molecular Biology

IUPAC. See International Union of Pure and Applied Chemistry

J

Java-based visual interface, for P/FDM, 268
Java DataBase Connectivity (JDBC), 229, 415
Java RMI, 241-242
JDBC. See Java DataBase Connectivity
join, 42
joining data in DiscoveryLink query processing, 317-318
joins, spatial, 337
Journal of Nucleic Acid Research, 17

K

K2 information integration system, 225-247
  approach in, 229-232
  data model and languages in, 232-235
  data sources in, 240-242
  example of, 235-239
  impact of, 245-246
  internal language of, 239-240
  Kleisli vs., 228-229
  query optimization in, 242-243
  scalability of, 244-245
  system information for, 426
  user interfaces in, 243-244
K2MDL, 231-232, 415
KEGG. See Kyoto Encyclopedia of Genes and Genomes
key, primary, 81
keyword-search querying interface, 24
KIND mediator prototype, 360-362
  system information for, 428-429
  understandability of, 381, 384
Kleisli query system, 23-24, 147-184
  approach of, 151-153
  data model and representation in, 153-157
  data sources in, 165-167
  DiscoveryLink and, 181-182
  efficiency of, 377-378
  functionality of, 383
  K2 information integration system vs., 228-229
  motivating example for, 149-151
  Object-Protocol Model and, 182-183
  optimizations, 167-169
    context-sensitive, 171-174
    monadic, 169-170
    relational, 174-175
  query capability of, 158-163
  Sequence Retrieval System and, 179-181
  system information for, 425
  understandability of, 384
  user interfaces, 175-179
    graphical, 179
    program language, 175-179
  warehousing capability of, 163-165
knowledge, process, 340-341
knowledge base, 90
knowledge based information integration, TAMBIS, 215-216
knowledge engineering, 353
knowledge representation in model-based mediator system
  domain maps for, 352-357
    compilation of, 354-355
    definition of, 352-353
    deriving role hierarchy, 355-356
    as logic rules, 354-355
    parameterized role and concepts, 356-357
    recursive concepts, 356
    reified roles as concepts, 353
    remarks, 355
    role hierarchy, 354
  process maps for, 357-360
    domain maps and, 358
    initial process, 357
    as logic rules, 359-360
    process elaboration and abstraction, 358-359
known gene, 416
KRAFT, 266
Kyoto Encyclopedia of Genes and Genomes (KEGG), 416

L

Laboratory Information Management System (LIMS), 13, 127
  definition of, 416
  GeneChip, 281
  output, 20
language
  Daplex, 253
  extensible markup. See extensible markup language (XML)
  functional programming, 413
  of K2 information integration system, 232-235, 239-240
  query
    definition of, 419
    limitations of, 86-87
    SRS, 129-130
legacy data and tools
  biologic, 78-79
  workflows, 79-80
LENS, 86
library, subentry, 116
life sciences discovery process, 12-14
LIMS. See Laboratory Information Management System
link
  browsing, 89-90
  in browsing scientific objects, 100
link-driven federation of databases, 416
link operator in SRS query language, 132-133
linking, databank, to Sequence Retrieval System, 130-133
LION, 307
LISP, 416
list, definition of, 416
list comprehension, 257
literature reference, 401
loader, object, in Sequence Retrieval System, 133-137
loading
  data, 296-297
  from XML databank, 135
local-as-view technique, 216
  definition of, 416
  in model-based mediation, 350-351
local ontology, in model-based mediation, 344
local schema, 45-46
LocusLink, 403
logic
  First Order, 413
  temporal, 90
logic plan generator, 360-361
logic rule
  domain map as, 354
  process map as, 359-360
logics, description, 411
LOGSPACE, 416
long-term potentiation in nerve cell, 340
loop design, 76
loosely coupled system, 250

M

maintenance, automated server, in Sequence Retrieval System, 141-143
management
  data, 35-69. See also data management
  multimedia, 99-100
  schema, 67-69
  space, 373
  time, 372-373
  traditional database, 80-81
map
  domain, 335
    definition of, 412
    in neuroscience, 339-342
  process, 335
    definition of, 419
  simple process, 342
  subprocess, 359
mapped role, 208
mapping
  P/FDM mediator and, 263
  schema, 68
  semantic data, in integration of third-party expression data, 293-296
markup language, extensible. See extensible markup language
MAS. See microarray suite, GeneChip
MAS algorithm, 286-287

materialized approach, 385-386
materialized view, 44, 416
matrix
  evaluation, 372
  GXA, 283-284
MBM. See model-based mediation
measurement data space, gene expression, 279-281
mediation, semantic, 364
mediator
  definition of, 417
  sources and, 349-351
mediator architecture, 256-261
mediator database system, 22-24
mediator system
  K2, 230-231
    description of, 237-239
  model-based, 335-366. See also model-based mediator system
  P/FDM, 249-272. See also P/FDM mediator
  prototype, 261-266
MEDLINE, 66
MEDLINE report, 153
merger, industrial, 303
meta-data, 56
  Sequence Retrieval System and, 109-110, 111
meta-data specification, 24-25
meta language (ML), 417
MGED. See Microarray Gene Expression Database society
MIAME. See minimum information about microarray experiment
microarray
  different versions of, 285-286
  DNA, 411-412
microarray analysis, 404
Microarray Gene Expression Database society (MGED), 281, 417
microarray suite algorithm, 286-287
microarray suite (MAS), GeneChip, 280
microarray technology, gene chip, 414
Microsoft Distributed Component Object Model (DCOM), 91
Microsoft Visual Basic, 40
middleware, 417
middleware system, DiscoveryLink, 24. See also DiscoveryLink
minimum information about a microarray experiment (MIAME), 281-282, 417
mining, data, 87-89, 411
mismatch probe, 280
ML. See meta language
model
  conceptual, 410
  cost, 372-374
  data, relational, 41-44
  ER, 412
  functional data, 252-254
  object-oriented, 418
  relational, 420
  sources and services, 206-208
model-based mediation (MBM), 417
model-based mediator system, 335-366
  background of, 336-337
  Cell-Centered Database and SMART Atlas, 362-364
  challenges from neurosciences, 338-342
  conceptual models and source registration at, 344-349
    for Cell-Centered Database, 345-347
    contextual references, 349
    creating terminological frame of reference, 347
    intensional definitions, 348-349
    ontological grounding of OM(S), 348
    semantics of relationships in, 347-348
  domain maps for, 352-357
    compilation of, 354-355
    definition of, 352-353
    deriving role hierarchy, 355-356
    as logic rules, 354-355
    parameterized role and concepts, 356-357
    recursive concepts, 356
    reified roles as concepts, 353
    remarks, 355
    role hierarchy, 354
  interplay between mediator and sources, 349-351
  KIND mediator prototype, 360-362
  process maps for, 357-360
    domain maps and, 358
    initial process, 357
    as logic rules, 359-360
    process elaboration and abstraction, 358-359
  protagonists in, 343-344
  reason-able meta-data, 365-366
  related work, 364-365
module
  optimizer, 260
  reordering, 260
monad approach, 228
monadic optimizations, 169-170
motif, 192, 204
motivating use case, 45-46, 47
Mouse Genome Database
  syntactic vs. semantic integration, 48-49
  use case for integration, 45-46
mRNA, 417
multi-database approach, 251-252, 417
multidisciplinary approach, 15
multimedia data, 99-100
multiple sequence alignment, 404

N

name, HUGO, withdrawn or approved, 84-85
National Biological Information Infrastructure, 402
NCBI Entrez, 51-52
NCMIR, 338-339
nested object in Sequence Retrieval System, 134
Nested Relational Calculus (NRC), 152, 163, 418
nested relationalized version of SQL, 151-153
nested structure in K2 system, 226
neuroinformatics, 12
neuroscience, data integration in, 338-339
nomenclature, sample data mapping, 294-295
non-databased query, 175-176
non-materialized approach, 385-386
non-materialized view, 44, 418
non-relational data model, 64
  relational data model vs., 50
nonsensical question, 201-202
normal syntax, XML and, 118
normalization, gene expression data and, 286-287
novel gene discovery, 319
NP (NPTIME), 418
NP-complete, 418
NRC. See Nested Relational Calculus
number, accession, 44

O

OASIS, 31
object
  browsing of, 100-101
  complex and nested, 134
  Sequence Retrieval System, 140-141
Object Data Management Group (ODMG), 231-233, 418
Object Definition Language (ODL), 418
object identity, scientific, 84-85
object loader in Sequence Retrieval System, 133-137
  complex and nested objects, 134
  exporting objects to XML, 136-137
  links to create composite structures, 136
  support for, 135
Object Management Group (OMG), 22, 28, 419
object model, 344
object-oriented database, 308
object-oriented interface to Sequence Retrieval System, 140-141
object-oriented model, 418
object-oriented programming, 253, 254
object-oriented technology, 22
Object-Protocol Model (OPM), 24
  DiscoveryLink and, 308
  Kleisli query system and, 162, 182-183
  system based on, 85-86
  TAMBIS and, 213-214
Object Query Language (OQL), 86, 419
  definition of, 418
  K2 system and, 228
ODB-Tools, 365

ODBC. See Open DataBase Connectivity
ODL. See Object Definition Language
ODMG. See Object Data Management Group
OIL. See Ontology Inference Layer
OLAP. See on-line analytical processing
OMB. See Ontology for Molecular Biology
OMG. See Object Management Group
on-line analytical processing (OLAP), 419
one-world/multiple-world scenarios, 419
ontological grounds of OM(S), 348
ontology, 27-30
  biological, 216-217
  definition of, 419
  in model-based mediation, 344
  neuroscience, 339
  in system design, 85-86
  TAMBIS, 192-197, 214, 219-220
Ontology Inference Layer (OIL), 418
Ontology for Molecular Biology (OMB), 217
Open DataBase Connectivity (ODBC), 418
OPM. See Object-Protocol Model
optimization, query, 95-98
  Daplex and, 264
  in DiscoveryLink, 317-319
  generic, 267-268
  in K2 information integration system, 242-243
  Kleisli query system and, 167-169
  monadic, 169-170
  relational, 174-175
  semantic, 258, 267
optimizer module, 260
OQL. See Object Query Language
Oracle, 308
Oracle wrapper, 311
organ resources, 401
organism resources, 401
organization, data, 78-79
  traditional, 81
output, processing of, 138
overloading, concept, 5

P

P (PTIME), 420
P/FDM mediator, 249-272
  alternative architectures for integration, 250-252
  analysis, 266-272
    optimization, 267-268
    scalability, 271-272
    user interface, 268-271
  data sources, 265-266
  example of, 261-264
  functional data model, 252-254
  mediator architecture, 257-261
  query capabilities, 264-265
  schemas in federation, 254-257
  system information for, 427
package, analysis, 165
parameterized roles and concepts, 356-357
parser module, 257
parsing tool output, 138
patent database, 166
pattern, in query processing
  delivery, 93
  statistical, 93
pattern recognition, 405
perfect-match probe, 280
performance model for system evaluation, 371-376
  benchmarks, 374-375
  cost model, 372-374
  evaluation matrix, 372
performance of DiscoveryLink, 327-329
Perl codes, 167, 168
pharmacogenomics, 400-401
  definition of, 420
pharmacology research, 304
phrase-based system, 217
phylogeny and evolution biology, 12
pipeline, genome annotation, 26
planning, query, 94-95
Plant Ontology Consortium, 402
platform, establishing, 8
pre-defined identity, 81
pre-processing, 138
precision, of text retrieval, 388-389
primary key, 81, 419
Prisma, SRS, 141-143
probe, definition of, 419
probe array, 280
probe array version, 285
probe data, 280
probe intensity file, 281
probe pair, 280
procedural access, declarative access vs., 49
procedural code, 63
process
  life sciences discovery, 12-14
  map, definition of, 419
process elaboration and abstraction, 358-359
process knowledge, capturing, 340-341
process map, 335
  in neuroscience, 339
  simple, 342
process maps for model-based mediator system, 357-360
  domain maps and, 358
  initial process, 357
  as logic rules, 359-360
  process elaboration and abstraction, 358-359
process semantics, initial, 357
processing, query, 92-98
processor, query, 205-212, 220. See also query processor
profile, user, 7-8
program, structural recursion, 162-163
programming, object-oriented, 253
programming interface, application, 407
programming language, functional, 413
projection, 42
Prolog, 254
propagation of errors, 26
protein, calcium channel, 319-322
protein domain, 400
protein family, 400
protein sequence, resources for, 397-398
proteome, definition of, 419
proteomics, 400, 419
prototype mediator, 261-266
  KIND, 360-362
provenance, 25-27
provider
  data, 343
  view, 343-344
Public Catalog of Databases, 17
public data source, 17-18
PubMed
  identifiers in, 100-101
  search in, 51-52, 66-67, 89

Q

query, 86-98
  AllGenes, 57, 58
  Boolean, 24
  browsing, 89-90
  cost of processing, 322-326
  Daplex, 252, 261, 262, 264
    capabilities of, 264-265
  definition of, 420
  DiscoveryLink and, 305-306, 316-326
    architecture and, 309-310
    determining costs, 322-326
    example of, 319-322
    optimization and, 317-319
  efficiency of, 377-378
  old and new data, 68
  reasoning in formulation of, 202-205
  in relational database, 128
  searching and mining, 87-89
  semantics of, 90
  in Sequence Retrieval System, 128, 129-130
  SQL, 127
  in TAMBIS, 191, 197-202
  unanswerable, 226, 228, 229, 375-376
  to Web interface, 139
query decomposition, 68
query-driven approach, 77-78
query execution plan, 65
query language
  declarative, 63
  definition of, 420
  SRS, 129-130
  standard, 43-44
query optimization
  in K2 information integration system, 242-243
  semantic, 258
query processing, 92-98
  biological resources in, 92-93
  optimization in, 95-98
  planning in, 94-95

IIndex ndex

439 439
query processor, TAMBIS, 205-212, 220
  query planner, 208-211
  sources and services model, 206-208
  syntactic and semantic heterogeneity, 212
  wrappers, 211-212
query rewriter in KIND model-based mediator, 362
query-shipping, 420
query splitter, 260, 268
query system, Kleisli, 147-184. See also Kleisli query system
querying, 420
  browsing vs., 46-48
  object identity and, 84-85
  SRS support for, 121-122
  strengths and weaknesses of, 61-62
querying interface, keyword-search, 24
question, nonsensical, 201-202
queue, batch, 139

R

RDBMS. See relational database management system
RDF. See Resource Description Framework
reason-able meta-data, 365-366
reasoning, in query formulation, 202-205
record, definition of, 420
recursion program, structural, 162-163
recursive concept, 356
reductionist molecular biology, 12
registration
  in DiscoveryLink, 309
    process of, 313-316
  in model-based mediation, 344-349
reified roles as concepts, 353
relational algebra, 42-43, 420
relational data model, 41-44
relational database, 153
  integration into Sequence Retrieval System, 124-129
    capturing relational schema, 125-126
    hub table selection, 126-127
    query performance, 128
    restricting access, 128
    SQL generation, 127
    summary of, 129
    viewing entries, 128-129
    whole schema integration, 124-125
  non-relational model vs., 50
  query performance to, 128
  strengths and weaknesses of, 64
  viewing entry from, 128-129
relational database management system (RDBMS), 21-22
  Kleisli query system and, 165
relational model, 420
relational optimizations, 174-175
relational schema, capturing, 125-126
relationships, semantics of, in model-based mediation, 347-348
relevance
  semantic, 364-365
  source, 92-93
reliability, data provenance and, 26-27
reordering module, 260
replication approach, data, 250-251
report
  GenPept, 153-154
    creating warehouse of, 164-165
  MEDLINE, 153
repository, data, 4
research and development, revolution in, 2-3
resolution, concept integration and, 4-5
resource, biological
  list of, 397-405
  in query processing, 92-93
Resource Description Framework (RDF), 420
restriction
  access, 128
  concept, 200-201
result fuser, 261
retrieval, text, 388-389
retrieval system, 405
rewriter
  ICode, 260
  query, in KIND model-based mediator, 362
RiboWeb, 216, 403
RNA, 420
role, 193
  as concept, 353
  mapped, 208
  parameterized, 356-357
  in TAMBIS, 207-208
role hierarchy, 355-356
rule
  Icarus, 113-115
  logic
    domain map as, 354
    process map as, 359-360
  in query optimization, 96
rule-based rewriter, 258

S

sample data
  gene expression, 288
sample data mapping
  nomenclature, 295
  standardization involving, 282
  studies of, 294
sample data space, biological, 278-279
sanctioning, 203
scalability
  of DiscoveryLink, 327-329
  as implementation criterion, 379-380
  of K2 information integration system, 244-245
  P/FDM and, 271-272
  as user criterion, 383
scaling factor, 287
schema
  conceptual, 44
  in database federation, 258
  definition of, 41-42, 421
  global integration, 266
  relational, capturing, 125
  three-schema architecture, 254-257
  whole schema integration, 124-125
schema integration, 421
schema management, 67-69
schema mapping, 68
science, experimental, engineering vs., 76-77
scientific analysis program, sensitivity of, 26
scientific analysis tool in Sequence Retrieval System, 137-139
scientific object, browsing of, 100-101
scientific object identity, 84-85
search, spreadsheet, 40
search engine, Glimpse, 88
searching
  definition of, 421
  design of, 87-89
  and mining, 87-89
selection, 42
self-describing exchange format, 156
semantic browsing in model-based mediation, 344
semantic data integration, 58-60
semantic data mapping in integration of third-party expression data, 293-296
semantic heterogeneity, 212
semantic mediation, 364
semantic query optimization, 258
semantic relevance, 364-365
semantic vs. syntactic integration, 48-49
Semantic Web, 421
semantics
  application, 19
  of biological data, 5
  initial process, 357
  in model-based mediation, 347-348
  of query, 90
semi-structured data, fully structured data vs., 387-388
semi-structured information system, 82-83
semi-structured text file, advantages and disadvantages, 40-41
SeqStore, 307
sequence
  DNA or protein, resources for, 397-398
  EST, definition of, 412
sequence data source, searching against, 87
sequence folding, 404
Sequence Retrieval System (SRS), 109-144
  architecture of, 111
  automated server maintenance, 141-143

Sequence Retrieval System (SRS) (cont.)
  integrating flat file databanks, 112-116
    subentry libraries, 116
    token server, 113-115
  interfaces to, 139-141
  linking databanks, 130-133
  object loader, 133-137
    complex and nested objects, 134
    exporting objects to XML, 136-137
    links to create composite structures, 136
    support for, 135
  query language of, 129-130
  relational database integration, 124-129
    capturing relational schema, 125-126
    hub table selection, 126-127
    query performance, 128
    restricting access, 128
    SQL generation, 127
    summary of, 129
    viewing entries, 128-129
    whole schema integration, 124-125
  scientific analysis tools, 137-139
  system information for, 425
  TAMBIS and, 213
  XML database integration, 116-124
    challenge of, 122-124
    procedure for, 120-121
    support features, 121-122
    uniqueness of, 118-120
sequence similarity search, 404
sequencing, definition of, 412
server
  DiscoveryLink, 309
    query processing and, 318
  GeneExpress system on, 283
  SOAP, TAMBIS and, 214-215
  token, 113-115
server in Sequence Retrieval System, 111-112
  maintenance of, 141-143
set, definition of, 421
shorthand expression, 119-120
simple curated gene data source, 37-38
simple multiple-world scenario, 336
Simple Object Access Protocol (SOAP), 141, 421
  TAMBIS and, 214-215
simple one-world scenario, 336
simple process map, 342
simplified SQL, 148-149
simplified Structured Query Language (sSQL), 148-149, 151-152, 421
simplifier, 257
single channel gene expression microarray system, 279-281
SMART Atlas. See Spatial Markup Rendering Tool Atlas
SML. See Standard Markup Language
SNOMED. See Systematized Nomenclature of Medicine
SOAP. See Simple Object Access Protocol
software, analysis, 19
software benchmark, 374-375
source, data
  characteristics of, 17-19
  definition of, 147, 411
  gene expression data management and, 290
  in K2 information integration system, 240-242
  Kleisli query system and, 165-167
  mediator and, 349-351
  P/FDM mediator and, 265-266
  query optimization and, 97-98
  simple curated gene, 37-38
  types of, 78
  Web, 65-67
source dependent query plan, 191
source relevance, 92-93
sources and services model, 206-208
space management, 373
spatial joins, 337
Spatial Markup Rendering Tool (SMART) Atlas, 360, 362-364
specification, meta-data, 24-25
specifications
  determining, 7-8
  translating into technical approach, 8-9
splitter, query, 260
spreadsheet, 39-40
SQL. See Structured Query Language
SRS. See Sequence Retrieval System
SRS Prisma, 141-143
SRSCS, 140, 141
sSQL. See simplified Structured Query Language
stackPACK, 138
Staged Prisma, 142
Standard Markup Language (SML), definition of, 421
standard query language, 21-22
standardization
  benefits and limitations of, 281-282
  of gene names, 28
Stanford-IBM Manager of Multiple Information Sources (TSIMMIS), 24
statement, DDL, 313-314
statistical pattern in query processing, 93
statistical technique for gene expression data, 287-288
storage schema, 256
stored procedure, 422
structural data transformation in integration of third-party gene expression data, 293
structural recursion program, 162-163
structure
  composite, links to create, 136
  database, transformation of, 44
  resources of, 399
structure prediction, 404
Structured Query Language (SQL), 43, 86
  definition of, 422
  DiscoveryLink and, 311
  generation of, 127
  Kleisli query system and, 179-181
  mining and, 87
  plan generator, 362
subentry library in integrating flat file databanks, 116
subprocess map, 359
summary table, automatic, 407
survey, TAMBIS, 218
Swiss-Prot accession number, 44
SYNAPSE, 338-339
syntactic heterogeneity, 212
syntactic vs. semantic integration, 48-49
syntactical problem, SRS solution of, 123-124
synthetic approach to biology, 12
system evaluation, 9-10
system requirements
  determining, 7-8
  translating into technical approach, 8-9
Systematized Nomenclature of Medicine (SNOMED), 28, 288, 294-295, 402
systems analysis, demands of, 12
systems biology, 422

T

table
  automatic summary, 407
  hash, 321
  hub, 126-127
table expression, 422
tagged union type, 153
TAMBIS, 24, 66, 149, 189-220
  current and future developments in, 217-219
  DiscoveryLink and, 308
  extensibility of, 378-379
  information integration, 213-215
    biological ontologies, 216-217
    knowledge based, 215-216
  ontology, 192-197
  P/FDM mediator and, 267
  scalability of, 380
  semantic integration and, 60
  system information for, 426
  tools-driven technology used by, 91
  understandability of, 384
  usability of, 381
  user interface
    constructing queries, 197-202
    exploring ontology, 195-197
    query processor, 205-212
    reasoning in query formulation, 202-205
technology, gene chip microarray, 413
temporal logic, 90
term, 85

terminological frame of reference, 347
text file, semi-structured, advantages and disadvantages, 40-41
text retrieval, in system evaluation, 388-389
third-party gene expression data, integration of, 291-298
  data exchange formats for, 291-293
  data loading issues in, 296-297
  semantic data mapping issues in, 293-296
  update issues in, 297-298
three-level hierarchy, in GeneExpress system, 293
three-schema architecture, 254-257
tightly coupled system, 250
time management, 372-373
tissue resources, 401
token server, 113-115
tool
  legacy, 79-80
  scientific analysis, in Sequence Retrieval System, 137-139
tool-driven integration, 91-92
traditional database management, 80-81
traditional database system, searching and mining in, 88
transcription, 422
transcriptome, 422
transformation
  data, 5
  of database structure, 44
translation, 422
Transparent Access to Multiple Bioinformatics Information Sources. See TAMBIS
TSIMMIS. See Stanford-IBM Manager of Multiple Information Sources
tuple, 81
two channel gene expression microarray system, 279-281
two-level hierarchy in GeneExpress system, 293

U

UML. See Unified Modeling Language
UMLS ontology, 363
unanswerable query challenge, 226, 228, 229, 375-376
understandability
  as implementation criterion, 380
  as user criterion, 384
Unified Modeling Language (UML), 422
Uniform Resource Locators (URL), 423
union, 42
Universe, Sequence Retrieval System and, 110-111
update anomaly, 40
updating
  GeneExpress system, 283-284
  in integration of third-party gene expression data, 297-298
URL. See Uniform Resource Locators
usability
  as implementation criterion, 381
  as user criterion, 384-385
use case, 36-39
  combining old and new data, 68
  data federation, 68-69
  data warehousing, 68
  for integration, 45-46
  retrieving genes and associated expression results, 38-39
  simple curated gene data source, 37-38
user interface
  in K2 information integration system, 243-244
  for P/FDM, 268-271
  in TAMBIS, 220
    constructing queries, 197-202
    exploring ontology, 195-197
    query processor, 205-212
    reasoning in query formulation, 202-205
user profile, 7-8
user survey, TAMBIS, 218

V

variability, 17
  in gene expression data, 287-288
variant, definition of, 423
vector, differing meanings of, 29
vertical loop fusion, 170
view
  definition of, 423
  materialized, 416
  non-materialized, 418
view building, 68
view composition, 68
view integration, 228, 423
view provider in model-based mediation, 343-344
viewing entry from relational databank, 128-129
virtual database in DiscoveryLink, 305
visualization
  browsing scientific objects, 100-101
  multimedia data, 99-100
vocabulary
  consistent, 30
  controlled, 40

W

warehousing, 21-22
  definition of, 411
  DiscoveryLink and, 307-308
  example of, 52-54
  federation vs., 49
  gene expression data management and, 290
  GeneExpress system and, 283
  in K2 system, 229
  in Kleisli query system, 163-165
  strengths and weaknesses of, 62-63
  use case, 68
Web data source, 65-67
Web interface
  for P/FDM, 268-269, 270
  to Sequence Retrieval System, 139
Web presentation, 30-31
Web services, 141
webomim-get-detail function in Kleisli system, 166-167
whole schema integration, 124-125
window, explorer, in TAMBIS, 195-197
withdrawn HUGO name, 84-85
workflow
  biological tools and, 80
  definition of, 423-424
World Wide Web, 30-31
  data sources on, 6, 17-18
wrapped sources, 191
wrapper, 23, 49-50
  BLAST, 315
  in database federation, 260-261
  definition of, 424
  DiscoveryLink, 56, 308, 310-311
    cost of query processing and, 322-326
    registration and, 313-316
  TAMBIS, 211-212

X

XA, 423
XML. See extensible markup language
XPath, 90
XQuery, 90, 423

