1.2 Objectives of Information Retrieval Systems
1.3 Functional Overview
1.4 Relationship to Database Management Systems
1.5 Digital Libraries and Data Warehouses
1.6 Summary
This chapter defines an Information Storage and Retrieval System (called an Information Retrieval System for brevity) and differentiates between information retrieval and database management systems. Tied closely to the definition of an Information Retrieval System are the system objectives. Satisfying those objectives determines which areas receive the most attention in development. For example, academia pursues all aspects of information systems, investigating new theories, algorithms and heuristics to advance the knowledge base. Academia does not worry about response time, the resources required to implement a system that supports thousands of users, or the operations and maintenance costs associated with system delivery. Commercial institutions, on the other hand, are not always concerned with the optimum theoretical approach, but rather with the approach that minimizes development costs and increases the salability of their product. This text considers both viewpoints and technology states: throughout, information retrieval is viewed from both the theoretical and the practical viewpoint.

The functional view of an Information Retrieval System is introduced to put into perspective the technical areas discussed in later chapters. As detailed algorithms and architectures are discussed, they are viewed as subfunctions within a total system. They are also correlated to the major objective of an Information Retrieval System, which is to minimize the human resources required to find the information needed to accomplish a task. As with any discipline, standard measures are identified to compare the value of different algorithms. In information systems, precision and recall are the key metrics used in evaluations. Introducing these concepts early in this chapter will help the reader understand the utility of the detailed algorithms and theory introduced throughout this text.

Database Management Systems (DBMS) and Information Retrieval Systems are easily confused. In particular, it is easy to confuse the software that optimizes functional support of each type of system with the actual information or structured data being stored and manipulated. The differences matter because a database management system cannot provide the functions needed to process “information.” The opposite case, an information system containing structured data, also suffers major functional deficiencies. These differences are discussed in detail in Section 1.4.
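The precision and recall measures named in this overview can be illustrated with a small calculation. The sketch below assumes the standard definitions (precision is the fraction of retrieved items that are relevant; recall is the fraction of relevant items that are retrieved), which this chapter has not yet stated formally. The function name `precision_recall` and the item identifiers are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the result of a single query.

    retrieved: identifiers of the items the system returned
    relevant:  identifiers of the items that actually satisfy the need
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # items that are both retrieved and relevant

    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 items returned, 5 items truly relevant,
# 2 items in common.
retrieved_items = ["d1", "d2", "d3", "d4"]
relevant_items = ["d2", "d4", "d7", "d8", "d9"]

p, r = precision_recall(retrieved_items, relevant_items)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

In this example half of what the user must read is useful (precision 0.50), while 60 percent of the useful items were never shown (recall 0.40). Both figures bear on the human effort that the system is meant to minimize.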
1.1 Definition of Information Retrieval System
An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video and other multimedia objects. Although the objects in an Information Retrieval System take diverse forms, text has been the only data type that lends itself to full functional processing. The other data types have been treated as highly informative sources, but are primarily linked for retrieval based upon search of the text. Techniques are beginning to emerge to search these other media types (e.g., EXCALIBUR’s Visual RetrievalWare, VIRAGE video indexer). The focus of this book is on research and implementation of search, retrieval and representation of textual and multimedia sources. Commercial development of pattern matching against other data types is starting to be a common function integrated within the total information system. In some systems the text may only be an identifier to display another associated data type that holds the substantive information desired by the system’s users (e.g., using closed captioning to locate video of interest).

The term “user” in this book represents an end user of the information system who has minimal knowledge of computers and technical fields in general.

The term “item” is used to represent the smallest complete unit that is processed and manipulated by the system. The definition of item varies by how a specific source treats information. A complete document, such as a book, newspaper or magazine, could be an item. At other times each chapter or article may be defined as an item. As sources vary and systems include more complex processing, an item may address even lower levels of abstraction, such as a contiguous passage of text or a paragraph. For readability, throughout this book the terms “item” and “document” are not in this rigorous definition, but used