
Managing Multimedia Semantics

Uma Srinivasan, CSIRO ICT Centre, Australia
Surya Nepal, CSIRO ICT Centre, Australia

Publisher of innovative scholarly and professional information technology titles in the cyberage

IRM Press

Hershey London Melbourne Singapore

Acquisitions Editor: Rene Davies
Development Editor: Kristin Roth
Senior Managing Editor: Amanda Appicello
Managing Editor: Jennifer Neidig
Copy Editor: Michael Jaquish
Typesetter: Jennifer Neidig
Cover Design: Lisa Tosheff
Printed at: Integrated Book Technology

Published in the United States of America by IRM Press (an imprint of Idea Group Inc.)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033-1240
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@idea-group.com
Web site: http://www.irm-press.com

and in the United Kingdom by IRM Press (an imprint of Idea Group Inc.)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 3313
Web site: http://www.eurospan.co.uk

Copyright 2005 by Idea Group Inc. All rights reserved. No part of this book may be reproduced, stored or distributed in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.

Product or company names used in this book are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Managing multimedia semantics / Uma Srinivasan and Surya Nepal, editors.
p. cm.
Summary: "This book is aimed at researchers and practitioners involved in designing and managing complex multimedia information systems"--Provided by publisher.
Includes bibliographical references and index.
ISBN 1-59140-569-6 (h/c) -- ISBN 1-59140-542-4 (s/c) -- ISBN 1-59140-543-2 (ebook)
1. Multimedia systems. I. Srinivasan, Uma, 1948- II. Nepal, Surya, 1970-
QA76.575.M3153 2005
006.7--dc22
2004029850

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not necessarily of the publisher.

Managing Multimedia Semantics


Table of Contents
Preface .......... vi

SECTION 1: SEMANTIC INDEXING AND RETRIEVAL OF IMAGES

Chapter 1
Toward Semantically Meaningful Feature Spaces for Efficient Indexing in Large Image Databases .......... 1
Anne H.H. Ngu, Texas State University, USA
Jialie Shen, The University of New South Wales, Australia
John Shepherd, The University of New South Wales, Australia

Chapter 2
From Classification to Retrieval: Exploiting Pattern Classifiers in Semantic Image Indexing and Retrieval .......... 30
Joo-Hwee Lim, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia

Chapter 3
Self-Supervised Learning Based on Discriminative Nonlinear Features and Its Applications for Pattern Classification .......... 52
Qi Tian, University of Texas at San Antonio, USA
Ying Wu, Northwestern University, USA
Jie Yu, University of Texas at San Antonio, USA
Thomas S. Huang, University of Illinois, USA

SECTION 2: AUDIO AND VIDEO SEMANTICS: MODELS AND STANDARDS

Chapter 4
Context-Based Interpretation and Indexing of Video Data .......... 77
Ankush Mittal, IIT Roorkee, India
Cheong Loong Fah, The National University of Singapore, Singapore
Ashraf A. Kassim, The National University of Singapore, Singapore
Krishnan V. Pagalthivarthi, IIT Delhi, India

Chapter 5
Content-Based Music Summarization and Classification .......... 99
Changsheng Xu, Institute for Infocomm Research, Singapore
Xi Shao, Institute for Infocomm Research, Singapore
Namunu C. Maddage, Institute for Infocomm Research, Singapore
Jesse S. Jin, The University of Newcastle, Australia
Qi Tian, Institute for Infocomm Research, Singapore

Chapter 6
A Multidimensional Approach for Describing Video Semantics .......... 135
Uma Srinivasan, CSIRO ICT Centre, Australia
Surya Nepal, CSIRO ICT Centre, Australia

Chapter 7
Continuous Media Web: Hyperlinking, Search and Retrieval of Time-Continuous Data on the Web .......... 160
Silvia Pfeiffer, CSIRO ICT Centre, Australia
Conrad Parker, CSIRO ICT Centre, Australia
Andre Pang, CSIRO ICT Centre, Australia

Chapter 8
Management of Multimedia Semantics Using MPEG-7 .......... 182
Uma Srinivasan, CSIRO ICT Centre, Australia
Ajay Divakaran, Mitsubishi Electric Research Laboratories, USA

SECTION 3: USER-CENTRIC APPROACH TO MANAGE SEMANTICS

Chapter 9
Visualization, Estimation and User Modeling for Interactive Browsing of Personal Photo Libraries .......... 193
Qi Tian, University of Texas at San Antonio, USA
Baback Moghaddam, Mitsubishi Electric Research Laboratories, USA
Neal Lesh, Mitsubishi Electric Research Laboratories, USA
Chia Shen, Mitsubishi Electric Research Laboratories, USA
Thomas S. Huang, University of Illinois, USA

Chapter 10
Multimedia Authoring: Human-Computer Partnership for Harvesting Metadata from the Right Sources .......... 223
Brett Adams, Curtin University of Technology, Australia
Svetha Venkatesh, Curtin University of Technology, Australia

Chapter 11
MM4U: A Framework for Creating Personalized Multimedia Content .......... 246
Ansgar Scherp, OFFIS Research Institute, Germany
Susanne Boll, University of Oldenburg, Germany

Chapter 12
The Role of Relevance Feedback in Managing Multimedia Semantics: A Survey .......... 288
Samar Zutshi, Monash University, Australia
Campbell Wilson, Monash University, Australia
Shonali Krishnaswamy, Monash University, Australia
Bala Srinivasan, Monash University, Australia

SECTION 4: MANAGING DISTRIBUTED MULTIMEDIA

Chapter 13
EMMO: Tradeable Units of Knowledge-Enriched Multimedia Content .......... 305
Utz Westermann, University of Vienna, Austria
Sonja Zillner, University of Vienna, Austria
Karin Schellner, ARC Research Studio Digital Memory Engineering, Vienna, Austria
Wolfgang Klaus, University of Vienna and ARC Research Studio Digital Memory Engineering, Vienna, Austria

Chapter 14
Semantically Driven Multimedia Querying and Presentation .......... 333
Isabel F. Cruz, University of Illinois, Chicago, USA
Olga Sayenko, University of Illinois, Chicago, USA

SECTION 5: EMERGENT SEMANTICS

Chapter 15
Emergent Semantics: An Overview .......... 351
Viranga Ratnaike, Monash University, Australia
Bala Srinivasan, Monash University, Australia
Surya Nepal, CSIRO ICT Centre, Australia

Chapter 16
Emergent Semantics from Media Blending .......... 363
Edward Altman, Institute for Infocomm Research, Singapore
Lonce Wyse, Institute for Infocomm Research, Singapore

Glossary .......... 391
About the Authors .......... 396
Index .......... 406


Preface

Today most documented information is in digital form. Digital information, in turn, is rapidly moving from textual information to multimedia information that includes images, audio and video content. Yet searching and retrieving required information is a challenging and arduous task, because it is difficult to access just the required parts of information stored in a database. In the case of text documents, the table of contents serves as an index to different sections of the document. However, creating a similar index that points to different parts of multimedia content is not an easy task. Manual indexing of audiovisual content can be subjective, as there are several ways to describe the multimedia information depending on the user, the purpose of use, and the task that needs to be performed. The problem gets even murkier, as the purpose for retrieval is often completely different from the purpose for which the content was created, annotated and stored in a database.

Work in the area of multimedia information retrieval started with techniques that could automatically index the content based on some inherent features that could be extracted from one medium at a time. For example, features that can be extracted from still images are colour, texture and shape of objects represented in the image. In the case of a video, static features such as colour, texture and shape are no longer adequate to index visual content that has been created using powerful film editing techniques that can shape viewers' experiences. For audio, the types of features that can be extracted are pitch, tonality, harmonicity, and so forth, which are quite distinct from visual features. Feature extraction and classification techniques draw from a number of disciplines such as artificial intelligence, vision and pattern recognition, and signal processing. While automatic feature extraction does offer some objective measures to index the content of an image, it is insufficient for the retrieval task, as information retrieval is based on the rich semantic notions that humans can conjecture in their minds while retrieving audiovisual information.

The other alternative is to index multimedia information using textual descriptions. But this has the problem of subjectivity, as it is hard to have a generic way to first describe and then retrieve semantic information that is universally acceptable. This is inevitable as users interpret semantics associated with the multimedia content in so many different ways, depending on the context and use of the information. This leads to the problem of managing multiple semantics associated with the same material. Nevertheless, the need to retrieve multimedia information grows inexorably, carrying with it the need to have tools that can facilitate search and retrieval of multimedia content at a semantic or a conceptual level to meet the varying needs of different users. There are numerous conferences that are still addressing this problem.

Managing multimedia semantics is a complex task and continues to be an active research area that is of interest to different disciplines. Individual papers on multimedia semantics can be found in many journals and conference proceedings. Meersman, Tari and Stevens (1999) present a compilation of works that were presented at the IFIP Data Semantics Working Conference held in New Zealand. The working group focused on issues that dealt with semantics of the information represented, stored and manipulated by multimedia systems. The topics covered in that book include data modeling and query languages for multimedia; methodological aspects of multimedia database design; information retrieval; knowledge discovery and mining; and multimedia user interfaces. The book covers six main thematic areas: video data modeling and use; image databases; applications of multimedia systems; multimedia modeling; multimedia information retrieval; and semantics and metadata. It offers a good glimpse of the issues that need to be addressed from an information systems design perspective. Here semantics is addressed from the point of view of querying and retrieving multimedia information from databases.

In order to retrieve multimedia information more effectively, we need to go deeper into the content and exploit results from the vision community, where the focus has been on understanding inherent digital signal characteristics that could offer insights into semantics situated within the visual content. This aspect is addressed in Bimbo (1999), where the focus is mainly on visual feature extraction techniques used for content-based retrieval of images. The topics discussed are image retrieval by colour similarity, image retrieval by texture similarity, image retrieval by shape similarity, image retrieval by spatial relationships, and finally one chapter on content-based video retrieval. The focus here is on low-level feature-based content retrieval. Although several algorithms have been developed for detecting low-level features, the multimedia community has realised that content-based retrieval (CBR) research has to go beyond low-level feature extraction techniques. We need the ability to retrieve content at more abstract levels, the levels at which humans view multimedia information. The vision research then moved on from low-level feature extraction in still images to segment extraction in videos. Semantics becomes an important issue when identifying what constitutes a meaningful segment. This shifts the focus from image and video analysis (of single features) to synthesis of multiple features and relationships to extract more complex information from videos. This idea is further developed in Dorai and Venkatesh (2002), where the theme is to derive high-level semantic constructs from automatic analysis of media. That book uses media production and principles of film theory as the bases to extract higher-level semantics in order to index video content. The main chapters include applied media aesthetics, space-time mappings, film tempo, modeling colour dynamics, scene determination using auditive segmentation, and determining effective events.
In spite of the realisation within the research community that multimedia research needs to be enhanced with semantics, research output has been discipline-based. Therefore, there is no single source that presents all the issues associated with modeling, representing and managing multimedia semantics in order to facilitate information retrieval at a semantic level desired by the user. More importantly, research has progressed by handling one medium at a time. At the user level, we do know that multimedia information is not just a collection of monomedia types. Although each media type has its own inherent properties, multimedia information has a coherence that can only be perceived if we take a holistic approach to managing multimedia semantics. It is our hope that this book fills this gap by addressing, from an application perspective, the whole spectrum of problems involved in managing multimedia semantics in a way that adds value to the user community.

OUR APPROACH TO ADDRESS THIS CHALLENGE


The objective of the book Managing Multimedia Semantics is to assemble in one comprehensive volume the research problems, theoretical frameworks, tools and technologies that contribute towards managing multimedia semantics. The complexity of managing multimedia semantics has given rise to many frameworks, models, standards and solutions. The book aims to highlight both current techniques and future trends in managing multimedia semantics. We systematically define the problem of multimedia semantics and present approaches that help to model, represent and manage multimedia content, so that information systems deliver the promise of providing access to the rich content held in the vaults of multimedia archives. We include topics from different disciplines that contribute to this field and synthesise the efforts towards addressing this complex problem. It is our hope that the technologies described in the book could lead to the development of new tools to facilitate search and retrieval of multimedia content at a semantic or a conceptual level to meet the varying needs of the user community.

ORGANISATION OF THIS BOOK


The book takes a close look at each piece of the puzzle that is required to address the multimedia semantic problem. The book contains 16 chapters organised under five sections. Each section addresses a major theme or topic that is relevant for managing multimedia semantics. Within a section, each chapter addresses a unique research or technology issue that is essential to deliver tools and technologies to manage the multimedia semantics problem.

Section 1: Semantic Indexing and Retrieval of Images


Chapters 1, 2 and 3 deal with semantic indexing, classification and retrieval techniques related to images. Chapter 1 describes a feature-based indexing technique that uses low-level feature vectors to index and retrieve images from a database. The interesting aspect of the architecture here is that the feature vector carries some semantic properties of the image along with low-level visual properties. This is moving one step towards semantic indexing of images using low-level feature vectors that carry image semantics.


Chapter 2 addresses the semantic gap that exists between a user's query and low-level visual features that can be extracted from an image. This chapter presents a state-of-the-art review of pattern classifiers in content-based image retrieval systems, and then extends these ideas from pattern recognition to object recognition. The chapter presents three new indexing schemes that exploit pattern classifiers for semantic indexing. Chapter 3 takes the next step in the object recognition problem, and proposes a self-supervised learning algorithm called KDEM (Kernel Discriminant-EM) to speed up semantic classification and recognition problems. The algorithms are tested for image classification, hand posture recognition and fingertip tracking. We then move on from image indexing to context-based interpretation and indexing of videos.

Section 2: Audio and Video Semantics: Models and Standards


Chapter 4 describes the characterisation of video data using the temporal behaviour of features, using context provided by the application domain in the situation of a shot. A framework based on Dynamic Bayesian Networks is presented to position the video segment within an application and provide an interpretation within that context. The framework learns the temporal structure through the fusion of all features, and removes the cumbersome task of manually designing a rule-based system for providing the high-level interpretation.

Chapter 5 moves on to audio and presents a comprehensive survey of content-based music summarisation and classification. This chapter describes techniques used in audio feature extraction, music representation, and summarisation for both audio and music videos. The chapter further identifies emerging areas in genre classification, determining song structure, rhythm extraction, and semantic region extraction in music signals.

Chapter 6 takes a holistic approach to video semantics, presenting a multidimensional model for describing and representing video semantics at several levels of abstraction from the perceptual to more abstract levels. The video metamodel VIMET supports incremental description of semantics, and presents a framework that is generic and not definitive, while still supporting the development of application-specific semantics that exploit feature-based retrieval techniques. Although the chapter addresses video semantics, it provides a nice framework that encompasses several aspects of multimedia semantics.

Chapter 7 presents the Continuous Media Web, an approach that enables the searching of time-continuous media such as audio and video using extensions to standard Web-based browsing tools and technology. In particular, the chapter presents the Annodex file format that enables the creation of webs of audio and video documents using the Continuous Media Markup Language (CMML). Annodex extends the idea of surfing the web of text documents to an integrated approach of searching, surfing and managing the web of text and media resources.

Chapter 8 examines the role of the new MPEG-7 standard in facilitating the management of multimedia semantics. This chapter presents an overview of the MPEG-7 Content Description Interface and examines the Description Schemes (DSs) and Descriptors (Ds) that address multimedia semantics at several levels of granularity and abstraction. The chapter presents a discussion on application development using MPEG-7 descriptions. Finally, the chapter discusses some strengths and weaknesses of the standard in addressing multimedia semantics.

Section 3: User-Centric Approach to Manage Semantics


Chapters 9, 10, 11 and 12 move away from a media-centric approach and take a user-centric perspective while creating and interacting with multimedia content. Chapter 9 presents a user-centric algorithm for visualisation and layout for content-based image retrieval from a large photo library. The framework facilitates an intuitive visualisation that adapts to the user's time-varying notions of content, context and preferences in navigation and style. The interface is designed as a touch-sensitive, circular table-top display, which is being used in the Personal Digital Historian project that enables interactive exploratory storytelling.

Chapter 10 deals with a holistic approach to multimedia authoring and advances the idea of creating multimedia authoring tools for the amateur media creator. The chapter proposes that in order to understand media semantics, the media author needs to address a number of issues. These involve a deep understanding of the media creation process; knowledge of the deeper structures of content; and the surface manifestations in the media within an application domain. The chapter explores software and human interactions in the context of implementing a multimedia authoring tool in a target domain and presents a future outlook on multimedia authoring.

Chapter 11 presents MM4U, a software framework to support the dynamic composition and authoring of personalised multimedia content. It focuses on how to assemble and deliver multimedia content personalised to reflect the user's context, specific background, interest and knowledge, as well as the physical infrastructure conditions. Further, the application of the MM4U framework is illustrated through the implementation of two applications: a personalised city guide delivered on a mobile device, and a personalised sports ticker application that combines multimedia events (audio, video and text-based metadata) to compose a coherent multimedia application delivered on the preferred device.

Chapter 12 considers the role of the mature relevance feedback technology, which is normally used for text retrieval, and examines its applicability for multimedia retrieval. The chapter surveys a number of techniques used to implement relevance feedback while including the human in the loop during information retrieval. An analysis of these techniques is used to develop the requirements of a relevance feedback technique that can be applied for semantic multimedia retrieval. The requirements analysis is used to develop a user-centric framework for relevance feedback in the context of multimedia information retrieval.

Section 4: Managing Distributed Multimedia


Chapters 13 and 14 explore multimedia content retrieval and presentation in a distributed environment. Chapter 13 addresses the problem that occurs due to the separation of content from its description and functionality while exchanging or sharing content in a collaborative multimedia application environment. The chapter proposes a content modeling formalism based on enhanced multimedia metaobjects (Emmo) that can be exchanged in their entirety, covering the media aspect, the semantic aspect and the functional aspect of the multimedia content. The chapter further outlines a distributed infrastructure and describes two applications that use Emmo for managing multimedia objects in a collaborative application environment.

Chapter 14 shows how even a limited description of a multimedia object can add semantic value in the retrieval and presentation of multimedia. The chapter describes a framework, DelaunayView, that supports distributed and heterogeneous multimedia sources based on a semantically driven approach for the selection and presentation of multimedia content. The system architecture is composed of presentation, integration and data layers, and its implementation is illustrated with a case study.

Section 5: Emergent Semantics


The next two chapters explore an emerging research area, emergent semantics, in which multimedia semantics emerges and evolves dynamically in response to unanticipated situations, context and user interaction. Chapter 15 presents an overview of emergent semantics. Emergence is the phenomenon of complex structures arising from interactions between simple units. Emergent semantics is a symbiosis of several research areas and explores experiential computing as a way for users to interact with the system at a semantic level without having to build a mental model of the environment.

Chapter 16 provides a practical foundation to this emerging research area. It explores the computation of emergent semantics from integrative structures that blend media into creative compositions in the context of other media and user interaction with the media as they deal with the semantics embedded within the media. The chapter presents a media blending framework that empowers the media producer to create complex new media assets by leveraging control over emergent semantics derived from media blends. The blending framework for discovering emergent semantics uses ontologies that provide a shared description of the framework, operators to manage the computation models, and an integration mechanism to enable the user to discover emergent structures in the media.

CONCLUDING REMARKS
In spite of the large research output in the area of multimedia content analysis and management, current state-of-the-art technology offers very little by way of managing semantics that is applicable for a range of applications and users. Semantics has to be inherent in the technology rather than an external factor introduced as an afterthought. Situated and contextual factors need to be taken into account in order to integrate semantics into the technology. This leads to the notion of emergent semantics, which is user-centered rather than reliant on technology-driven methods to extract latent semantics. Automatic methods for semantic extraction tend to presuppose that semantics is static, which is counterintuitive to the natural way semantics evolves. Other interactive technologies and developments in the area of the Semantic Web also address this problem. In the future, we hope to see the convergence of different technologies and research disciplines in addressing the multimedia semantics problem from a user-centric perspective.


REFERENCES

Bimbo, A. D. (1999). Visual information retrieval. San Francisco: Morgan Kaufmann.

Dorai, C., & Venkatesh, S. (2002). Computational media aesthetics. Boston: Kluwer Academic Publishers.

Meersman, R., Tari, Z., & Stevens, S. (1999, January 4-8). Database semantics: Semantic issues in multimedia systems. IFIP TC2/WG2.6 Eighth Working Conference on Database Semantics (DS-8), Rotorua, New Zealand.


Acknowledgments

The editors would like to acknowledge the help of a number of people who contributed in various ways, without whose support this book could not have been published in its current form. Special thanks go to all the staff at Idea Group, who participated from the inception of the initial idea to the final publication of the book. In particular, we acknowledge the efforts of Michele Rossi, Jan Travers and Mehdi Khosrow-Pour for their continuous support during the project. No book of this nature is possible without the commitment of the authors. We wish to offer our heartfelt thanks to all the authors for their excellent contributions to this book, and for their patience as we went through the revisions. The completion of this book would have been impossible without their dedication. Most of the authors of chapters also served as referees for chapters written by other authors, and they deserve a special note of thanks. We also would like to acknowledge the efforts of other external reviewers: Zahar Al Aghbhari, Saied Tahaghoghi, A.V. Ratnaike, Timo Volkner, Mingfang Wu, Claudia Schremmer, Santha Sumanasekara, Vincent Oria, Brigitte Kerherve, and Natalie Colineau. Last but not least, we would like to thank CSIRO (Commonwealth Scientific and Industrial Research Organization) and the support from the Commercial group, in particular Pamela Steele, in managing the commercial arrangements and letting us get on with the technical content. Finally, we wish to thank our families for their love and support throughout the project.

Uma Srinivasan and Surya Nepal
CSIRO ICT Centre, Sydney, Australia
September 2004

Section 1 Semantic Indexing and Retrieval of Images


Chapter 1

Toward Semantically Meaningful Feature Spaces for Efficient Indexing in Large Image Databases
Anne H.H. Ngu, Texas State University, USA Jialie Shen, The University of New South Wales, Australia John Shepherd, The University of New South Wales, Australia

ABSTRACT

The optimized distance-based access methods currently available for multimedia databases are based on two major assumptions: a suitable distance function is known a priori, and the dimensionality of image features is low. The standard approach to building image databases is to represent images via vectors based on low-level visual features and make retrieval based on these vectors. However, due to the large gap between the semantic notions and low-level visual content, it is extremely difficult to define a distance function that accurately captures the similarity of images as perceived by humans. Furthermore, popular dimension reduction methods suffer from either the inability to capture the nonlinear correlations among raw data or very expensive training cost. To address the problems, in this chapter we introduce a new indexing technique called Combining Multiple Visual Features (CMVF) that integrates multiple visual features to get better query effectiveness. Our approach is able to produce low-dimensional image feature vectors that include not only low-level visual properties but also high-level semantic properties. The hybrid architecture can produce feature vectors that capture the salient properties of images yet are small enough to allow the use of existing high-dimensional indexing methods to provide efficient and effective retrieval.

INTRODUCTION
With advances in information technology, there is an ever-growing volume of multimedia information from emerging application domains such as digital libraries, the World Wide Web, and Geographical Information Systems (GIS) available online. However, effective indexing and navigation of large image databases still remains one of the main challenges for modern computer systems. Currently, intelligent image retrieval systems are mostly similarity-based. The idea of indexing an image database is to extract the features (usually in the form of a vector) from each image in the database and then to transform features into multidimensional points. Thus, searching for similarity between objects can be treated as a search for close points in this feature space and the distance between multidimensional points is frequently used as a measurement of similarity between the two corresponding image objects. To efficiently support this kind of retrieval, various kinds of novel access methods such as Spatial Access Methods (SAMs) and metric trees have been proposed. Typical examples of SAMs include the SS-tree (White & Jain, 1996), R+-tree (Sellis, 1987) and grid files (Faloutsos, 1994); for metric trees, examples include the vp-tree (Chiueh, 1994), mvp-tree (Bozkaya & Ozsoyoglu, 1997), GNAT (Brin, 1995) and M-tree (Ciaccia, 1997).

While these methods are effective in some specialized image database applications, many open problems in image indexing still remain. Firstly, typical image feature vectors are high dimensional (e.g., some image feature vectors can have up to 100 dimensions). Since the existing access methods have an exponential time and space complexity as the number of dimensions increases, for indexing high-dimensional vectors, they are no better than sequential scanning of the database. This is the well-known dimensionality curse problem. For instance, methods based on R-trees can be efficient if the fan-out of the R-tree nodes remains greater than two and the number of dimensions is under five. The search time with linear quad trees is proportional to the size of the hypersurface of the query region that grows with the number of dimensions. With grid files, the search time depends on the directory whose size also grows with the number of dimensions. Secondly, there is a large semantic gap existing between low-level media representation and high-level concepts such as person, building, sky, landscape, and so forth. In fact, while the extraction of visual content from digital images has a long history, it has so far proved extremely difficult to determine how to use such features to effectively represent high-level semantics. This is because similarity in low-level visual features may not correspond to high-level semantic similarity. Moreover, human beings perceive and identify images by integrating different kinds of visual features in a nonlinear way. This implies that assuming each type of visual feature contributes equally to the recognition of the images is not supported in the human perceptual system and an efficient content-based image retrieval system cannot be achieved by considering independent simple visual features.

In terms of developing indexing methods for effective similarity searching in a large image repository, we are faced with the problem of producing a composite feature vector that accurately mimics human visual perception. Although many research works have claimed to support queries on composite features by combining different features into an integrated index structure, very few of them explain how the integration is implemented. There are two main problems that need to be addressed here. The first one is that the integrated features (or composite features) typically generate a very high-dimensional feature space, which cannot be handled efficiently by the existing access methods. The other problem is the discovery of image similarity measures that reflect semantic similarity at a high level. There are two approaches to solving the indexing problem. The first approach is to develop a new spatial index method that can handle data of any dimension and employ a k-nearest neighbor (k-NN) search. The second approach is to map the raw feature space into a reduced space so that an existing access method can be applied. Creating a generalized high-dimensional index that can handle hundreds of dimensions is still an unsolved problem. The second approach is clearly more practical. In this chapter, we focus on how to generate a small but semantically meaningful feature vector so that effective indexing structures can be constructed.

The second problem is how to use low-level media properties to represent high-level semantic similarity. In the human perceptual process, the various visual contents in an image are not weighted equally for image identification. In other words, the human visual system has different responses to color, texture and shape information in an image. When the feature vectors extracted from an image represent these visual features, the similarity measure for each feature type between the query image and an image in the database is typically computed by a Euclidean distance function. The similarity measure between the two images is then expressed as a linear combination of the similarity measures of all the feature types. The question that remains here is whether a linear combination of the similarity measures of all the feature types best reflects how we perceive images as similar. So far, no experiments have been conducted to verify this belief.

The main contribution of this work is in building a novel dimension reduction scheme, called CMVF (Combining Multiple Visual Features), for effective indexing in large image databases. The scheme is designed based on the observation that humans use multiple kinds of visual features to identify and classify images via a robust and efficient learning process. The objective of the CMVF scheme is to mimic this process in such a way as to produce relatively small feature vectors that incorporate multiple features and that can be used to effectively discriminate between images, thus providing both efficient (small vectors) and effective (good discrimination) retrieval. The core of the work is to use a hybrid method that incorporates PCA and neural network technology to reduce the size of composite image features (nonlinear in nature) so that they can be used with an existing distance-based index structure without any performance penalty.
On the other hand, improved retrieval effectiveness can, in principle, be achieved by compressing more discriminating information (i.e., integrating more visual features) into the final vector. Thus, in this chapter, we also investigate precisely how much improvement in
retrieval effectiveness is obtained as more visual features are incorporated. Furthermore, humans are capable of correctly identifying and classifying images, even in the presence of moderate amounts of distortion. Since CMVF is being trained to classify images, this suggests that if we were to train it using not only the original image, but also distorted versions of that image, it might be more robust in recognizing minor variations of the image in the future. Another aspect of robustness in CMVF is how much it is affected by the initial configuration of the neural network. In this chapter, the robustness of CMVF in these two contexts is also investigated.

BACKGROUND
Image Feature Dimension Reduction
Trying to implement computer systems that mimic how the human visual system processes images is a very difficult task, because humans use different features to identify and classify images in different contexts, and do not give equal weight to various features even within a single context.

This observation suggests that an effective content-based image retrieval system cannot be achieved by considering only a single type of feature and cannot be achieved by considering only visual content, without taking account of human perception. The first of these suggests multiple image features are required; the second suggests that semantic features, based on manual classification of images, are also required. However, creating an index based on a composite feature vector will typically result in a very high-dimensional feature space, rendering all existing indexing methods useless. At the same time, a simple linear combination of different feature types cannot precisely reflect how human beings perceive images as similar. The natural and practical solution to these problems lies in discovering a dimension reduction technique, which can fuse multiple visual content features into a composite feature vector that is low in dimensions and yet preserves all human-relevant information for image retrieval. There has been considerable research work on dimension reduction for image feature vectors. This work can be classified into two general categories: linear dimension reduction (LDR) and nonlinear dimension reduction (NLDR). Typical examples of LDR include SVD and PCA (Fukunaga & Koontz, 1970; Kittler & Young, 1973). These approaches assume that the variance of data can be accounted for by a small number of eigenvalues. Thus, LDR works well only for data that exhibits some linear correlation. However, if the data exhibits some nonlinear correlation, the dimension reduction via LDR causes significant loss in distance information, which results in less effective query processing. Due to the complexity of image features, better query effectiveness can be achieved by using nonlinear dimension reduction. The basis of NLDR is the standard nonlinear regression analysis as used in the neural network approach, which has been widely studied in recent years. Systems based on NLDR can maintain a great deal of knowledge about distance information in the original data source. The information can
be represented as neural network weights between units in successive layers. NLDR typically performs better than LDR in handling feature vectors for image data. The only drawback of NLDR is that it requires a training process, which can be time consuming.

Image Similarity Measurement


A major task in content-based retrieval is to find the most similar images from a multimedia database with respect to a query object (image). Various kinds of features can be used for specifying query objects, including descriptive concepts (keywords) and numerical specification (color, texture and shape). The feature vectors (mainly numerical) for the given query object are usually derived using basic image processing techniques such as segmentation and feature extraction. Calculating the similarity between a query object and an object in the multimedia database is reduced to computing the distance between two feature vectors. However, current research has been focused on finding a similarity function that corresponds only to a single feature (e.g., color information only). That is, only simple queries, such as how similar two images are in terms of color, are well supported. A typical example is the work carried out by Bozkaya and Ozsoyoglu (1997). In their work, the similarity measure of a pair of images based on composite feature vectors described by color and texture was proposed as a linear combination of the similarity measure of the individual single feature vector. Their proposal can be detailed as follows: Let {x_c, x_t} and {y_c, y_t} be the color and texture feature vectors that fully describe two images X and Y; then the similarity measure of images X and Y, denoted as S(X, Y), is given by

$$S = w_c S_c + w_t S_t \qquad (1)$$

where the Sc and St are the color and texture similarity functions respectively; wc and wt are weighting factors. However, the criteria for selecting these weighting factors are not mentioned in their research work. From the statistics viewpoint, by treating the weighting factors as normalization factors, the definition is just a natural extension of the Euclidean distance function to a high-dimensional space in which the coordinate axes are not commensurable. The question that remains to be answered is whether a Euclidean distance function for similarity measures best correlates with the human perceptual process for image recognition. That is, when humans perceive two images as similar, can a distance function given in the form in Equation 1 be defined? Does this same function hold for another pair of images that are also perceived as similar? So far, no experiments have been conducted that demonstrate (or counter-demonstrate) whether linear combinations of different image features are valid similarity measures based on human visual perception. Also, the importance of designing a distance function that mimics human perception to approximate a perceptual weight of various visual features has not been attempted before. Thus, incorporating human visual perception into image similarity measurement is the other major motivation behind our work.
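To make Equation 1 concrete, the sketch below computes per-feature Euclidean distances between the colour and texture vectors of two images, maps them to similarity scores, and combines them with fixed weights. It is a minimal illustration of the linear-combination scheme under discussion, not the implementation used by Bozkaya and Ozsoyoglu; the weight values and the distance-to-similarity mapping are arbitrary assumptions.

```python
import numpy as np

def euclidean_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Map Euclidean distance to a similarity score in (0, 1]."""
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def combined_similarity(x_color, x_texture, y_color, y_texture,
                        w_color=0.6, w_texture=0.4) -> float:
    """Linear combination S = w_c * S_c + w_t * S_t (Equation 1).
    The weights here are illustrative; as noted above, no principled
    criteria for choosing them are given in prior work."""
    s_color = euclidean_similarity(x_color, y_color)
    s_texture = euclidean_similarity(x_texture, y_texture)
    return w_color * s_color + w_texture * s_texture

# Toy example with random 37-dim color and 30-dim texture vectors.
rng = np.random.default_rng(0)
x_c, y_c = rng.random(37), rng.random(37)
x_t, y_t = rng.random(30), rng.random(30)
print(combined_similarity(x_c, x_t, y_c, y_t))
```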


Distance-Based Access Methods


To efficiently support query processing in multidimensional feature space, several spatial access methods (SAMs) have been proposed. These methods can be broadly classified into the following types: point access methods and rectangle access methods. The point quad-tree, which was first proposed in Finkel (1974), is an example of a point access method. To handle complex objects, such as circles, polygons and any undefined irregularly shaped objects, minimum bounding rectangles (MBRs) have been used to approximate the representations of these objects; thus, the name rectangle access method. The K-D-B tree (Robinson, 1981) and R+-tree (Sellis, 1987) are typical examples. However, the applicability of SAMs is limited by two assumptions: (1) for indexing purposes, objects are represented by means of feature values in a multidimensional space, and (2) a metric must be used as a measure of distance between objects. Furthermore, SAMs have been designed by assuming that distance calculation has negligible CPU (Central Processing Unit) cost, especially relative to the cost of disk I/O (Input/Output). However, this is not always the case in multimedia applications (Ciaccia & Patella, 1998). Thus, a more general approach to the similarity indexing problem has gained some popularity in recent years, leading to the development of so-called metric trees, which use a distance metric to build up the indexing structure. For metric trees, objects in a multidimensional space are indexed by their relative distances rather than their absolute positions. A vantage point is used to compute the distance between two different points and the search space is divided into two by the median value of this distance. Several metric trees have been developed so far, including the vp-tree (Chiueh, 1994), the GNAT (Brin, 1995), the mvp-tree (Bozkaya & Ozsoyoglu, 1997) and the M-tree (Ciaccia, 1997). In this study, our goal is not to develop a new indexing structure for high-dimensional image features but to use an existing one effectively. We choose the well-established M-tree access method as the underlying method for indexing our reduced composite image visual features. The M-tree is a balanced, paged metric tree that is implemented based on the GiST (Generalized Search Tree) (Hellerstein, 1995) framework. Since the design of the M-tree is inspired by both principles of metric trees and database access methods, it is optimized with respect to both CPU (distance computations) and I/O costs.
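The vantage-point idea behind these metric trees can be sketched in a few lines: pick a vantage point, split the remaining objects by the median of their distances to it, recurse, and at query time prune whole subtrees with the triangle inequality. The code below is a deliberately simplified vp-tree for illustration only (no paging, no I/O optimization) and should not be read as the M-tree or GiST implementation referred to in this chapter.

```python
import numpy as np

class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point        # vantage point
        self.radius = radius      # median distance to the vantage point
        self.inside = inside      # subtree with d(p, vantage) <= radius
        self.outside = outside    # subtree with d(p, vantage) > radius

def build_vp_tree(points, dist):
    if not points:
        return None
    vantage, rest = points[0], points[1:]
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    dists = [dist(vantage, p) for p in rest]
    radius = float(np.median(dists))
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist), build_vp_tree(outside, dist))

def nearest(node, query, dist, best=None):
    """Nearest-neighbour search; branches are pruned via the triangle inequality."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Visit the more promising side first, the other only if it could still help.
    near, far = ((node.inside, node.outside) if d <= node.radius
                 else (node.outside, node.inside))
    best = nearest(near, query, dist, best)
    if abs(d - node.radius) < best[0]:
        best = nearest(far, query, dist, best)
    return best

euclidean = lambda a, b: float(np.linalg.norm(a - b))
rng = np.random.default_rng(1)
data = [rng.random(8) for _ in range(200)]    # toy 8-dim feature vectors
tree = build_vp_tree(data, euclidean)
print(nearest(tree, rng.random(8), euclidean))
```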

HYBRID DIMENSION REDUCER


In this section, we present a novel approach to indexing large image databases that uses both low-level visual features and human visual perception. The scheme utilizes a two-layer hybrid structure that combines the advantages of LDR and NLDR into a single architecture. Before exploring the detailed structure, we give a brief overview of what kind of visual content our system considers.

Composite Image Features


In our work so far, we have considered three different visual features: color, texture and shape. Note that the CMVF is not limited to these three features and it can be further expanded to include spatial features for more effective indexing.


Color Features
It is known that the human eye responds well to color. In this work, the color feature is extracted using the histogram technique (Swain & Ballard, 1991). Given a discrete color space defined by some color axes, the color histogram is obtained by discretizing the image colors and counting the number of times each discrete color occurs in the image. In our experiments, the color space we apply is CIE L*u*v. The reason that we select CIE L*u*v instead of normal RGB or other color space is that it is more perceptually uniform. The three axes of L*u*v space are divided into four sections respectively, so we get a total of 64 (4x4x4) bins for the color histogram. However, for the image collection that we use, there are bins that never receive any count. In our experiments, the color features are represented as 37-dimensional vectors after eliminating the bins that have zero count.
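A minimal sketch of this colour-histogram extraction is given below. It assumes an RGB image held as a NumPy array and uses scikit-image's rgb2luv for the CIE L*u*v conversion; the equal-width bin boundaries and the final pruning of empty bins are simplified assumptions rather than the exact quantisation used in our system.

```python
import numpy as np
from skimage.color import rgb2luv

def luv_histogram(rgb_image: np.ndarray, bins_per_axis: int = 4) -> np.ndarray:
    """4x4x4 = 64-bin colour histogram in CIE L*u*v space, L1-normalised."""
    luv = rgb2luv(rgb_image)             # H x W x 3 array of L*u*v values
    pixels = luv.reshape(-1, 3)
    # Equal-width bins over the observed range of each axis (an assumption;
    # fixed perceptual ranges could be used instead).
    hist, _ = np.histogramdd(pixels, bins=bins_per_axis)
    hist = hist.ravel()
    return hist / hist.sum()

def drop_empty_bins(histograms: np.ndarray) -> np.ndarray:
    """Remove bins that are zero across the whole collection,
    e.g. reducing the 64 bins to the 37 used in this chapter."""
    nonzero = histograms.sum(axis=0) > 0
    return histograms[:, nonzero]

# Toy usage with a random "image".
img = np.random.rand(64, 64, 3)
print(luv_histogram(img).shape)   # (64,)
```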

Texture Features

Texture characterizes objects by providing measures of properties such as smoothness, coarseness and regularity. In this work, the texture feature is extracted using a filter-based method. This method uses amplitude spectra of images. It detects the global periodicity in the images by identifying high-energy, narrow peaks in the spectrum. The advantage of filter-based methods is their consistent interpretation of feature data over both natural and artificial images. The Gabor filter (Turner, 1986) is a frequently used filter in texture extraction. It measures a set of selected orientations and spatial frequencies. Six frequencies are required to cover the range of frequencies from 0 to 60 cycles/degree. We choose 1, 2, 4, 8, 16 and 32 cycles/degree to cover the whole range of human visual perception. Therefore, the total number of filters needed for our Gabor filter bank is 30, and texture features are represented as 30-dimensional vectors.
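The texture vector can be sketched with scikit-image's Gabor filter by recording the mean response magnitude of each frequency/orientation pair. The five orientation angles and the frequency values below (given in cycles/pixel rather than cycles/degree) are illustrative assumptions; the text above only fixes six frequency bands and a 30-dimensional result.

```python
import numpy as np
from skimage.filters import gabor
from skimage.color import rgb2gray

def gabor_texture_vector(rgb_image: np.ndarray) -> np.ndarray:
    """30-dim texture feature: 6 frequencies x 5 orientations,
    one mean response magnitude per filter (illustrative choice)."""
    gray = rgb2gray(rgb_image)
    frequencies = [0.02, 0.05, 0.10, 0.20, 0.30, 0.40]     # assumed values
    orientations = np.linspace(0, np.pi, 5, endpoint=False)
    features = []
    for f in frequencies:
        for theta in orientations:
            real, imag = gabor(gray, frequency=f, theta=theta)
            features.append(np.sqrt(real ** 2 + imag ** 2).mean())
    return np.asarray(features)            # shape (30,)

print(gabor_texture_vector(np.random.rand(64, 64, 3)).shape)
```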

Shape Features
Shape is an important and powerful attribute for image retrieval. It can represent spatial information that is not present in color and texture histograms. In our system, the shape information of an image is described based on its edges. A histogram of the edge directions is used to represent global information of the shape attribute for each image. We used the Canny edge operator (Canny, 1986) to generate edge histograms for images in the preprocessing stage. To solve the scale invariance problem, the histograms are normalized to the number of edge points in each image. In addition, smoothing procedures presented in Jain and Vailaya (1996) are used to make the histograms invariant to rotation. The histogram of edge directions is represented by 30 bins. Shape features are thus represented as 30-dimensional vectors. When forming composite feature vectors from the three types of features described above, the most common approach is to use the direct sum operation. Let x_c, x_t and x_s be the color, texture and shape feature vectors; the direct sum operation, denoted by the symbol ⊕, of these feature vectors is defined as follows:

$$x = x_c \oplus x_t \oplus x_s \qquad (2)$$


Figure 1. A hybrid image feature dimension reduction scheme. The linear PCA appears at the bottom, the nonlinear neural network is at the top, and the representation of lower dimension vector appears in the hidden layer.

The number of dimensions of the composite feature vector x is then the sum of those of the single feature vectors, that is, dim(x) = dim(x_c) + dim(x_t) + dim(x_s).
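A sketch of the shape histogram and the direct-sum composition might look as follows. It uses scikit-image's Canny detector and Sobel gradients to bin edge directions into 30 bins, then concatenates colour, texture and shape vectors into the 97-dimensional composite vector x. The smoothing and rotation-invariance steps cited above are omitted, and the helper names are illustrative rather than taken from the CMVF implementation.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import canny
from skimage.filters import sobel_h, sobel_v

def edge_direction_histogram(rgb_image: np.ndarray, bins: int = 30) -> np.ndarray:
    """30-bin histogram of gradient directions at Canny edge pixels,
    normalised by the number of edge points (scale invariance)."""
    gray = rgb2gray(rgb_image)
    edges = canny(gray)
    angles = np.arctan2(sobel_v(gray), sobel_h(gray))[edges]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    n_edges = max(len(angles), 1)
    return hist / n_edges

def composite_vector(color_vec, texture_vec, shape_vec) -> np.ndarray:
    """Direct sum (Equation 2): concatenation into dim(x_c)+dim(x_t)+dim(x_s) dims."""
    return np.concatenate([color_vec, texture_vec, shape_vec])

# 37 + 30 + 30 = 97-dimensional raw composite feature, as used by CMVF.
x = composite_vector(np.zeros(37), np.zeros(30), np.zeros(30))
print(x.shape)   # (97,)
```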

Architecture of Hybrid Image Feature Dimension Reducer


Figure 1 shows the overall architecture of our hybrid method, which is basically a two-tier hybrid architecture: dimension reduction via PCA followed by a three-layer neural network with a quickprop learning algorithm. Visual content for color, texture and shape is first extracted from each image. The raw feature vector in our system is 97-dimensional (37 dimensions for color, 30 dimensions for texture and 30 dimensions for shape). PCA is useful as an initial dimension reducer, while further dimension reduction for nonlinear correlations can be handled by NLDR.
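The two-tier reduction can be expressed as a simple pipeline: PCA maps the 97-dimensional raw vector to an intermediate size, and a trained three-layer network maps that to the final low-dimensional feature taken from its hidden layer. The sketch below only fixes the shapes and the forward pass; the intermediate and hidden dimensionalities and the trained weight matrices are placeholders, not values from this chapter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class HybridReducer:
    """Two-tier reducer: a linear PCA projection followed by the hidden
    layer of a three-layer perceptron (shapes are illustrative)."""

    def __init__(self, pca_matrix, pca_mean, w_hidden, b_hidden):
        self.pca_matrix = pca_matrix    # (m, 97): rows are the top-m eigenvectors
        self.pca_mean = pca_mean        # (97,)
        self.w_hidden = w_hidden        # (h, m): learned input-to-hidden weights
        self.b_hidden = b_hidden        # (h,):   hidden-layer biases

    def reduce(self, x):
        """97-dim raw composite vector -> h-dim indexable feature vector."""
        y = self.pca_matrix @ (x - self.pca_mean)           # linear stage (PCA)
        return sigmoid(self.w_hidden @ y + self.b_hidden)   # nonlinear stage (NN)

# Placeholder dimensions: 97 -> 20 (PCA) -> 10 (hidden layer).
rng = np.random.default_rng(2)
reducer = HybridReducer(rng.standard_normal((20, 97)), np.zeros(97),
                        rng.standard_normal((10, 20)), np.zeros(10))
print(reducer.reduce(rng.standard_normal(97)).shape)   # (10,)
```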

PCA for Dimension Reduction

Mathematically, PCA method can be described as follows. Given a set of $N$ feature vectors $\{x_k = (x_{k1}, x_{k2}, \ldots, x_{kn}) \in \mathbb{R}^n \mid k = 1 \ldots N\}$ and the mean vector $\bar{x}$, the covariance matrix $S$ can be calculated as

$$S = \frac{1}{N} \sum_{k=1}^{N} (x_k - \bar{x})(x_k - \bar{x})^T$$

Let $v_i$ and $\lambda_i$ be a pair of eigenvector and eigenvalue of the covariance matrix $S$. Then $v_i$ and $\lambda_i$ satisfy the following:

$$\lambda_i = \sum_{k=1}^{N} \left( v_i^T (x_k - \bar{x}) \right)^2$$

Since $\mathrm{trace}(S) = \sum_{i=1}^{n} \lambda_i$ accounts for the total variance of the original set of feature vectors, and since the $\lambda_i$ can be arranged in decreasing order, that is, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_n \geq 0$, if the $m$ (where $m < n$) largest eigenvalues account for a large percentage of variance then, with an $n \times m$ linear transformation matrix $T$ defined as

$$T = [v_1, v_2, \ldots, v_m], \qquad (3)$$

the $m \times n$ transformation $T^T$ transforms the original $n$-dimensional feature vectors to $m$-dimensional ones. That is,

$$T^T (x_k - \bar{x}) = y_k, \quad k = 1 \ldots N \qquad (4)$$

where $y_k \in \mathbb{R}^m$ for all $k$. The matrix $T$ above has orthonormal columns because $\{v_i \mid i = 1 \ldots n\}$ form an orthonormal basis. The key idea in dimension reduction via PCA is in the computation of the eigenvalues $\lambda_i$, the user-determined value $m$, and finally the $m \times n$ orthogonal matrix $T^T$, which is the required linear transformation. The feature vectors in the original $n$-dimensional space can be projected onto an $m$-dimensional subspace via the transformation $T^T$. The value of $m$ is normally determined by the percentage of variance that the system can afford to lose. The $i$-th component of the $y_k$ vector in (4) is called the $i$-th principal component (PC) of the original feature vector $x_k$. Alternatively, one may consider just the $i$-th column of the $T$ matrix defined in (3), and the $i$-th principal component of $x_k$ is simply

$$y_{ki} = v_i^T (x_k - \bar{x})$$

where $v_i$ is the $i$-th eigenvector of $S$.

PCA has been employed to reduce the dimensions of single feature vectors so that an efficient index can be constructed for image retrieval in an image database (Euripides & Faloutsos, 1997; Lee, 1993). It has also been applied to image coding, for example, for removing correlation from highly correlated data such as face images (Sirovich & Kirby, 1987). In this work, PCA is used as the first step in the NLDR method, where it provides optimal reduced-dimensional feature vectors for the three-layer neural network and thus speeds up the NLDR training time.
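The PCA stage above can be realised directly with an eigendecomposition of the covariance matrix, keeping the smallest m whose eigenvalues retain a chosen fraction of the total variance, following Equations 3 and 4. The 90% variance threshold in the sketch below is an arbitrary assumption.

```python
import numpy as np

def pca_reduce(X: np.ndarray, variance_kept: float = 0.90):
    """X: (N, n) matrix of feature vectors. Returns (Y, T) where
    Y = (X - mean) @ T holds the m-dimensional projections (Equation 4)
    and T = [v_1, ..., v_m] stacks the top-m eigenvectors (Equation 3)."""
    mean = X.mean(axis=0)
    S = np.cov(X - mean, rowvar=False)             # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]              # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(ratio, variance_kept) + 1)
    T = eigvecs[:, :m]                             # n x m transformation matrix
    return (X - mean) @ T, T

# Toy run: 500 raw 97-dimensional composite vectors.
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 97))
Y, T = pca_reduce(X)
print(Y.shape, T.shape)
```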


Classification Based on Human Visual Perception


Gestalt psychologists (Behrens, 1984) have observed that the human visual system deals with images by organizing parts into wholes using perceptual grouping, rather than by perceiving individual image components and then assembling them. A consequence of this is that our mind perceives whole objects even when we are looking at only a part or some component of that object. The principles of perceptual organization proposed by Gestaltists include closure, continuity, proximity and similarity (Lowe, 1985), which have been applied successfully in feature detection and scene understanding in machine vision. With these principles, our perceptual system integrates low-level features into high-level structures. Then, these high-level structures will be further combined until a semantically meaningful representation is achieved. Another fundamental and powerful Gestalt principle of visual perceptual organization is identification of objects from the surroundings. In the real world, when we are presented with an image, we tend to see things. Even when there may be little contrast between the objects and the background, our perceptual system does not seem to have any major difficulty in determining which is figure and which is background (Lerner et al., 1986). For example, a ship stands out against the background of sea and sky, a camel and a man stand out against a background of desert sand, or a group of people is easily distinguishable from a forest background. Furthermore, we would distinguish an image of a camel against a background of desert sand as more similar to an image of a camel and a man against the same background than to an image of a camel against a sandy beach. In general, we incorporate all the information in color, texture, shape and other visual or spatial features under a certain context that is presented to us and classify the image into the appropriate category.

In conducting our experiments on image classification based on human perception, we first prepared a set of 163 images, called the test-images, from our 10,000-image collection. This set covers all the different categories (total of 14) of images in the collection. Amongst these images in the set, images in each category have similarity with each other in color, in texture and in shape. We set up a simple image classification experiment on the Web and asked seven people (subjects), all from different backgrounds, to do the experiments. At the beginning of each experiment, a query image was arbitrarily chosen from the test-images and presented to the subjects. The subjects were then asked to pick the top 20 images that were similar in color, in texture and in shape to the query image from the test-images. Any image that was selected by more than three subjects was classified into the same class as the query image and was then deleted from the test-images. The experiment was repeated until every image in the test-images had been categorized into an appropriate class. The end result of the experiments is that images that are similar to each other in color, in texture and in shape are put into the same class based on human visual perception. These classification results are used in the NLDR process described below.
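The outcome of each round of this experiment can be summarised as a simple voting rule, sketched below under the threshold stated above (an image joins the query's class when more than three of the seven subjects selected it); the image identifiers and data structures are hypothetical.

```python
from collections import Counter

def assign_class(query_image: str, subject_selections: list[list[str]],
                 min_votes: int = 4) -> set[str]:
    """Images picked by more than three subjects (i.e. at least four)
    are placed in the same class as the query image."""
    votes = Counter(img for picks in subject_selections for img in picks)
    members = {img for img, n in votes.items() if n >= min_votes}
    members.add(query_image)
    return members

# Hypothetical selections from seven subjects for one query image.
picks = [["img07", "img12"], ["img07", "img33"], ["img07"], ["img07", "img12"],
         ["img12"], ["img12", "img33"], ["img07"]]
print(assign_class("img01", picks))   # {'img01', 'img07', 'img12'}
```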

Neural Network for Dimension Reduction


The advantage of using a neural network for NLDR is that the network can be trained to produce an effective solution. In the CMVF framework, a three-layer perceptron neural network with a quickprop learning algorithm (Gonzalez & Woods, 2002) is used to perform dimension reduction on composite image features.

Figure 2. A three-layer multilayer perceptron layout

The network in fact acts as a nonlinear dimensionality reducer. In Wu (1997), a special neural network called learning based on experiences and perspectives (LEP) was used to create categories of images in the domains of human faces and trademarks; however, no details are given in that work on how the training samples were created. For our system, the training samples are tuples of the form (v, c), where v is a feature vector, either a single-feature vector or a composite feature vector, and c is the number of the class to which the image represented by v belongs. The class number for each feature vector is determined by the experiments described in the previous subsection. Figure 2 depicts the three-layer neural network that we used. The units in the input layer accept the feature vector v of each training pattern; the number of units in this layer therefore corresponds to the number of dimensions of v. The hidden layer is configured to have fewer units. The number of units in the output layer corresponds to the total number of image classes M. Given a training pattern (v, c), the input layer accepts the vector v while the output layer contains (0,...,0,1,0,...,0)^T, a vector of dimension M with a 1 in the c-th component and 0s everywhere else. Each unit i in the neural network is a simple processing unit that calculates its activation s_i based on its predecessor units p_i; the overall incoming activation of unit i is given as

$net_i = \sum_{j \in p_i} s_j w_{ij} + \theta_i$   (5)

where j ranges over the predecessor units p_i of unit i, w_ij is the weight of the connection from unit j to unit i, and θ_i is the bias value of unit i. Passing the value net_i through a nonlinear activation function yields the activation value s_i of unit i. The sigmoid logistic function

$s_i = \frac{1}{1 + e^{-net_i}}$   (6)

is used as the activation function. Supervised learning is appropriate in our neural network system because we have a well-defined set of training patterns. The learning

process governed by the training patterns adjusts the weights in the network so that the desired mapping from input activations to output activations is obtained. Given a set of feature vectors and the class numbers assigned to them by the subjects, the goal of the supervised learning is to seek the global minimum of the cost function E

E=

2 1 (t pj o pj ) 2 p j

(7)

where t_pj and o_pj are, respectively, the target output and the actual output for feature vector p at node j. The rule for updating the weights of the network can be defined as follows:

$\Delta w_{ij}(t) = \eta\, d(t)$   (8)

$w_{ij}(t+1) = w_{ij}(t) + \Delta w_{ij}(t)$   (9)

where η is the parameter that controls the learning rate, and d(t) is the direction along which the weights need to be adjusted in order to minimize the cost function E. There are many learning algorithms for performing weight updates. The quickprop algorithm is one of the most frequently used adaptive learning paradigms. Its weight update can be obtained by the equation
$\Delta w_{ij}(t) = \frac{\frac{\partial E}{\partial w_{ij}}(t)}{\frac{\partial E}{\partial w_{ij}}(t-1) - \frac{\partial E}{\partial w_{ij}}(t)}\,\Delta w_{ij}(t-1)$   (10)

The training procedure of the network consists of repeated presentations of the inputs (the feature vectors v in the training tuples) and the desired outputs (the class number c for each v) to the network. The weights of the network are initially set to small random continuous values. Our network adopts the learning-by-epoch approach; that is, weight updates only happen after all the training samples have been presented to the network. In the quickprop learning algorithm, there are two important parameters: the learning rate for the gradient descent and the maximum step size. These two parameters govern the convergence of network learning. In general, the learning rate for gradient descent can vary from 0.1 to 0.9; in our system, the learning rate is kept constant during network training, and the maximum step size is set to 1.75. In every iteration of the training, the error is reduced in the direction of the minimum of the error function. This is because the training starts in the direction of the eigenvectors associated with the largest eigenvalue of each feature, so the network has less chance of being trapped in a local minimum. The total gradient error or the total number of error bits indicates the condition of network convergence. When this value does not change during network training, the

network is said to have converged. The total error is the sum of the differences between the actual outputs and the desired outputs. Since the network also functions as a pattern classifier, convergence can also be measured by the total number of error bits, where the number of error bits is determined by the difference between the actual and the desired output. During the network training process, the network weights gradually converge and the required mapping from image feature vectors to the corresponding classes is implicitly stored in the network. After the network has been successfully trained, the weights that connect the input and hidden layers are the entries of a transformation that maps the feature vectors v to lower-dimensional vectors. When a high-dimensional feature vector is passed through the network, its activation values at the hidden units form a lower-dimensional vector. This lower-dimensional feature vector keeps the most important discriminative information of the original feature vector.
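The forward pass (Equations 5 and 6) and the quickprop update (Equation 10) can be sketched as follows. This is our simplified reading, not the original implementation: the maximum-step-size bound of the full quickprop algorithm is omitted for brevity and all names are ours.

    import numpy as np

    def sigmoid(net):
        return 1.0 / (1.0 + np.exp(-net))              # Equation 6

    def forward(v, W1, b1, W2, b2):
        # Three-layer perceptron: the hidden activations are the reduced vector.
        hidden = sigmoid(W1 @ v + b1)                   # Equation 5, applied per layer
        output = sigmoid(W2 @ hidden + b2)
        return hidden, output

    def quickprop_step(w, dw_prev, grad, grad_prev, lr=0.9):
        # Equation 10: scale the previous step by grad / (grad_prev - grad);
        # fall back to plain gradient descent when that ratio is undefined.
        denom = grad_prev - grad
        usable = np.abs(denom) > 1e-12
        safe = np.where(usable, denom, 1.0)
        dw = np.where(usable, (grad / safe) * dw_prev, -lr * grad)
        return w + dw, dw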

The Hybrid Training Algorithm


The complete training algorithm for this hybrid dimension reduction is given as follows (a code sketch of the overall pipeline is given after the steps):

Step 1: For each type of feature vector, compute the covariance matrix over all N images.
Step 2: Apply eigen-decomposition to each of the covariance matrices computed in Step 1. This process yields a list of eigenvectors and eigenvalues (λ), which are normally sorted in decreasing order.
Step 3: Compute the total variance $s = \sum_{i=1}^{n} \lambda_i$ and select the m largest eigenvalues whose sum just exceeds τ% of s, where τ is a predefined cut-off value. This step selects the m largest eigenvalues that account for τ% of the total variance of the feature vectors.
Step 4: Construct the matrix T using the m corresponding eigenvectors, as given in Equation 3.
Step 5: Obtain the new representation y_k for each image feature vector x_k by applying the PCA transformation given in Equation 4.
Step 6: Select the training samples from the image collection. Group these training samples into different classes as determined by the experiments described in Section 3.2.2.
Step 7: Construct the composite feature vectors z_k from the color, texture and shape feature vectors using the direct sum operation defined in Equation 2.
Step 8: Prepare the training patterns (z_k, c_k) for each k, where c_k is the class number to which the composite feature vector z_k belongs.
Step 9: Set all the weights and node offsets of the network to small random values.
Step 10: Present the training patterns z_k as input and c_k as output to the network. The training patterns can be different on each trial; alternatively, the training patterns can be presented cyclically until the weights in the network stabilize.
Step 11: Use the quickprop learning algorithm to update the weights of the network.
Step 12: Test the convergence of the network. If the condition of convergence is satisfied, stop the network training process. Otherwise, go back to Step 10 and repeat the process. If the network does not converge, it needs a new starting point; in that case, go back to Step 9 instead of Step 10.
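The following high-level sketch ties the steps together; pca_reduce is the helper sketched earlier, while train_quickprop stands in for the network training of Steps 9-12 and is assumed rather than shown.

    import numpy as np

    def build_cmvf_features(color, texture, shape, training_ids, class_labels,
                            variance_cutoff=0.99):
        # Steps 1-5: PCA on each single-feature space over all N images.
        reduced = [pca_reduce(F, variance_cutoff)[0] for F in (color, texture, shape)]
        # Step 7: composite vectors by direct sum (concatenation, Equation 2).
        composite = np.concatenate(reduced, axis=1)
        # Steps 6 and 8: training patterns (z_k, c_k) from the classified samples.
        Z = composite[training_ids]
        C = np.asarray(class_labels)
        # Steps 9-12: train the three-layer network with quickprop until convergence.
        network = train_quickprop(Z, C, hidden_units=10, lr=0.9, max_step=1.75)
        return network, composite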


Steps 1~5 cover the PCA dimension reduction procedure, which is applied to all images in the database rather than only to the training samples. This has the advantage that the covariance matrix for each type of single feature vector captures the global variance of the images in the database. The number of principal components to be used is determined by the cut-off value τ. There is no formal method for defining this cut-off value; in Step 3, the cut-off value is set to 99, so the minimum variance retained after PCA dimension reduction is at least 99%. After the completion of PCA, the images are classified into classes in Step 6. Steps 7~12 then prepare the necessary input and output values for the network training process; the network training itself corresponds to Steps 8~11. As noted above, the weight of each link is initialized to a small random continuous value. In the quickprop learning algorithm, the parameter that limits the step size is set to 1.75, and the learning rate for the gradient descent can vary from 0.1 to 0.9. Each time we apply the quickprop learning algorithm, the weight of each link in the network is updated. After a specified number of applications of the quickprop learning algorithm, the convergence of the network is tested in Step 12. At this point, it is decided whether the network has converged or whether a new starting weight is required for each link of the network; in the latter case, the process in Steps 9~12 is repeated.

EXPERIMENTS AND DISCUSSIONS


In the following, we present experimental results that demonstrate the effectiveness of the feature vectors generated by CMVF by comparing it with systems that generate reduced feature vectors based solely on PCA and with a pure neural network without initial PCA. To further illustrate the advantages of CMVF, its robustness against various kinds of image distortion and against the initial setup of the neural network is also presented.

The CMVF framework has been designed and fully implemented in the C++ and Java programming languages, and an online demonstration with a CGI-based Web interface is available for users to evaluate the system (Shen, 2003). Figure 3 presents the various components of this system. A user can submit an image, either from the existing image database or from another source, as a query. The system searches for the images that are most similar in visual content; the matching images are displayed in similarity order, starting from the most similar, and users can score the results. The query can be executed with any of the following retrieval methods: PCA only, neural network only, and CMVF with different visual feature combinations. Users can also choose a distorted version of the selected image as the query example to demonstrate CMVF's robustness against image variability.

Test Image Collection


To conduct the experiment, we constructed a collection of 10,000 images. These images were retrieved from different public domain sources and can be classified under a number of high-level semantic categories that cover natural scenery, architectural buildings, plants, animals, rocks, flags, and so forth.

Figure 3. Overall architecture of a content-based image retrieval system based on CMVF

All images were scaled to the same size (128×128 pixels). A subset of this collection was then selected to form the training samples (test-images). There were three steps involved in forming the training samples. First, we decided on the number of classes according to the themes of the image collection and selected one image for each class from the collection of 10,000 images; this can be done with the help of a domain expert. Next, we built three M-tree image databases for the collection: the first used color as the index, the second used texture as the index and the third used shape as the index. For each image in each class, we retrieved the most similar images in color using the color index to form a color collection; we then repeated the same procedure to obtain images similar in texture and in shape for each image in each class, forming a texture collection and a shape collection. Finally, we obtained our training samples1, which are similar in color, in texture and in shape, by taking the intersection of the images from the color, texture and shape collections. The training samples (test-images) were presented to the subjects for classification. To test the effectiveness of additional feature integration in image classification and retrieval, we use the same procedure as described in the previous section for generating test-images with an additional visual feature.
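A sketch of how the training samples could be assembled from the three single-feature indexes; knn(index, image, k) is an assumed helper that returns the k most similar images from one M-tree index, and k itself is an illustrative parameter that the text does not specify.

    def build_training_samples(seed_images, color_index, texture_index, shape_index, k=60):
        # One seed image per class; keep only images similar in all three features.
        samples = {}
        for class_id, seed in enumerate(seed_images):
            similar_color = set(knn(color_index, seed, k))
            similar_texture = set(knn(texture_index, seed, k))
            similar_shape = set(knn(shape_index, seed, k))
            samples[class_id] = similar_color & similar_texture & similar_shape
        return samples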

Evaluation Metrics
In our experiment, since not all relevant images are examined, common measures such as standard recall and precision are inappropriate. Thus, we select the concepts of normalized precision (P_n) and normalized recall (R_n) (Salton & McGill, 1993) as evaluation metrics. High precision means that there are few false alarms (i.e., few irrelevant images are returned), while high recall means there are few false dismissals (i.e., few relevant images are missed). The formulas for these two measures are


$R_n = 1 - \frac{\sum_{i=1}^{R}(rank_i - i)}{R\,(N - R)}$

$P_n = 1 - \frac{\sum_{i=1}^{R}(\log rank_i - \log i)}{\log\left(\frac{N!}{(N-R)!\,R!}\right)}$

where N is the number of images in the dataset (equal to 10,000), R is the number of relevant images, and rank_i denotes the rank of the i-th relevant image. During the test, the top 60 images are evaluated in terms of similarity.
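As a worked sketch, the two measures can be computed from the (1-based) ranks of the relevant images as follows; this simply follows the standard Salton and McGill definitions reconstructed above, with names chosen by us.

    import math

    def normalized_recall_precision(relevant_ranks, N):
        ranks = sorted(relevant_ranks)
        R = len(ranks)
        # Normalized recall: penalize how far each relevant image sits below its ideal rank.
        Rn = 1.0 - sum(r - i for i, r in enumerate(ranks, start=1)) / (R * (N - R))
        # Normalized precision: same idea on a log scale; log(N! / ((N-R)! R!)) via lgamma.
        log_ideal = sum(math.log(i) for i in range(1, R + 1))
        log_comb = math.lgamma(N + 1) - math.lgamma(N - R + 1) - math.lgamma(R + 1)
        Pn = 1.0 - (sum(math.log(r) for r in ranks) - log_ideal) / log_comb
        return Rn, Pn

For instance, normalized_recall_precision([1, 2, 5], 10000) scores a query whose three relevant images appear at ranks 1, 2 and 5.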

Query Effectiveness of Reduced Dimensional Image Features


To compare the effectiveness of the three different methods for image feature dimension reduction, a set of experiments was carried out. In these experiments, we use the M-tree as the basis for the indexing structure. The dimension of the M-tree is set to 10, which corresponds to the number of hidden units used in the neural networks. In principle, every image in the collection can serve as a query image; we randomly selected 20 images from each category of the collection as queries. Figure 4 shows the results of queries posed against all 14 classes of images using the three M-trees, which index the three feature spaces generated by CMVF, the pure neural network and PCA. As shown in Figure 4, CMVF achieves a significant improvement in similarity search over PCA for every category in the collection: the improvement in recall ranges from 14.3% to 30% and in precision from 23.2% to 37%, depending on the image class. The reason for this better performance is that in CMVF we build indexing vectors from high-dimensional raw feature vectors via PCA and a trained neural network classifier, which compresses not only various kinds of visual features but also semantic classification information into a small feature vector.

Figure 4. Comparing the hybrid method with PCA and the neural network on average normalized recall and precision rates, plotted against class ID: (a) recall rate, (b) precision rate. The results are obtained with the visual feature combination of color, texture and shape.

Table 1. Comparison of different dimensionality reduction methods in query effectiveness and training cost
Dimension Reduction Method   Ave. Recall Rate (%)   Ave. Prec. Rate (%)   Training Cost (epochs)
PCA                          63.1                   44.6                  N/A
Neural Network               77.2                   60.7                  7035
CMVF                         77.2                   60.7                  4100

Moreover, we can also see from Figure 4 that the recall and precision values of the neural network and the hybrid method are almost the same. The major difference between the two approaches is the time required to train the network: based on Table 1, compared with the pure neural network, CMVF saves nearly 40% of the training time in the learning process. This efficiency is gained by using a relatively small number of neural network inputs. One can therefore conclude that it is advantageous to use hybrid dimension reduction to reduce the dimensions of image features for effective indexing. An example illustrating the query effectiveness of the different dimension reduction methods is shown in Appendix A, using an image of a cat as the query example. Compared with PCA, CMVF achieves superior retrieval results: among the first nine results, CMVF returns nine out of nine matches, whereas PCA retrieves only two similar images in the top nine. On the other hand, the query effectiveness of the reduced feature space produced by CMVF is very close to that of the space generated by the pure neural network, also with nine out of nine matches; the major difference is the order of the images in the final result list. We conclude from this experiment that, by incorporating human visual perception, CMVF is indeed an effective and efficient dimension reduction technique for indexing large image databases.

Effects on Query Effectiveness Improvement with Additional Visual Feature Integration


One of our conjectures is that it is possible to obtain effective retrieval results from low-dimensional indexing vectors if these vectors are constructed from a combination of multiple visual features; thus, when more discriminative information is integrated into the final vector, a systematic performance improvement can be achieved. To find out how various visual feature configurations contribute to the improvement of query results, a series of experiments was carried out in which new visual features were progressively incorporated into CMVF and the results compared on a single set of queries. The system was tested with four different visual feature combinations: (color, texture), (color, shape), (shape, texture) and (color, texture, shape). As shown in Figures 5a and 5b, after the addition of the shape feature to CMVF and the neural network, there is a significant improvement in the recall and precision rates. On average, using color, texture and shape gives an additional 13% and 18% improvement in recall and precision rates over the other three configurations, each of which considers only two features.

Figure 5. Comparison of query effectiveness with different dimension reduction schemes
Figure 5a. Comparing precision and recall rates of CMVF with different visual feature combinations: (a) recall rate, (b) precision rate
Figure 5b. Comparing precision and recall rates of the neural network with different visual feature combinations: (a) recall rate, (b) precision rate
Figure 5c. Comparing precision and recall rates of PCA with linear concatenation of different visual feature combinations: (a) recall rate, (b) precision rate
(Each panel plots average normalized recall or precision rate against class ID for the feature combinations color+texture+shape, color+texture, color+shape and shape+texture.)


However, the advantage of CMVF over the pure neural network is that it requires less training cost to achieve results of the same quality. On the other hand, from Figure 5c we can see that the query effectiveness of a feature vector generated by PCA does not show any improvement with additional visual feature integration; in fact, there is a slight drop in precision and recall rates in some cases. For example, in image class 5, if the system uses only color and texture, a 61% normalized recall rate can be achieved; interestingly, the normalized recall rate with a feature combination that includes color, texture and shape is only 60%, essentially the same as that achieved using just color and texture. Appendix B shows an example of the query effectiveness gained by adding the shape feature; the addition of shape clearly produces a better query result. We again used an image of a cat as the query example. With the feature configuration including color, texture and shape, CMVF retrieved 12 images containing a cat in the first 12 matches; without shape, only seven of the top 12 matches contained a cat.

Robustness
Robustness is a very important property of a content-based image retrieval (CBIR) system. In this section, we investigate CMVF's robustness against both image distortion and the initial configuration of the neural network.

Image Distortion
Humans are capable of correctly identifying and classifying images even in the presence of moderate amounts of distortion. This property is potentially useful in real-life image database applications, where the query image may be accompanied by noise and distortion; a typical example is a low-quality scan of a photograph. Since CMVF is trained to reduce the dimensionality of raw visual feature vectors, this suggests that if we were to train it using not only the original images but also distorted versions of those images, it might become more robust in recognizing images with minor noise or distortion. We therefore produced altered versions of images as additional learning examples for training and carried out a series of experiments to determine how much improvement this additional training would bring. We randomly chose 10 images from each category in the training data, applied a specific distortion to each image and included the distorted image in the training data. This process was repeated for each type of distortion, yielding a neural network that should have been trained to recognize images in the presence of any of the trained distortions. In order to evaluate the effect of this on query performance, we ran the same set of test queries to measure precision and recall rates. However, each query image was distorted before being used as a query, and the ranks of the result images for this query were compared against the ranks of the result images for the non-distorted query image. This was repeated for varying levels of distortion. Figure 6 summarizes the results and Appendix C shows a query example. With the incorporation of human visual perception, CMVF is a robust indexing technique: it performs well under different kinds of image variation, including color distortion, sharpness changes, shifting and rotation (Gonzalez & Woods, 2002). The experiment shows that on

the average, CMVF is robust to blurring with an 11×11 Gaussian filter or an 11×11 median filter, random spread by 10 pixels, pixelization by nine pixels, and various kinds of noise, including Gaussian and salt & pepper noise.
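A sketch of how such distorted training variants might be generated with the Pillow imaging library; the specific radii below are illustrative choices, while the 12% salt and pepper level matches the example in Appendix C. This is not the authors' tooling.

    import numpy as np
    from PIL import Image, ImageFilter

    def distorted_variants(img):
        # img is assumed to be an RGB PIL image (array shape (H, W, 3)).
        variants = {
            "gaussian_blur": img.filter(ImageFilter.GaussianBlur(radius=5)),
            "median_blur": img.filter(ImageFilter.MedianFilter(size=11)),
        }
        # Salt and pepper noise: flip 12% of the pixels to black or white.
        arr = np.array(img)
        mask = np.random.rand(*arr.shape[:2]) < 0.12
        arr[mask] = np.where(np.random.rand(mask.sum(), 1) < 0.5, 0, 255)
        variants["salt_pepper"] = Image.fromarray(arr)
        return variants

Each variant keeps the class label of its source image and is passed through the same feature extraction before being added to the training data.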

Neural Network Initialization

Another aspect of robustness to investigate in CMVF is the degree to which it is affected by the initial configuration of the neural network. In CMVF, the weights of the neural network are initially set to small random continuous values, so the system may end up with different configurations for the same training data. It is thus important to know how much the final query effectiveness is influenced by the initial choice of weights. In order to investigate this, we focused on how the initial weights influence the final ranking of query results. We built twenty dimension reducers, each with a different initial configuration, ran the same set of query images on each resulting neural network, and compared the query result lists. First, we randomly selected a query image and performed a similarity search using system one. From the result list, we chose the top 60 results as reference images. We then ran the same query example on the other 19 systems and compared the ranks of these 60 reference images. The rank deviation, rank_dev, was used to measure the rank difference of the same reference image across the different models:

$rank\_dev = \frac{1}{S}\sum_{s=1}^{S}\frac{\sum_{n=1}^{N}\left| rank_n^s - ini\_rank_n \right|}{N}$

where N is the total number of reference images in the study list, ini_rank_n is the initial rank of reference image n, rank_n^s is the rank of reference image n in system s, and S denotes the number of systems with different initial states. If CMVF were insensitive to its initialization, the reference images would have roughly the same ranking in each of the systems. Table 3 shows that this is not the case: the average rank_dev over all reference images is 16.5, so overall the initialization of the neural network does influence the result. However, in order to study this effect in more detail, we divided the reference images into six groups (study lists) based on their initial position in system one: group 1 represents the top 10 (most similar) images (with initial ranks from 1 to 10), group 2 contains the next most similar images (with initial ranks from 11 to 20), and so on, up to group 6, which contains the images initially ranked 51-60. If we look at the lower part of the reference image list (such as groups 5 and 6), we can see that rank_dev is quite large, which means the initial state of the neural network has a big impact on the order of those results. However, rank_dev is fairly small for the top part of the ranked list (such as group 1). This indicates that for the top-ranked images (the most similar images), the results are relatively insensitive to differences in the initial configuration of the neural network.
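One straightforward reading of the rank_dev formula above, as a sketch; the dictionary-based bookkeeping is ours, not from the original system.

    def rank_deviation(initial_ranks, system_ranks):
        # initial_ranks: {image: rank in system one}
        # system_ranks: one {image: rank} dict per additional system (19 here)
        N, S = len(initial_ranks), len(system_ranks)
        total = sum(abs(ranks[img] - r0)
                    for ranks in system_ranks
                    for img, r0 in initial_ranks.items())
        return total / (S * N)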

Analysis and Discussion


The results show that the proposed hybrid dimension reduction method is superior to the other two dimension reduction methods, PCA and the pure neural network, when each is applied alone.

Figure 6. Robustness of the CMVF against various image alterations. Each panel plots the rank of the target image against the degree of alteration: (a) blur (filter size); (b) brighten, darken and sharpen (percentage of variation); (c) pixelize and random spread (pixels of variation); (d) Gaussian noise (standard deviation); (e) salt and pepper noise (percentage of noise pixels); (f) more and less saturation (percentage of variation)


In this section we present a discussion of the issues related to the performance of this hybrid method.

Parameters for Network Training


A wide variety of parameter values was tested in order to find an optimal choice for the network-learning algorithm in the experiments just discussed. In practice, however, it is often undesirable or even impossible to perform large parameter test series, and different practical applications may require different parameter settings for the network. In our case, the optimal parameters for the quickprop algorithm are a step size of 1.75 and a learning rate of 0.9. The number of hidden units used can also greatly affect the network's convergence and learning time: the more hidden units there are, the easier it is for the network to learn, because more hidden units can retain more information. However, since the network is a dimension reducer, the number of hidden units is restricted to a practical limit.

Number of Principal Components used in Network Training


In the hybrid dimension reduction, the inputs to the network are not the original image features but the image features transformed by PCA. The number of principal components (PCs) selected may affect the network's performance. It may not be necessary to use very many PCs for network training; on the other hand, the network may not train well with too few PCs, since important information in the feature vectors may then be excluded from the training process. To complement the study of the efficiency of our techniques, we report the results of using different numbers of PCs for the hybrid dimension reduction on this image collection.

Table 3. Rank deviation comparison between different study lists


Class No   rank_dev (all refs)   group 1   group 2   group 3   group 4   group 5   group 6
1          14.5                  0.4       1.2       5.7       10.4      26.4      42.7
2          18.6                  0.5       1.3       7.1       12.3      38.3      52.1
3          16.3                  0.7       1.8       6.6       11.8      28.8      47.6
4          17.2                  0.4       1.9       5.9       12.9      32.9      48.9
5          17.8                  0.6       1.3       7.5       11.7      36.7      49.5
6          15.4                  0.3       1.8       7.8       10.5      33.5      38.8
7          15.9                  0.8       1.7       7.6       10.9      34.9      39.6
8          15.7                  0.5       2.8       6.7       11.4      32.4      40.7
9          15.9                  0.7       2.1       7.5       12.4      31.4      41.5
10         17.4                  0.6       2.3       6.8       9.8       35.8      48.8
11         17.1                  0.6       1.9       6.9       10.7      33.3      46.1
12         15.9                  0.5       1.7       6.7       12.1      34.6      47.4
13         16.1                  0.7       1.6       7.1       12.5      32.9      44.1
14         16.9                  0.6       2.0       6.9       10.3      31.6      42.8
Average    16.5                  0.6       1.8       6.9       11.4      33.1      45.1


Table 4 shows the learning time for different numbers of PCs. It can be seen that the number of PCs that gives the best network training in our application depends on the total variance they account for. There are no significant differences in the time required for network training from 35 to 50 PCs, since these account for more than 99% of the total variance. Moreover, since the eigenvalues are in decreasing order, increasing the number of PCs beyond the first 40 does not require much extra time to train the network; for example, there is only a 40-epoch difference between 45 PCs and 50 PCs. However, if we choose a number of PCs whose total variance is less than 90% of the total, the differences are significant: it takes 7100 epochs for 10 PCs, which account for 89.7% of the total variance, to reach the final network error of 0.02, far more than the epochs needed when the number of PCs is 35 or larger.

Scalability and Updates


The number of images that we used in our experiments for testing our dimension reducer is 10,000, which is a reasonably large image database collection. From our experience, the most time-consuming part of the system is not the neural network training process itself but the collection of training samples for the neural network: it took us around 40 hours to collect a suitable set of training samples (163 images) from the 10,000 images, versus 8 minutes to train on those samples using a SUN Sparc machine with 64MB of RAM. The creation of training samples is a one-time job that is performed off-line. The indexing structure that we used is the well-known M-tree, whose scalability has been demonstrated in many spatial information systems. If a new image needs to be added, its image features (color, texture and shape) are extracted first and then combined; the combined image features are passed through PCA and the neural network for dimension reduction, and the reduced feature vector can then be easily inserted into the M-tree. However, if a new image class needs to be added, the neural network has to be retrained and the indexes rebuilt. On the other hand, if an image needs to be deleted, all that is required is the deletion of the corresponding index entry from the M-tree, which is much simpler.
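The insertion path for a new image might look like the following sketch. The three feature extractors, the network's hidden_activation method and mtree.insert are assumed helpers, and pca_models pairs each stored PCA transform with its feature mean as returned by the earlier pca_reduce sketch.

    import numpy as np

    def index_new_image(img, pca_models, network, mtree):
        # Extract the raw single features for the new image.
        features = [extract_color(img), extract_texture(img), extract_shape(img)]
        # Apply the stored PCA transforms, then concatenate (direct sum).
        reduced = [(f - mean) @ T for f, (T, mean) in zip(features, pca_models)]
        composite = np.concatenate(reduced)
        # The hidden-layer activations are the low-dimensional index vector.
        key = network.hidden_activation(composite)
        mtree.insert(key, img)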

Table 4. Learning time for different number of PCs


Number of PCs   Total Variance (%)   Learning Error   Learning Time (epochs)
7               81.5                 57.3             >100,000
10              89.7                 0.02             7100
15              93.8                 0.02             4320
20              95.5                 0.02             3040
25              97.5                 0.02             1830
30              98.1                 0.02             1440
35              99.1                 0.02             1200
40              99.4                 0.02             870
45              99.7                 0.02             910
50              99.8                 0.02             950


SUMMARY
To tackle the dimensionality curse problem for multimedia databases, we have proposed a novel indexing scheme that combines different types of image features to support queries involving composite multiple features. The novelty of this approach is that various visual features and semantic information can be easily fused into a small feature vector that provides effective (good discrimination) and efficient (low dimensionality) retrieval. The core of this scheme is the combination of PCA and a neural network into a hybrid dimension reducer. PCA provides an optimal selection of features to reduce the training time of the neural network. Through the learning phase of the network, the context that the human visual system uses for judging the similarity of visual features in images is acquired; it is implicitly represented in the network weights after training. The feature vectors computed at the hidden units of the neural network (which have a small number of dimensions) become our reduced-dimensional composite image features, and the distance between any two feature vectors at the hidden layer can be used directly as a measure of similarity between the two corresponding images. We have developed a learning algorithm to train the hybrid dimension reducer and tested this hybrid dimension reduction method on a collection of 10,000 images. The result is that it achieves the same level of accuracy as the standard neural network approach with a much shorter network training time. We have also presented the output quality of our hybrid method for indexing the test image collection using M-trees. This shows that our proposed hybrid dimension reduction of image features can correctly and efficiently reduce the dimensions of image features and accumulate the knowledge of human visual perception in the weights of the network, and it suggests that other existing access methods may also be used efficiently. Furthermore, the experimental results illustrate that by integrating additional visual features, CMVF's retrieval effectiveness can be improved significantly. Finally, we have demonstrated that CMVF can be made robust against a range of image distortions and is not significantly affected by the initial configuration of the neural network. The issue that remains to be studied is establishing a formal framework for studying the effectiveness and efficiency of additional visual feature integration. There is also a need to investigate more advanced machine learning techniques that can incrementally reclassify images as new images are added.

REFERENCES

Behrens, R. (1984). Design in the visual arts. Englewood Cliffs, NJ: Prentice Hall.
Bozkaya, T., & Özsoyoglu, M. (1997). Distance-based indexing for high-dimensional metric spaces. In Proceedings of the 16th ACM SIGMOD International Conference on Management of Data (SIGMOD'97), Tucson, Arizona, USA (pp. 357-368).
Brin, S. (1995). Near neighbor search in large metric spaces. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95), Zurich, Switzerland (pp. 574-584).
Canny, J. (1986). A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6), 679-698.


Chiueh, T. (1994). Content-based image indexing. In Proceedings of the 20th International Conference on Very Large Databases (VLDB'94), Santiago de Chile, Chile (pp. 582-593).
Ciaccia, P., & Patella, M. (1998). Bulk loading the M-tree. In Proceedings of the Ninth Australian Database Conference (ADC'98), Perth, Australia (pp. 15-26).
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Databases (VLDB'97), Athens, Greece (pp. 426-435).
Euripides, G.M.P., & Faloutsos, C. (1997). Similarity searching in medical image databases. IEEE Transactions on Knowledge and Data Engineering, 3(9), 435-447.
Fahlman, S.E. (1988). An empirical study of learning speed for back-propagation networks. Technical Report CMU-CS-88-162, Carnegie Mellon University.
Faloutsos, C., Barber, R., Flickner, M., Niblack, W., Petkovic, D., & Equitz, W. (1994). Efficient and effective querying by image content. Journal of Intelligent Information Systems, 3(3/4), 231-261.
Fukunaga, K., & Koontz, W. (1970). Representation of random processes using the Karhunen-Loève expansion. Information and Control, 16(1), 85-101.
Gonzalez, R., & Woods, R. (2002). Digital image processing. New York: Addison Wesley.
Hellerstein, J.M., Naughton, J.F., & Pfeffer, A. (1995). Generalized search trees for database systems. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB'95), Zurich, Switzerland (pp. 562-573).
Jain, A.K., & Vailaya, A. (1996). Image retrieval using color and shape. Pattern Recognition, 29(8), 1233-1244.
Kittler, J., & Young, P. (1973). A new approach to feature selection based on the Karhunen-Loève expansion. Pattern Recognition, 5(4), 335-352.
Lee, D., Barber, R.W., Niblack, W., Flickner, M., Hafner, J., & Petkovic, D. (1993). Indexing for complex queries on a query-by-content image database. In Proceedings of SPIE Storage and Retrieval for Image and Video Database III, San Jose, California (pp. 24-35).
Lerner, R.M., Kendall, P.C., Miller, D.T., Hultsch, D.F., & Jensen, R.A. (1986). Psychology. New York: Macmillan.
Lowe, D.G. (1985). Perceptual organization and visual recognition. Kluwer Academic.
Salton, G., & McGill, M. (1993). Introduction to modern information retrieval. New York: McGraw-Hill.
Sellis, T., Roussopoulos, N., & Faloutsos, C. (1987). The R+-tree: A dynamic index for multidimensional objects. In Proceedings of the 12th International Conference on Very Large Databases (VLDB'87), Brighton, UK (pp. 507-518).
Shen, J., Ngu, A.H.H., Shepherd, J., Huynh, D., & Sheng, Q.Z. (2003). CMVF: A novel dimension reduction scheme for efficient indexing in a large image database. In Proceedings of the 22nd ACM SIGMOD International Conference on Management of Data (SIGMOD'03), San Diego, California (p. 657).
Sirovich, L., & Kirby, M. (1987). A low-dimensional procedure for the identification of human faces. Journal of the Optical Society of America, 4(3), 519.
Swain, M.J., & Ballard, D.H. (1991). Color indexing. International Journal of Computer Vision, 7(1), 11-32.
Turner, M. (1986). Texture discrimination by Gabor functions. Biol. Cybern., 55, 71-82.


White, D., & Jain, R. (1996). Similarity indexing with the SS-tree. In Proceedings of the 12th International Conference on Data Engineering, New Orleans (pp. 516-523).
Wu, J.K. (1997). Content-based indexing of multimedia databases. IEEE Transactions on Knowledge and Data Engineering, 9(6), 978-989.

ENDNOTE
1. The size of the training sample is predefined; in this study, the size is 163.


APPENDIX A
This example compares the query effectiveness of different dimension reduction methods (CMVF, the pure neural network and PCA) with a feature combination including color, texture and shape.

Query Result with CMVF: Nine out of nine matches

Query Result with Neural Network: Nine out of nine matches

Query Result with PCA: Two out of nine matches


APPENDIX B
An example that demonstrates query effectiveness improvement due to integration of shape information

Query result of CMVF with color and texture: Seven out of twelve matches

Query result of CMVF with color, texture and shape: Twelve out of twelve matches


APPENDIX C

Demonstration of the robustness of CMVF against various image alterations. Only the best four results are presented. The first image in every column is the query example; the database contains 10,000 images.

(a) Blur with 11x11 Gaussian filter

(b) Blur with 11x11 Median filter

(c) Pixelize at nine pixels

(d) Random spread at 10 pixels

(e) 12% Salt & pepper noise


Chapter 2

From Classification to Retrieval: Exploiting Pattern Classifiers in Semantic Image Indexing and Retrieval


Joo-Hwee Lim, Institute for Infocomm Research, Singapore Jesse S. Jin, The University of Newcastle, Australia

ABSTRACT

Users query images by using semantics. Though low-level features can be easily extracted from images, they are inconsistent with human visual perception; hence, low-level features cannot provide sufficient information for retrieval. High-level semantic information is useful and effective in retrieval. However, semantic information is heavily dependent upon semantic image regions and beyond, which are themselves difficult to obtain. Bridging this semantic gap between computed visual features and user query expectation poses a key research challenge in managing multimedia semantics. As a spin-off from pattern recognition and computer vision research more than a decade ago, content-based image retrieval research focuses on a problem different from pattern classification, though the two are closely related. When the patterns concerned are images, pattern classification could become an image classification problem or an object recognition problem. While the former deals with the entire image

as a pattern, the latter attempts to extract useful local semantics, in the form of objects, in the image to enhance image understanding. In this chapter, we review the role of pattern classifiers in state-of-the-art content-based image retrieval systems and discuss their limitations. We present three new indexing schemes that exploit pattern classifiers for semantic image indexing, and illustrate the usefulness of these schemes on the retrieval of 2,400 unconstrained consumer images.

INTRODUCTION
Users query images by using semantics. For instance, in a recent paper, Enser (2000) gave a typical request to a stock photo library, using broad and abstract semantics to describe the images one is looking for: "Pretty girl doing something active, sporty in a summery setting, beach not wearing lycra, exercise clothes more relaxed in tee-shirt. Feature is about deodorant so girl should look active not sweaty but happy, healthy, carefree nothing too posed or set up nice and natural looking." Using existing image processing and computer vision techniques, low-level features such as color, texture, and shape can be easily extracted from images. However, they have proved to be inconsistent with human visual perception, let alone capable of capturing broad and abstract semantics as illustrated by the example above. Hence, low-level features cannot provide sufficient information for retrieval. High-level semantic information is useful and effective in retrieval. However, semantic information is heavily dependent upon semantic image regions and beyond, which are themselves difficult to obtain. Between low-level features and high-level semantic information, there is a so-called semantic gap. Content-based image retrieval research has yet to bridge this gap between the information that one can extract from the visual data and the interpretation that the same data have for a user in a given situation (Smeulders et al., 2000). In our opinion, the semantic gap is due to two inherent problems. One problem is that the extraction of complete semantics from image data is extremely hard, as it demands general object recognition and scene understanding; this is called the semantics extraction problem. The other problem is the complexity, ambiguity and subjectivity in user interpretation, that is, the semantics interpretation problem. They are illustrated in Figure 1. We think that these two problems are manifestations of two one-to-many relations. In the first one-to-many relation, which makes the semantics extraction problem difficult, a real-world object, say a face, can be presented in various appearances in an image. This could be due to the illumination conditions when the image of the face is recorded; the parameters associated with the image capturing device (focus, zooming, angle, distance, etc.); the pose of the person; the facial expression; artifacts such as spectacles and hats; variations due to moustache, aging, and so forth. Hence, the same real-world object may not have consistent color, texture and shape as far as computer vision is concerned.


Figure 1. Semantic gap between visual data and user interpretation: the semantics extraction problem and the semantics interpretation problem

The other one-to-many relation is related to the semantics interpretation problem. Given an image, there are usually many possible interpretations due to several factors. One factor is task-related: different regions or objects of interest might be focused upon depending on the task or need at hand. For instance, a user looking for beautiful scenic images as wallpaper for his or her desktop computer would emphasize the aesthetic aspect of the images (besides an additional requirement of very high resolution). Furthermore, differences in culture, education background, gender, and so forth, would also inject subjectivity into a user's interpretation of an image, not to mention that perception and judgement are not time-invariant. For example, a Chinese user may look for red-dominant images in designing greeting cards for auspicious events, but these images may not have special appeal to a European user. As a spin-off from pattern recognition and computer vision research more than a decade ago (Smeulders et al., 2000), content-based image retrieval research focuses on a problem different from pattern classification, though the two are closely related. In pattern classification, according to Bayes decision theory, we should select the class C_i with the maximum a posteriori probability P(C_i|x) for a given pattern x in order to minimize the average probability of classification error (Duda & Hart, 1973, p. 17). When the construction of pattern classifiers relies on statistical learning from observed data, the models for the pattern classifiers can be parametric or non-parametric. When the patterns concerned are images, pattern classification could become an image classification problem (e.g., Vailaya et al., 2001) or an object recognition problem (e.g., Papageorgiou et al., 1998). While the former deals with the entire image as a pattern, the latter attempts to extract useful local semantics, in the form of objects, in the image to enhance image understanding. Needless to say, the success of accurate object recognition would result in better scene understanding and hence more effective image classification.

In content-based image retrieval, the objective of a user is to find images relevant to his or her information need, expressed in some form of query input to an image retrieval system. Given an image retrieval system with a database of N images (assuming N is large and stable for a query session), the hidden information need of a user cast over the N images can be modeled as the posterior probability of the class of relevant images R, given an expression of the information need in the form of a query specification q and an image x in the current database, P(R|q, x). This formulation follows the formalism of probabilistic text information retrieval (Robertson & Sparck Jones, 1976). Here we assume that the image retrieval system can compute P(R|q, x) for each x in the database. The objective of the system is to rank and return the images in descending order of probability of relevance to the user. Certainly, the image classification and object recognition problems are related to the image retrieval problem, as their solutions would provide better image semantics to an image retrieval system and boost its performance. However, the image retrieval problem is inherently user-centric or query-centric: there is no predefined class, and the number of object classes to be recognized to support queries is huge (Smeulders et al., 2000) in unconstrained or broad domains.

In this chapter, we review the role of pattern classifiers in state-of-the-art content-based image retrieval systems and discuss their limitations (Section 2). We propose three new indexing schemes that exploit pattern classifiers for semantic image indexing (Section 3) and illustrate the usefulness of these schemes on the retrieval of 2,400 unconstrained consumer images (Section 4). Last but not least, we provide our perspective on the future trend in managing multimedia semantics involving pattern classification and related research challenges in Section 5, followed by a concluding remark.
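A minimal sketch of the probabilistic ranking formulation above; relevance_model.posterior(q, x) is an assumed stand-in for whatever estimate of P(R|q, x) a particular system can compute, and the cut-off of 60 results is illustrative only.

    def retrieve(query, database, relevance_model, top_k=60):
        # Score every image and return them in descending probability of relevance.
        scored = [(relevance_model.posterior(query, x), x) for x in database]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [img for _, img in scored[:top_k]]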

RELEVANT RESEARCH
User studies on the behavior of users of image collections are limited. The most comprehensive effort in understanding what a user wants to do with an image collection is Enser's work on image libraries (Enser, 1993; Enser, 1995) (and also video libraries (Armitage & Enser, 1997)) for media professionals. Other user studies have focused on newspaper photo archives (Ornager, 1996; Markkula & Sormunen, 2000), art images (Frost et al., 2000), and a medical image archive (Keister, 1994). Typically, knowledgeable users searched and casual users browsed, but all users found both searching and browsing useful. As digital cameras and camera phones proliferate, managing personal image collections effectively and efficiently, with semantic organization and access to the images, is becoming a genuine problem to be tackled in the near future. The most relevant findings on how consumers manage their personal digital photos come from the user studies by K. Rodden (Rodden & Wood, 2003; Rodden, 1999). In particular, Rodden and Wood (2003) found that few people will perform annotation, and that comprehensive annotation is not practical, whether typed or spoken. Without text annotation, it is not possible to perform text-based retrieval; hence, the semantic gap problem remains unsolved. Content-based image retrieval research has progressed from the pioneering feature-based approach (Bach et al., 1996; Flickner et al., 1995; Pentland et al., 1995) to the region-based approach (Carson et al., 1997; Li et al., 2000; Smith & Chang, 1996). In order to bridge

the semantic gap (Smeulders et al., 2000) that exists between computed perceptual visual features and conceptual user query expectation, detecting semantic objects (e.g., faces, sky, foliage, buildings, etc.) based on trained pattern classifiers has been an active trend (Naphade et al., 2003; Town & Sinclair, 2000). The MiAlbum system uses relevance feedback (Lu et al., 2000) to produce annotations for consumer photos: the text keywords in a query are assigned to positive feedback examples (i.e., retrieved images that are considered relevant by the user who issues the query). This requires constant user intervention (in the form of relevance feedback), and the keywords issued in a query might not necessarily correspond to what is considered relevant in the positive examples. As an indirect annotation, the annotation process is slow and inconsistent between users. There is also the problem of small sampling in retrieval using relevance feedback: the small number of samples would not have statistical significance. Learning with feedback is not stable due to the inconsistency in users' feedback, and the similarity will also vary when people use it for different applications. Town and Sinclair (2000) use a semantic labeling approach. An image is segmented into regular non-overlapping regions, and each region is classified into visual categories of outdoor scenes by neural networks. Similarity between a query and an image is computed as either the sum over all grids of the Euclidean distance between classification vectors, or the cosine of their correlation. The evaluation was carried out on more than 1,000 Corel Photo Library images and about 500 home photos, and better classification and retrieval results were obtained for the professional Corel images. In a leading effort by the IBM (International Business Machines, Inc.) research group to design and detect 34 visual concepts (both objects and sites) in the TREC 2002 benchmark corpus (www.nlpir.nist.gov/projects/trecvid/), support vector machines are trained on segmented regions in key frames using various color and texture features (Naphade et al., 2003; Naphade & Smith, 2003). Recently the vocabulary has been extended to include 64 visual concepts for the TREC 2003 news video corpus (Amir et al., 2003). Several months of effort were devoted by the TREC participants to the manual labeling of the training samples using the VideoAnnEx annotation tool (Lin et al., 2003). However, highly accurate segmentation of objects is a major bottleneck except in selected narrow domains where a few dominant objects are recorded against a clear background (Smeulders et al., 2000, p. 1360). The challenge of object segmentation is acute for polysemic images in broad domains such as unconstrained consumer images; the interpretation of such scenes is usually not unique, as the scenes may have numerous conspicuous objects, some with unknown object classes (Smeulders et al., n.d.). Our Semantic Region Indexing (SRI) scheme addresses the issue of local region classification differently. We have also adopted statistical learning to extract local semantics in image content, though our detection-based approach does not rely on region segmentation. In addition, our innovation lies in the reconciliation of multiscale view-based object detection maps and the spatial aggregation of soft semantic histograms as the image content signature.
Our local semantic interpretation scheme can also be viewed as a systematic extension of the signs designed for domain-specific applications (Smeulders et al., 2000, p. 1359) and the visual keywords built for explicit query specification (Lim, 2001).



Image classification is another approach to bridging the semantic gap that has received more attention lately (Bradshaw, 2000; Lipson et al., 1997; Szummer & Picard, 1998; Vailaya et al., 2001). In particular, efforts to classify photos based on content have been devoted to indoor versus outdoor (Bradshaw, 2000; Szummer & Picard, 1998), natural versus man-made (Bradshaw, 2000; Vailaya et al., 2001), and categories of natural scenes (Lipson et al., 1997; Vailaya et al., 2001). In general, the classifications were made on low-level features such as color, edge directions, and so forth. Vailaya et al. presented the most comprehensive coverage of the problem by dealing with a hierarchy of eight categories (plus three others) progressively, with separately designed features. The vacation photos used in their experiments are a mixture of Corel photos, personal photos, video key frames, and photos from the Web.

A natural and useful insight is to formulate image retrieval as a classification problem. In very general terms, the goal of image retrieval is to return images of a class C that the user has in mind, based on a set of features x computed for each image in the database. In a probabilistic sense, the system should return images ranked by descending return status value P(C|x), however C may be defined.

Under this general formulation, several approaches have emerged. A Bayesian formulation to minimize the probability of retrieval error (i.e., the probability of wrong classification) was proposed by Vasconcelos and Lippman (2000) to drive the selection of color and texture features and to unify similarity measures with the maximum likelihood criterion. Similarly, in an attempt to classify indoor/outdoor and natural/man-made images, a Bayesian approach was used to combine class likelihoods resulting from multiresolution probabilistic class labels (Bradshaw, 2000); the class likelihoods were estimated from local average color information and complex wavelet transform coefficients. In a different way, Aksoy and Haralick (2002) as well as Wu et al. (2000) considered a two-class problem with only a relevance class and an irrelevance class. A two-level classification framework was proposed by Aksoy and Haralick: image feature vectors were first mapped to two-dimensional class-conditional probabilities based on simple parametric models; linear classifiers were then trained on these probabilities, and their classification outputs were combined to rank images for retrieval. From a different motivation, the image retrieval problem was cast as a transductive learning problem by Wu et al. to include an unlabeled data set for training the image classifier. In particular, a new discriminant-EM algorithm was proposed to generalize the mapping function learned from the labeled training data to a specific unlabeled data set. The algorithm was evaluated on a small database (134 images) of seven classes using 12 labeled images in the form of relevance feedback.

This classification approach has also been popular in specific domains. Medical images have been grouped by pathological classes for diagnostic purposes (Brodley et al., 1999) or by imaging modalities for visualization purposes (Mojsilovic & Gomes, 2002). In the case of facial images (Moghaddam et al., 1998), intrapersonal and extrapersonal classes of variation between two facial images were modeled. The similarity between the image intensities of two facial images was then expressed as a probabilistic measure in terms of the intrapersonal and extrapersonal class likelihoods and priors using a Bayesian formulation.
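To make the retrieval-as-classification view concrete, the following sketch ranks a database by the posterior P(C|x) of a desired class. It is only an illustration of the general idea, not the formulation of any of the cited systems: the diagonal-Gaussian class models, the function names, and the random stand-in features are our own assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal sketch of retrieval as classification: rank database images by the
# posterior P(C|x) of a desired class C, assuming (for illustration only) that
# class-conditional densities are diagonal Gaussians fit to labeled examples.

def fit_gaussian(features):
    """Fit a diagonal Gaussian to the feature vectors of one class."""
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6          # avoid a singular covariance
    return multivariate_normal(mean=mean, cov=np.diag(var))

def rank_by_posterior(db_features, class_models, priors, target_class):
    """Return database indices sorted by descending P(target_class | x)."""
    likelihoods = np.stack([m.pdf(db_features) for m in class_models], axis=1)
    joint = likelihoods * np.asarray(priors)    # P(x|C_k) P(C_k)
    posterior = joint[:, target_class] / joint.sum(axis=1)
    return np.argsort(-posterior)

# Toy usage with random vectors standing in for real image descriptors.
rng = np.random.default_rng(0)
relevant = rng.normal(0.0, 1.0, size=(40, 8))
irrelevant = rng.normal(2.0, 1.0, size=(40, 8))
models = [fit_gaussian(relevant), fit_gaussian(irrelevant)]
database = rng.normal(1.0, 1.5, size=(200, 8))
ranking = rank_by_posterior(database, models, priors=[0.5, 0.5], target_class=0)
print(ranking[:10])
```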



Image classification or class-based retrieval approaches are adequate for query by predefined image class. However, the set of relevant images R may not, in general, correspond to any predefined class C. In our Class Relative Indexing (CRI) scheme, image classification is not the end but a means to compute interclass semantic image indexes for similarity-based matching and retrieval.

While supervised pattern classifiers allow the design of image semantics (local object classes or global scene classes), a major drawback of the supervised learning paradigm is the human effort required to provide labeled training samples, especially at the image region level. Lately, two promising trends have attempted to achieve semantic indexing of images with minimal or no manual annotation effort (i.e., semisupervised or unsupervised learning). In the field of computer vision, researchers have developed object recognition systems from unlabeled and unsegmented images (Fergus et al., 2003; Selinger & Nelson, 2001; Weber et al., 2000). In the context of relevance feedback, unlabeled images have also been used to bootstrap learning from very limited labeled examples (Wang et al., 2003; Wu et al., 2000). For the purpose of image retrieval, unsupervised models based on generic texture-like descriptors without explicit object semantics can also be learned from images without manual extraction of objects or features (Schmid, 2001). As a representative of the state of the art, a sophisticated generative probabilistic model has been proposed to represent, learn, and detect object parts, locations, scales, and appearances from fairly cluttered scenes, with promising results (Fergus et al., 2003). Motivated by a machine translation perspective, object recognition has been posed as a lexicon learning problem that translates image regions to corresponding words (Duygulu et al., 2002). More generally, the joint distribution of meaningful text descriptions and entire or local image contents can be learned from images or categories of images labeled with a few words (Barnard & Forsyth, 2001; Barnard et al., 2003b; Kutics et al., 2003; Li & Wang, 2003). The lexicon learning metaphor offers a new way of looking at object recognition (Duygulu et al., 2002) and a powerful means to annotate entire images with concepts evoked by what is visible in the image and specific words (e.g., fitness, holiday, Paris, etc.) (Li & Wang, 2003). While the results for the annotation problem on entire images look promising (Li & Wang, 2003), the correspondence problem of associating words with segmented image regions remains very challenging (Barnard et al., 2003b), as segmentation, feature selection, and shape representation are critical and nontrivial choices (Barnard et al., 2003a).

Our Pattern Discovery Indexing (PDI) scheme addresses the issue of minimal supervision differently. We do not assume the availability of text descriptions for images or image classes, as Barnard et al. (2003b) and Li and Wang (2003) do, nor do we assume knowledge of the object classes to be recognized, as Fergus et al. (2003) do. We discover and associate local unsegmented regions with semantics and generate their samples to construct models for content-based image retrieval, all with minimal manual intervention. This is realized as a novel three-stage hybrid framework that interleaves supervised and unsupervised classifications.



USING PATTERN CLASSIFIERS FOR SEMANTIC INDEXING


Semantic Region Indexing
One of the goals of content-based image retrieval is semantic interpretation (Smeulders et al., 2000, p. 1361). To realize strong semantic interpretation of content, we propose the use of classifications of local image regions, and their statistical aggregates, as the image index. In this chapter, we adopt statistical learning to systematically derive these semantic support regions (SSRs) prior to image indexing. During indexing, the SSRs are detected from multiscale block-based image regions, as inspired by the multiresolution view-based object recognition framework (Papageorgiou et al., 1998; Sung & Poggio, 1998), hence without a region segmentation step. The key in image indexing here is not to record the primitive feature vectors themselves but to project them into a classification space spanned by semantic labels and use the soft classification decisions as the local indexes for further aggregation. Indeed, the late K.K. Sung also constructed six face clusters and six nonface clusters and used the distances between the feature vector of a local image block and these clusters, rather than the feature vector directly, as the input to the trained face detector (Sung & Poggio, 1998).

To compute the SSRs from training instances, we use support vector machines on suitable features of a local image patch; we denote this feature vector as z. A support vector classifier S_i is a detector for SSR i on z. The classification vector T for region z can be computed via the softmax function (Bishop, 1995) as

$$ T_i(z) = \frac{\exp S_i(z)}{\sum_j \exp S_j(z)} \qquad (1) $$

As each support vector machine is regarded as an expert on an SSR class, the outputs S_i(z) are set to 0 if there exists some S_j, j ≠ i, with a positive output. As we are dealing with heterogeneous consumer photos, we adopt color and texture features to characterize SSRs. A feature vector z has two parts, namely a color feature vector z_c and a texture feature vector z_t. For the color feature, we compute the mean and standard deviation of each color channel (i.e., z_c has six dimensions). We use the YIQ color space over other color spaces, as it performed better in our experiments. For the texture feature, we adopted the Gabor coefficients (Manjunath & Ma, 1996); the means and standard deviations of the Gabor coefficients (five scales and six orientations) in an image block are computed as z_t (60 dimensions). Zero-mean normalization (Ortega et al., 1997) was applied to both the color and texture features. In this chapter, we adopted polynomial kernels with a modified dot product similarity measure between feature vectors y and z,

$$ y \cdot z = \frac{1}{2} \left( \frac{y_c \cdot z_c}{|y_c|\,|z_c|} + \frac{y_t \cdot z_t}{|y_t|\,|z_t|} \right) \qquad (2) $$
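As a minimal, hedged sketch of Equations 1 and 2, the following Python fragment turns hypothetical SVM decision values into a soft SSR classification vector and computes the combined color/texture similarity between two 66-dimensional patch features (6 color + 60 texture dimensions, as in the chapter). The detector outputs and feature values are placeholders, and the zeroing rule described in the text is assumed to have been applied to the raw outputs beforehand.

```python
import numpy as np

# Minimal sketch of Equations 1 and 2. The per-class SVM decision values S_i(z)
# are assumed to be given (e.g., from 26 trained SSR detectors).

def ssr_classification_vector(svm_outputs):
    """Softmax of Equation 1: turn SVM decision values into soft class labels."""
    e = np.exp(svm_outputs - svm_outputs.max())   # shift for numerical stability
    return e / e.sum()

def modified_dot_product(y, z, color_dims=6):
    """Equation 2: average of the cosine similarities of the color and texture parts."""
    yc, yt = y[:color_dims], y[color_dims:]
    zc, zt = z[:color_dims], z[color_dims:]
    cos = lambda a, b: float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return 0.5 * (cos(yc, zc) + cos(yt, zt))

# Toy usage: 26 hypothetical detector outputs and two 66-dimensional patch features.
rng = np.random.default_rng(1)
T = ssr_classification_vector(rng.normal(size=26))
print(T.sum())                                    # soft labels sum to 1
y, z = rng.normal(size=66), rng.normal(size=66)
print(modified_dot_product(y, z))
```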



To detect SSRs with translation and scale invariance in an image to be indexed, the image is scanned with windows of different scales, following the strategy in view-based object detection (Papageorgiou et al., 1998). In our experiments, we progressively increase the window size from 20×20 to 60×60 in steps of 10 pixels, on a 240×360 size-normalized image. That is, after this detection step, we have five detection maps. To reconcile the detection maps across different resolutions onto a common basis, we adopt the following principle: if the most confident classification of a region at resolution r is less than that of a larger region (at resolution r + 1) that subsumes it, then the classification output of the region is replaced by that of the larger region at resolution r + 1. Using this principle, we start the reconciliation from the detection map based on the largest scan window (60×60) down to the detection map based on the next-to-smallest scan window (30×30). After four cycles of reconciliation, the detection map based on the smallest scan window (20×20) has consolidated the detection decisions obtained at all resolutions.

Suppose a region Z is composed of n small equal regions with feature vectors z_1, z_2, ..., z_n, respectively. To account for the size of detected SSRs in the spatial area Z, the SSR classification vectors of the reconciled detection map are aggregated as

$$ T_i(Z) = \frac{1}{n} \sum_k T_i(z_k) \qquad (3) $$
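The following simplified sketch illustrates the reconciliation and aggregation steps just described. It assumes detection maps that share the same grid of scan positions, which is a simplification of the chapter's subsumption rule across window sizes; the grid size, the number of SSRs, and the random soft labels are illustrative only.

```python
import numpy as np

# Simplified sketch of detection-map reconciliation and the aggregation of
# Equation 3. A detection map has shape (rows, cols, n_ssr) and holds the soft
# classification vectors T_i(z) at each scan position.

def reconcile(small_map, large_map):
    """Replace a block's soft labels by the larger window's labels whenever the
    larger window is classified more confidently (higher maximum response)."""
    out = small_map.copy()
    replace = large_map.max(axis=-1) > small_map.max(axis=-1)
    out[replace] = large_map[replace]
    return out

def aggregate(det_map, row_slice, col_slice):
    """Equation 3: average the soft classification vectors over a spatial area Z."""
    block = det_map[row_slice, col_slice]             # (r, c, n_ssr)
    return block.reshape(-1, block.shape[-1]).mean(axis=0)

# Toy usage: five hypothetical detection maps (window sizes 20..60) on a 23x35
# grid of scan positions, each cell holding a 26-dimensional soft label vector.
rng = np.random.default_rng(2)
maps = [rng.dirichlet(np.ones(26), size=(23, 35)) for _ in range(5)]
final = maps[0]
for larger in maps[1:]:                               # fold coarser maps into the finest one
    final = reconcile(final, larger)
T_Z = aggregate(final, slice(0, 8), slice(0, 8))      # one tessellated block
print(T_Z.shape, round(float(T_Z.sum()), 3))
```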

For Query by Example (QBE), the content-based similarity λ between a query q and an image x can be computed in terms of the similarity between their corresponding local tessellated blocks. For example, the similarity based on the L1 distance measure (city block distance) between a query q with m local blocks Y_j and an image x with m local blocks Z_j is defined as

$$ \lambda(q, x) = 1 - \frac{1}{2m} \sum_j \sum_i \left| T_i(Y_j) - T_i(Z_j) \right| \qquad (4) $$

This is equivalent to histogram intersection (Swain & Ballard, 1991), with further averaging over the number of local histograms m, except that the bins have semantic interpretations as SSRs. There is a trade-off between content symmetry and spatial specificity. If we want images of similar semantics but different spatial arrangement (e.g., mirror images) to be treated as similar, we can use larger tessellated blocks (i.e., similar to a global histogram). However, in applications where spatial locations are considered differentiating, local histograms provide good sensitivity to spatial specificity. Furthermore, we can attach different weights to the blocks (i.e., Y_j, Z_j) to emphasize the focus of attention (e.g., the center). In this chapter, we report experimental results based on even weights, as grid tessellation is used. We also attempted various similarity and distance measures (e.g., cosine similarity, L2 distance, Kullback-Leibler (KL) distance, etc.), and the simple city block distance in Equation 4 had the best performance.
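A small sketch of Equation 4, assuming the query and the database image have already been indexed as m local SSR histograms:

```python
import numpy as np

# Sketch of Equation 4: content-based similarity between a query and a database
# image, each indexed as m local SSR histograms (one per tessellated block).

def sri_similarity(query_index, image_index):
    """query_index, image_index: arrays of shape (m, n_ssr) of aggregated T_i(Z_j)."""
    m = query_index.shape[0]
    l1 = np.abs(query_index - image_index).sum()      # sum over blocks j and bins i
    return 1.0 - l1 / (2.0 * m)

# Toy usage with 3x3 = 9 blocks and 26 SSR bins per block.
rng = np.random.default_rng(3)
q = rng.dirichlet(np.ones(26), size=9)
x = rng.dirichlet(np.ones(26), size=9)
print(sri_similarity(q, q))    # identical indexes give similarity 1.0
print(sri_similarity(q, x))
```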



Figure 2. Examples of semantic support regions shown in top-down, left-to-right order: people (face, figure, crowd, skin), sky (clear, cloudy, blue), ground (floor, sand, grass), water (pool, pond, river), foliage (green, floral, branch), mountain (far, rocky), building (old, city, far), interior (wall, wooden, china, fabric, light)

Table 1. Training statistics of the 26 SSR classes

                   min.    max.    avg.
num. pos. trg.       5      26     14.4
num. sup. vec.       9      66     33.3
num. pos. test       3      13      6.9
num. errors          0      14      5.7
error (%)            0       7.8    3.2

Note that we have presented the features, distance measures, window sizes, and so on, of SSR detection in concrete forms to facilitate understanding. The SSR methodology is generic and flexible enough to adapt to other application domains. For the data set and experiments reported in this chapter, we have designed 26 classes of SSRs (i.e., S_i, i = 1, 2, ..., 26 in Equation 1), organized into eight superclasses as illustrated in Figure 2. We cropped 554 image regions from 138 images and used 375 of them (from 105 images) as training data for support vector machines to compute the support vectors of the SSRs, keeping the remaining one-third for validation. Among all the kernels evaluated, those with the better generalization results on the validation set were used for the indexing and retrieval tasks. A polynomial kernel with degree 2 and constant 1 (C = 100) (Joachims, 1999) produced the best precision and recall, and hence was adopted in the rest of our experiments.

Table 1 lists the training statistics of the 26 SSR classes. The columns show, left to right, the minimum, maximum, and average of the number of positive training examples (from a total of 375), the number of support vectors computed from the training examples, the number of positive test examples (from a total of 179), the number of misclassified examples on the 179-image test set, and the percentage of error on the test set. The negative training (test) examples for an SSR class are the union of the positive training (test) examples of the other 25 classes. The minimum numbers of positive training and test examples are from the Interior:Wooden SSR, while the maximum numbers are from the People:Face class. The minimum and maximum numbers of support vectors are associated with the Sky:Clear and Building:Old SSRs, respectively. The SSR with the best generalization is the Interior:Wooden class, and the worst test error belongs to the Building:Old class.
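For readers who want to reproduce the detector training, the following hedged sketch trains one binary SSR detector with scikit-learn rather than the SVMlight package used in the chapter. The callable kernel applies a degree-2 polynomial (constant 1) to the modified dot product of Equation 2; the synthetic positive and negative blocks stand in for the cropped training regions.

```python
import numpy as np
from sklearn.svm import SVC

# Hedged sketch of SSR detector training (scikit-learn stand-in for SVMlight).
# The callable kernel applies a degree-2 polynomial to the Equation-2 similarity
# over 6 color + 60 texture dimensions.

COLOR_DIMS = 6

def modified_dot_matrix(X, Y):
    """Pairwise Equation-2 similarity between the rows of X and Y."""
    def split_norm(A):
        Ac, At = A[:, :COLOR_DIMS], A[:, COLOR_DIMS:]
        return (Ac / (np.linalg.norm(Ac, axis=1, keepdims=True) + 1e-12),
                At / (np.linalg.norm(At, axis=1, keepdims=True) + 1e-12))
    Xc, Xt = split_norm(X)
    Yc, Yt = split_norm(Y)
    return 0.5 * (Xc @ Yc.T + Xt @ Yt.T)

def poly2_kernel(X, Y):
    return (modified_dot_matrix(X, Y) + 1.0) ** 2     # degree 2, constant 1

# Toy usage: one binary detector (one SSR class vs. the union of the others).
rng = np.random.default_rng(4)
X_pos = rng.normal(0.5, 1.0, size=(20, 66))
X_neg = rng.normal(-0.5, 1.0, size=(60, 66))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [0] * 60)
detector = SVC(kernel=poly2_kernel, C=100.0)
detector.fit(X, y)
print(detector.decision_function(X[:3]))              # raw S_i(z) decision values
```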



Class Relative Indexing

When we are dealing with QBE, the set of relevant images R is obscure, and a query example q only provides a glimpse into it. In fact, the set of relevant images R does not exist until a query has been specified. However, to anchor the query context, we can define prior image classes C_k, k = 1, 2, ..., M, as prototypical instances of the relevance class R and compute the relative memberships of query q to these classes. Similarly, we can compute the interclass index for any database image x. These interclass memberships allow us to compute a form of categorical similarity between q and x (see Equation 7).

In this chapter, as our test images are consumer photos, we design a taxonomy for consumer photos as shown in Figure 3. This hierarchy is more comprehensive than that addressed by Vailaya et al. (2001); in particular, we consider subcategories for indoor and city scenes as well as more common subcategories for nature. We select the seven disjoint categories represented by the leaf nodes (except the miscellaneous category) in Figure 3 as semantic support classes (SSCs) to model the categorical context of relevance. That is, we trained seven binary SVMs C_k, k = 1, 2, ..., 7, on these categories: interior or objects indoor (inob), people indoor (inpp), mountain and rocky area (mtrk), parks or gardens (park), swimming pool (pool), street scene (strt), and waterside (wtsd). Using the softmax function (Bishop, 1995), the classification output R_k for an image x is computed as

$$ R_k(x) = \frac{\exp C_k(x)}{\sum_j \exp C_j(x)} \qquad (5) $$

The feature vector of an image for classification is the SRI image index, that is, T_i(Z_j) for all i, j, as described above.

Figure 3. Proposed taxonomy for consumer photos. The seven disjoint categories (the leaf nodes except miscellaneous) are selected as semantic support classes to model the categorical context of relevance.



Table 2. Statistics related to SSC learning (left-to-right): SSC class labels, numbers of positive training examples (p-train), numbers of positive test examples (p-test), numbers of support vectors computed (sv), and the classification rate (rate) on the entire 2400 collection.
SSC     p-train   p-test    sv    rate
inob        27       107   136    95.7
inpp       172       688   234    85.1
mtrk        13        54   116    98.0
park        61       243   158    92.4
pool        10        42    72    98.7
strt       129       516   259    84.4
wtsd        30       120   151    95.3

To be consistent with the SSR training, we adopted polynomial kernels and the similarity measure between image indexes u = T_i(Y_j) and v = T_i(Z_j) given by

$$ u \cdot v = \frac{1}{m} \sum_j \frac{\sum_i T_i(Y_j)\, T_i(Z_j)}{\sqrt{\sum_k T_k(Y_j)^2}\, \sqrt{\sum_k T_k(Z_j)^2}} \qquad (6) $$

The similarity between a query q and an image x is computed as

$$ \lambda(q, x) = 1 - \frac{1}{2} \sum_k \left| R_k(q) - R_k(x) \right| \qquad (7) $$

As in the SSR training, the support vector machines were trained using a polynomial kernel with degree 2 and constant 1 (C = 100) (Joachims, 1999). For each class, a human subject was asked to define the list of ground truth images from the 2,400-image collection, and 20% of the list was used for training. To ensure unbiased training samples, we generated 10 different sets of positive training samples from the ground truth list for each class based on a uniform random distribution. The negative training (test) examples for a class are the union of the positive training (test) examples of the other six classes and the miscellaneous class. The classifier training for each class was carried out 10 times on these different training sets, and the support vector classifier of the best run was retained. Table 2 lists the statistics related to SSC learning. The miscellaneous class (not shown in the table) has 171 images, which include dark scenes and images of bad quality.
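A minimal sketch of Class Relative Indexing at query time, assuming the seven SSC decision values are already available; only Equations 5 and 7 are exercised here, with random decision values as placeholders.

```python
import numpy as np

# Sketch of Class Relative Indexing (Equations 5 and 7): turn the decision values
# of the seven SSC classifiers into an interclass membership vector, then compare
# two images through the L1 distance between their membership vectors.

def class_relative_index(ssc_outputs):
    """Equation 5: softmax of the SSC decision values C_k(x)."""
    e = np.exp(ssc_outputs - np.max(ssc_outputs))     # shifted for stability
    return e / e.sum()

def cri_similarity(Rq, Rx):
    """Equation 7: one minus half the L1 distance between membership vectors."""
    return 1.0 - 0.5 * np.abs(Rq - Rx).sum()

# Toy usage with hypothetical decision values for a query and a database image.
rng = np.random.default_rng(5)
Rq = class_relative_index(rng.normal(size=7))
Rx = class_relative_index(rng.normal(size=7))
print(cri_similarity(Rq, Rq), cri_similarity(Rq, Rx))
```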

Pattern Discovery Scheme


The Pattern Discovery Indexing (PDI) scheme is a semisupervised framework for discovering local semantic patterns and generating their samples for training with minimal human intervention. Image classifiers are first trained on local image blocks from a small number of labeled images.



Local semantic patterns are then discovered by clustering the image blocks with high classification output. Training samples are induced from the cluster memberships for support vector learning to form local semantic pattern detectors. An image is then indexed as a tessellation of local semantic histograms and matched using histogram intersection, as in the SRI scheme.

Given an application domain, some typical classes C_k with their image samples are identified. The training samples are tessellated image blocks z from the class samples. After learning, the class models will have captured the local class semantics, and a high SVM output (i.e., C_k(z) ≫ 0) suggests that the local region z is typical of the semantics of class k. With the help of the learned class models C_k, we can generate sets X_k of local image regions that characterize the class semantics (which in turn capture the semantics of the content domain) as

$$ X_k = \{\, z \mid C_k(z) > \theta \,\}, \quad \theta \ge 0 \qquad (8) $$

However, the local semantics hidden in each X_k are opaque and possibly multimodal. We would like to discover the multiple groupings in each class by unsupervised learning, such as Gaussian mixture modeling or fuzzy c-means clustering. The result of the clustering is a collection of partitions m_kj, j = 1, 2, ..., N_k, in the space of local semantics for each class, where the m_kj are usually represented as cluster centers and N_k is the number of partitions for each class. Once we have obtained the typical semantic partitions for each class, we can learn the models of Discovered Semantic Regions (DSRs) S_i, i = 1, 2, ..., N, where N = Σ_k N_k (i.e., we linearize the ordering of m_kj as m_i). We label a local image block x ∈ ∪_k X_k as a positive example for S_i if it is closest to m_i, and as a negative example for S_i otherwise:

$$ X_i^+ = \{\, x \mid i = \arg\min_t |x - m_t| \,\} \qquad (9) $$

$$ X_i^- = \{\, x \mid i \ne \arg\min_t |x - m_t| \,\} \qquad (10) $$

where |·| is some distance measure. We can then perform supervised learning again on X_i^+ and X_i^-, using, say, support vector machines S_i(x), as the DSR models. To visualize a DSR S_i, we can display the image block s_i that is most typical among those assigned to cluster m_i belonging to class k:

$$ C_k(s_i) = \max_{x \in X_i^+} C_k(x) \qquad (11) $$
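The three-stage bootstrapping of Equations 8 to 11 can be sketched as follows. This is not the chapter's implementation: plain k-means stands in for fuzzy c-means, generic 66-dimensional vectors stand in for the color/texture block features, and the threshold and cluster counts are arbitrary; only the structure of the procedure is meant to be illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

# Hedged sketch of the PDI bootstrapping steps (Equations 8-11). Plain k-means
# replaces the fuzzy c-means used in the chapter; the numbers are placeholders.

def discover_dsr_models(blocks_per_class, class_svms, theta=0.0, clusters_per_class=2):
    # Equation 8: keep only the blocks each class model considers typical.
    typical = [b[class_svms[k].decision_function(b) > theta]
               for k, b in enumerate(blocks_per_class)]
    # Cluster each typical set; collect all cluster centers m_i.
    centers = []
    for t in typical:
        km = KMeans(n_clusters=clusters_per_class, n_init=10, random_state=0).fit(t)
        centers.append(km.cluster_centers_)
    centers = np.vstack(centers)                       # linearized m_1 ... m_N
    # Equations 9-10: label every typical block by its nearest center.
    data = np.vstack(typical)
    nearest = np.argmin(np.linalg.norm(data[:, None] - centers[None], axis=2), axis=1)
    # Train one DSR detector S_i per center (one-vs-rest on the induced labels).
    dsr_models = [SVC(kernel="poly", degree=2, coef0=1.0, C=100.0).fit(data, nearest == i)
                  for i in range(len(centers))]
    return centers, dsr_models

# Toy usage: two "classes" of synthetic blocks and pre-trained class SVMs.
rng = np.random.default_rng(6)
blocks = [rng.normal(c, 1.0, size=(120, 66)) for c in (-1.0, 1.0)]
X = np.vstack(blocks)
y = np.concatenate([np.zeros(120), np.ones(120)])
class_svms = [SVC(kernel="poly", degree=2, coef0=1.0, C=100.0).fit(X, y == k) for k in (0, 1)]
centers, dsr_models = discover_dsr_models(blocks, class_svms)
print(len(dsr_models), centers.shape)
```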

For the consumer images used in our experiments, we make use of the same seven disjoint categories represented by the leaf nodes (except the miscellaneous category) in Figure 3. The same color and texture features, as well as the modified dot product similarity measure used in the supervised learning framework (Equation 2), are adopted for the support vector classifier training with polynomial kernels (degree 2, constant 1, C = 100) (Joachims, 1999).



Table 3. Training statistics of the semantic classes Ck for bootstrapping local semantics. The columns (left to right) list the class labels, the size of ground truth, the number of training images, the number of support vectors learned, the number of typical image blocks subject to clustering (Ck(z) > 2), and the number of clusters assigned.
Class    G.T.   #trg    #SV   #data   #clus
inob      134     15   1905    1429       4
inpp      840     20   2249     936       5
mtrk       67     10   1090    1550       2
park      304     15    955     728       4
pool       52     10   1138    1357       2
strt      645     20   2424     735       5
wtsd      150     15   2454     732       4

The training samples are 60×60 image blocks (tessellated with 20-pixel steps in both directions) from 105 sample images; hence, each SVM was trained on 16,800 image blocks. After training, the samples from each class k are fed into classifier C_k to test their typicality. Those samples with SVM output C_k(z) > 2 (Equation 8) are subjected to fuzzy c-means clustering. The number of clusters assigned to each class is roughly proportional to the number of training images in the class. Table 3 lists the training statistics for these semantic classes: inob (indoor interior/objects), inpp (indoor people), mtrk (mountain/rocks), park (park/garden), pool (swimming pool), strt (street), and wtsd (waterside). Hence, we have 26 DSRs in total. To build the DSR models, we trained 26 binary SVMs with polynomial kernels (degree 2, constant 1, C = 100) (Joachims, 1999), each on 7,467 positive and negative examples (Equations 9 and 10) (i.e., the sum of the #data column of Table 3). To visualize the 26 DSRs that have been learned, we compute the most typical image block for each cluster (Equation 11) and concatenate their appearances in Figure 4.

Figure 4. Most typical image blocks of the DSRs learned (left to right): china utensils and cupboard top (first four) for the inob class; faces with different background and body close-up (next five) for the inpp class; rocky textures (next two) for the mtrk class; green foliage and flowers (next four) for the park class; pool side and water (next two) for the pool class; roof top, building structures, and roadside (next five) for the strt class; and beach, river, pond, far mountain (next four) for the wtsd class.



EXPERIMENTAL RESULTS
Dataset and Queries
In this chapter, we evaluate the SRI, CRI, and PDI schemes on 2,400 unconstrained consumer photos. These genuine consumer photos were taken over five years in several countries, in both indoor and outdoor settings. The images are those of the smallest resolution (i.e., 256×384) from Kodak PhotoCDs, in both portrait and landscape layouts. After removing possibly noisy marginal pixels, the images are of size 240×360. Figure 5 displays typical photos in this collection. As a matter of fact, this genuine consumer photo collection includes photos of bad quality (e.g., faded, over- and underexposed, blurred, etc.) (Figure 6). We retained them in our test to reflect the complexity of the original data. The indexing process automatically detects the layout and applies the corresponding tessellation template.

We defined 16 semantic queries and their ground truths (G.T.) among the 2,400 photos (Table 4). Figure 5 shows, in top-down, left-to-right order, two relevant images for each of the queries Q01-Q16. As these sample images indicate, the relevant images for any query considered here exhibit highly varied and complex visual appearance. Hence, to represent each query, we selected three relevant photos as query examples for our experiments, because a single query image is far from sufficient to capture the semantics of a query; indeed, single query images resulted in poor precision and recall in our initial experiments. The precisions and recalls were computed without the query images themselves in the lists of retrieved images.

Figure 5. Sample consumer photos from the 2,400-image collection. They also show, in top-down, left-to-right order, two relevant images for each of the 16 queries used in our experiments.

Figure 6. Some consumer photos of bad quality



Table 4. Semantic queries used in QBE experiments


Query   Description                G.T.
Q01     indoor                      994
Q02     outdoor                    1218
Q03     people close-up             277
Q04     people indoor               840
Q05     interior or object          134
Q06     city scene                  697
Q07     nature scene                521
Q08     at a swimming pool           52
Q09     street or roadside          645
Q10     along waterside             150
Q11     in a park or garden         304
Q12     at mountain area             67
Q13     buildings close-up          239
Q14     people close up, indoor      73
Q15     small group, indoor         491
Q16     large group, indoor          45

When a query has multiple examples, q = {q_1, q_2, ..., q_K}, the similarity λ(q, x) for any database image x is computed as

$$ \lambda(q, x) = \max_i \lambda(q_i, x) \qquad (12) $$
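A small sketch of Equation 12, assuming precomputed index vectors and any of the similarity functions above (here the L1-based measure of Equation 7 is used as a stand-in):

```python
import numpy as np

# Sketch of Equation 12: with several query examples, an image's score is its best
# similarity to any of them; the database is then ranked by this score.

def rank_database(query_indexes, db_indexes, similarity):
    scores = np.array([max(similarity(q, x) for q in query_indexes) for x in db_indexes])
    return np.argsort(-scores), scores

# Toy usage with the L1-based similarity of Equation 7 on 7-dimensional CRI indexes.
sim = lambda a, b: 1.0 - 0.5 * np.abs(a - b).sum()
rng = np.random.default_rng(7)
queries = rng.dirichlet(np.ones(7), size=3)            # three query examples
database = rng.dirichlet(np.ones(7), size=50)
order, scores = rank_database(queries, database, sim)
print(order[:5], scores[order[:5]])
```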

Results and Comparison


In this chapter, we compare our proposed indexing schemes (denoted SRI, CRI, and PDI) with a feature-based approach that combines color and texture in a linearly optimal way (denoted CTO). For each approach, we conducted experiments with various system parameters and selected the best performances. We looked at both the overall average precision (denoted P_avg) and the average precision over the top 30 retrieved images (denoted P_30), averaged over the 16 queries, to select the best performances. The choices of system parameters are described below, before the comparison of the best results.

For the color-based signature, both global and local (4×4 grid) color histograms with b^3 bins (b = 4, 5, ..., 17) in the RGB color space were computed on an image. In the case of global color histograms, the performance saturated at 4096 (b = 16) and 4913 (b = 17) bins with P_avg = 0.36 and P_30 = 0.58; hence, the one with fewer bins was preferred. Among the local color histograms attempted, the one with 2197 bins (b = 13) gave the best precisions, with P_avg = 0.36 and P_30 = 0.58. Histogram intersection (Swain & Ballard, 1991) was used to compare two color histograms. For the texture-based signature, we adopted the means and standard deviations of Gabor coefficients and the associated distance measure as reported in Manjunath and Ma (1996). The Gabor coefficients were computed with five scales and six orientations.


Convolution windows of 20×20, 30×30, ..., 60×60 were attempted. Similarly, we experimented with both global and local (4×4 grid) signatures. The best results were obtained when 20×20 windows were used: P_avg = 0.25 and P_30 = 0.30 for global signatures, and P_avg = 0.24 and P_30 = 0.38 for local signatures. These results, inferior to those of the color histograms, lead us to conclude that a simple statistical texture descriptor is less effective than a color histogram for heterogeneous consumer image content. The distance measures between a query and an image for the color and texture methods were normalized within [0, 1] and combined linearly (α ∈ [0, 1]):

$$ \lambda(q, x) = \alpha\, \lambda_c(q, x) + (1 - \alpha)\, \lambda_t(q, x) \qquad (13) $$

where λ_c and λ_t are the similarities based on color and texture features, respectively. Among the relative weights attempted at 0.1 intervals, the best fusion for global signatures was obtained at P_avg = 0.38 and P_30 = 0.61 with equal color and texture influence. For local signatures, the fusion peaked when the local color histograms were given a dominant weight of 0.9, resulting in P_avg = 0.38 and P_30 = 0.59. The precision/recall curves (averaged over the 16 queries) in Figure 7 illustrate the precisions at various recall values for the four methods compared. All three proposed indexing schemes outperformed the feature-based fusion approach.
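The weight sweep described above can be sketched as follows; the similarity values and the evaluation criterion are placeholders for the real per-query measurements and precision computation.

```python
import numpy as np

# Sketch of the Equation 13 fusion: normalized color and texture similarities are
# combined with weight alpha, and alpha is swept at 0.1 intervals as in the text.
# The similarity values below are random placeholders for real measurements.

def fuse(sim_color, sim_texture, alpha):
    return alpha * sim_color + (1.0 - alpha) * sim_texture

def sweep_alpha(sim_color, sim_texture, evaluate):
    """Return the alpha in {0.0, 0.1, ..., 1.0} that maximizes an evaluation score."""
    best = max((evaluate(fuse(sim_color, sim_texture, a)), a)
               for a in np.round(np.arange(0.0, 1.01, 0.1), 1))
    return best[1], best[0]

# Toy usage: pretend a higher mean fused similarity of relevant images is the criterion.
rng = np.random.default_rng(8)
sim_c, sim_t = rng.random(100), rng.random(100)
alpha, score = sweep_alpha(sim_c, sim_t, evaluate=lambda s: float(s[:20].mean()))
print(alpha, round(score, 3))
```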

Figure 7. Precision/Recall curves for CTO, SRI, CRI and PDI schemes



Table 5. Average precisions at top numbers of retrieved images (left to right): numbers of retrieved images, average precisions based on CTO, SRI, CRI and PDI, respectively. The numbers in parentheses are the relative improvement over the CTO method. The last row shows the overall average precisions.
Avg. Prec.   CTO     SRI          CRI          PDI
At 20        0.54    0.76 (41%)   0.71 (31%)   0.71 (31%)
At 30        0.59    0.70 (19%)   0.68 (15%)   0.68 (15%)
At 50        0.52    0.62 (19%)   0.64 (23%)   0.63 (21%)
At 100       0.46    0.54 (17%)   0.58 (26%)   0.57 (24%)
Overall      0.38    0.45 (18%)   0.53 (39%)   0.48 (26%)

Table 5 shows the average precisions at the top 20, 30, 50, and 100 retrieved images, as well as the overall average precisions, for the methods compared. Overall, the proposed SRI, CRI, and PDI schemes improve over the CTO method by 18%, 39%, and 26%, respectively. The CRI scheme has the best overall average precision of 0.53, while the SRI scheme retrieves the highest number of relevant images within the top 20 and top 30 retrieved images.
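For concreteness, the following sketch computes precision at a cutoff and a standard non-interpolated average precision from a ranked relevance list; the chapter's exact averaging over the 16 queries may differ in detail.

```python
import numpy as np

# Sketch of the evaluation quantities behind Table 5: precision at a cutoff and a
# standard (non-interpolated) average precision, given a ranked relevance list.

def precision_at_k(ranked_relevance, k):
    return float(np.mean(ranked_relevance[:k]))

def average_precision(ranked_relevance, num_relevant):
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

# Toy usage: a hypothetical ranked list where 1 marks a relevant image.
ranked = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])
print(precision_at_k(ranked, 5), round(average_precision(ranked, int(ranked.sum())), 3))
```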

DISCUSSION
The complex task of managing multimedia semantics has attracted a lot of research interest due to the inexorable growth of multimedia information. While automatic feature extraction does offer objective measures for indexing the content of an image, it is far from sufficient for capturing the subjective and rich semantics required by humans in multimedia information retrieval tasks. Pattern classifiers provide a mid-level means to bridge the gap between low-level features and higher-level concepts (e.g., faces, buildings, indoor, outdoor, etc.). We believe that object and event detection in images and videos based on supervised or semisupervised pattern classifiers will continue to be an active research area. In particular, combining multiple modalities (visual, auditory, textual, Web) to achieve synergy among the semantic cues from different information sources has been accepted as a promising direction for creating semantic indexes for multimedia content (e.g., combining visual and textual modalities for images; auditory and textual modalities for music; auditory, visual, and textual modalities for videos, etc.) in order to enhance system performance. However, there is currently neither an established formalism nor a proven large-scale application to guide or demonstrate the exploitation of pattern classifiers and multiple modalities in semantic multimedia indexing. Hence, we believe principled representation and integration schemes for multimodality and multiclassifier systems, as well as realistic large-scale applications, will be well sought after in the next few years. While some researchers push toward a generic methodology for broad applicability, we will also see many innovative uses of multimodal pattern classifiers that incorporate domain-specific knowledge to solve specific narrow-domain multimedia indexing problems.


Similarly, in the area of semantic image indexing and retrieval, we foresee three promising trends, among other research opportunities. First, generic object detection and recognition will continue to be an important research topic, especially in the direction of unlabeled and unsegmented object recognition (e.g., Fergus et al., 2003). We hope that the lessons learned in many forthcoming object recognition systems in narrow domains can be abstracted into some generic and useful guiding principles. Next, complementary information channels will be utilized to better index the images for semantic access. For instance, in the area of consumer images, the time stamps available from digital cameras can help to organize photos into events (Cooper et al., 2003). Associated text information (e.g., stock photos, medical images, etc.) will provide a rich semantic source in addition to image content (Barnard & Forsyth, 2001; Barnard et al., 2003b; Kutics et al., 2003; Li &Wang, 2003). Last, but not least, we believe that pattern discovery (as demonstrated in this chapter) is an interesting and promising direction for image understanding and indexing. These three trends (object recognition, text association and pattern discovery) are not conflicting and their interaction and synergy would produce very powerful semantic image indexing and retrieval systems in the future.

CONCLUDING REMARKS
In this chapter, we have reviewed several key roles of pattern classifiers in content-based image retrieval systems, ranging from segmented object detection to image scene classification. We pointed out the limitations related to region segmentation for object detection, image classification for similarity matching, and the manual labeling effort for supervised learning. Three new semantic image indexing schemes were introduced to address these issues respectively. They were compared to a feature-based fusion approach that requires very high-dimensional features to attain reasonable retrieval performance on the 2,400 unconstrained consumer images with 16 semantic queries. Experimental results have confirmed that our three proposed indexing schemes are effective, especially when we consider precisions at the top retrieved images. We believe that pattern classifiers are very useful tools for bridging the semantic gap in content-based image retrieval. The potential for innovative use of pattern classifiers is promising, as demonstrated by the research results presented in this chapter.

ACKNOWLEDGMENTS

We thank T. Joachims for his great SVMlight software and J.L. Lebrun for his 2,400 family photos.

REFERENCES
Aksoy, S., & Haralick, R.M. (2002). A classification framework for content-based image retrieval. In Proceedings of International Conference on Pattern Recognition 2002 (pp. 503-506).



Bach, J.R. et al. (1996). Virage image search engine: An open framework for image management. In Storage and Retrieval for Image and Video Databases IV, Proceedings of SPIE 2670 (pp. 76-87).
Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In Proceedings of International Conference on Computer Vision 2001 (pp. 408-415).
Barnard, K. et al. (2003a). The effects of segmentation and feature choice in a translation model of object recognition. In Proceedings of IEEE Computer Vision and Pattern Recognition 2003 (pp. 675-684).
Barnard, K. et al. (2003b). Matching words and pictures. Journal of Machine Learning Research, 3, 1107-1135.
Bishop, C.M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bradshaw, B. (2000). Semantic based image retrieval: A probabilistic approach. In Proceedings of ACM Multimedia 2000 (pp. 167-176).
Brodley, C.E. et al. (1999). Content-based retrieval from medical image databases: A synergy of human interaction, machine learning and computer vision. In Proceedings of AAAI (pp. 760-767).
Carson, C. et al. (2002). Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8), 1026-1038.
Cooper, M. et al. (2003). Temporal event clustering for digital photo collections. In Proceedings of ACM Multimedia 2003 (pp. 364-373).
Duda, R.O., & Hart, P.E. (1973). Pattern classification and scene analysis. New York: John Wiley & Sons.
Duygulu, P. et al. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of European Conference on Computer Vision 2002 (vol. IV, pp. 97-112).
Enser, P. (2000). Visual image retrieval: Seeking the alliance of concept based and content based paradigms. Journal of Information Science, 26(4), 199-210.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In Proceedings of IEEE Computer Vision and Pattern Recognition 2003 (pp. 264-271).
Flickner, M. et al. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9), 23-30.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods - Support vector learning (pp. 169-184). Boston: MIT Press.
Kapur, J.N., & Kesavan, H.K. (1992). Entropy optimization principles with applications. New York: Academic Press.
Kutics, A. et al. (2003). Linking images and keywords for semantics-based image retrieval. In Proceedings of International Conference on Multimedia & Exposition (pp. 777-780).
Li, J., & Wang, J.Z. (2003). Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 1-14.
Li, J., Wang, J.Z., & Wiederhold, G. (2000). Integrated region matching for image retrieval. In Proceedings of ACM Multimedia 2000 (pp. 147-156).



Lim, J.H. (2001). Building visual vocabulary for image indexation and query formulation. Pattern Analysis and Applications, 4(2/3), 125-139.
Lipson, P., Grimson, E., & Sinha, P. (1997). Configuration based scene classification and image indexing. In Proceedings of International Conference on Computer Vision (pp. 1007-1013).
Manjunath, B.S., & Ma, W.Y. (1996). Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8), 837-842.
Moghaddam, B., Wahid, W., & Pentland, A. (1998). Beyond Eigenfaces: Probabilistic matching for face recognition. In Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (pp. 30-35).
Mojsilovic, A., & Gomes, J. (2002). Semantic based categorization, browsing and retrieval in medical image databases. In Proceedings of IEEE International Conference on Image Processing (pp. III 145-148).
Naphade, M.R. et al. (2003). A framework for moderate vocabulary semantic visual concept detection. In Proceedings of International Conference on Multimedia & Exposition (pp. 437-440).
Ortega, M. et al. (1997). Supporting similarity queries in MARS. In Proceedings of ACM Multimedia (pp. 403-413).
Papageorgiou, P.C., Oren, M., & Poggio, T. (1997). A general framework for object detection. In Proceedings of International Conference on Computer Vision (pp. 555-562).
Pentland, A., Picard, R.W., & Sclaroff, S. (1995). Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, 18(3), 233-254.
Robertson, S.E. (1977). The probability ranking principle in IR. Journal of Documentation, 33, 294-304.
Schmid, C. (2001). Constructing models for content-based image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition 2001 (pp. 39-45).
Selinger, A., & Nelson, R.C. (2001). Minimally supervised acquisition of 3D recognition models from cluttered images. In Proceedings of IEEE Computer Vision and Pattern Recognition 2001 (pp. 213-220).
Smeulders, A.W.M. et al. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.
Smith, J.R., & Chang, S.-F. (1996). VisualSEEk: A fully automated content-based image query system. In Proceedings of ACM Multimedia, Boston, November 20 (pp. 87-98).
Sung, K.K., & Poggio, T. (1998). Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1), 39-51.
Swain, M.J., & Ballard, D.H. (1991). Color indexing. International Journal of Computer Vision, 7(1), 11-32.
Szummer, M., & Picard, R.W. (1998). Indoor-outdoor image classification. In Proceedings of IEEE International Workshop on Content-based Access of Image and Video Databases (pp. 42-51).
Town, C., & Sinclair, D. (2000). Content-based image retrieval using semantic visual categories. Technical Report 2000.14, AT&T Laboratories Cambridge.


Vailaya, A. et al. (2001). Bayesian framework for hierarchical semantic classification of vacation images. IEEE Transactions on Image Processing, 10(1), 117-130.
Vasconcelos, N., & Lippman, A. (2000). A probabilistic architecture for content-based image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition (pp. 1216-1221).
Wang, L., Chan, K.L., & Zhang, Z. (2003). Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition (pp. 629-634).
Weber, M., Welling, M., & Perona, P. (2000). Unsupervised learning of models for recognition. In Proceedings of European Conference on Computer Vision (pp. 18-32).
Wu, Y., Tian, Q., & Huang, T.S. (2000). Discriminant-EM algorithm with application to image retrieval. In Proceedings of IEEE Computer Vision and Pattern Recognition (pp. 1222-1227).



Chapter 3

Self-Supervised Learning Based on Discriminative Nonlinear Features and Its Applications for Pattern Classification
Qi Tian, University of Texas at San Antonio, USA
Ying Wu, Northwestern University, USA
Jie Yu, University of Texas at San Antonio, USA
Thomas S. Huang, University of Illinois, USA

ABSTRACT

For learning-based tasks such as image classification and object recognition, the feature dimension is usually very high. Learning is afflicted by the curse of dimensionality, as the search space grows exponentially with the dimension. Discriminant expectation maximization (DEM) proposed a framework for applying self-supervised learning in a discriminating subspace. This chapter extends the linear DEM to a nonlinear kernel algorithm, Kernel DEM (KDEM), and evaluates KDEM extensively on benchmark image databases and synthetic data. Various comparisons with other state-of-the-art learning techniques are investigated for several tasks of image classification, hand posture recognition, and fingertip tracking. Extensive results show the effectiveness of our approach.



INTRODUCTION
Invariant object recognition is a fundamental but challenging computer vision task, since finding effective object representations is generally a difficult problem. Three-dimensional (3D) object reconstruction suggests a way to characterize objects invariantly. Alternatively, objects could be represented by their visual appearance without explicit reconstruction. However, representing objects in the image space is formidable, since the dimensionality of the image space is intractable. Dimension reduction can be achieved by identifying invariant image features. In some cases, domain knowledge can be exploited to extract image features from visual inputs, as in content-based image retrieval (CBIR). CBIR is a technique that uses visual content to search images from large-scale image databases according to users' interests, and it has been an active and fast-advancing research area since the 1990s (Smeulders et al., 2000). However, in many cases machines need to learn such features from a set of examples when image features are difficult to define. Successful examples of learning approaches in the areas of content-based image retrieval and face and gesture recognition can be found in the literature (Tieu et al., 2000; Cox et al., 2000; Tong & Wang, 2001; Tian et al., 2000; Belhumeur, 1996).

Generally, characterizing objects from examples requires huge training datasets, because the input dimensionality is large and the variations that object classes undergo are significant. Labeled or supervised information about the training samples is needed for recognition tasks. The generalization abilities of many current methods largely depend on the training datasets; in general, good generalization requires large and representative labeled training datasets. Unfortunately, collecting labeled data can be a tedious, if not impossible, process. Although unsupervised or clustering schemes have been proposed (e.g., Basri et al., 1998; Weber et al., 2000), it is difficult for purely unsupervised approaches to achieve accurate classification without supervision.

This problem can be alleviated by semisupervised or self-supervised learning techniques, which take hybrid training datasets. In content-based image retrieval (e.g., Smeulders et al., 2000; Tieu et al., 2000; Cox et al., 2000; Tong & Wang, 2001; Tian et al., 2000), there is only a limited number of labeled training samples given by the user query and relevance feedback (Rui et al., 1998). Pure supervised learning on such a small training dataset will have poor generalization performance; if the learning classifier is overtrained on the small training dataset, over-fitting will probably occur. However, there are a large number of unlabeled images, or unlabeled data in general, in the given database. Unlabeled data contain information about the joint distribution over features, which can be used to help supervised learning. These algorithms assume that only a fraction of the data is labeled with ground truth but still take advantage of the entire data set to generate good classifiers; they make the assumption that nearby data are likely to be generated by the same class. This learning paradigm can be seen as an integration of pure supervised and unsupervised learning. Discriminant-EM (DEM) (Wu et al., 2000) is a self-supervised learning algorithm for such purposes that uses a small set of labeled data together with a large set of unlabeled data.
The basic idea is to learn the discriminating features and the classifier simultaneously by inserting a multiclass linear discriminant step into the standard expectation-maximization (EM) (Duda et al., 2001) iteration loop. DEM makes the assumption that the probabilistic structure of the data distribution in the lower-dimensional discriminating space is simplified and can be captured by a lower-order Gaussian mixture.



Fisher discriminant analysis (FDA) and multiple discriminant analysis (MDA) (Duda et al., 2001) are traditional two-class and multiclass discriminant analysis techniques that treat every class equally when finding the optimal projection subspaces. In contrast to FDA and MDA, Zhou and Huang (2001) proposed a biased discriminant analysis (BDA) for content-based image retrieval that treats all positive (i.e., relevant) examples as one class and negative (i.e., irrelevant) examples as different classes. The intuition behind BDA is that all positive examples are alike, while each negative example is negative in its own way (Zhou & Huang, 2001). Compared with state-of-the-art methods such as support vector machines (SVM) (Vapnik, 2000), BDA outperforms SVM when the number of negative examples is small (< 20). However, one drawback of BDA is that it ignores unlabeled data in the learning process. Unlabeled data can improve the classification under the assumption that nearby data are likely to be generated by the same class (Cozman & Cohen, 2002). In the past few years there has been growing interest in the use of unlabeled data for enhancing classification accuracy in supervised learning, such as text classification (e.g., Nigram et al., 2000; Mitchell, 1999), facial expression recognition (e.g., Cohen et al., 2003), and image retrieval (e.g., Wu et al., 2000; Wang et al., 2003).

DEM differs from BDA in its use of unlabeled data and in the way the positive and negative examples are treated in the discrimination step. However, the discrimination step is linear in both DEM and BDA, and they have difficulty handling data sets that are not linearly separable. In CBIR, the image distribution is likely to be, for example, a mixture of Gaussians, which is highly nonlinearly separable. In this chapter, we generalize DEM from a linear setting to a nonlinear one. Nonlinear, kernel discriminant analysis transforms the original data space X to a higher dimensional kernel feature space F and then projects the transformed data to a lower dimensional discriminating subspace, such that nonlinear discriminating features can be identified and the training data can be better classified in a nonlinear feature subspace.

The rest of this chapter is organized as follows. In the second section, we present nonlinear discriminant analysis using kernel functions (Wu & Huang, 2001; Tian et al., 2004). In the third section, two schemes are presented for sampling training data for efficient learning of nonlinear kernel discriminants. In the fourth section, Kernel DEM is formulated, and in the fifth section we apply the Kernel DEM algorithm to various applications and compare it with other state-of-the-art methods. Our experiments include standard benchmark testing, image classification using a real image database and synthetic data, view-independent hand posture recognition, and invariant fingertip tracking. Finally, conclusions and future work are given in the last section.

NONLINEAR DISCRIMINANT ANALYSIS


Preliminary results of applying DEM to CBIR were shown in Wu et al. (2000). In this section, we generalize DEM from a linear setting to a nonlinear one. We first map the data x via a nonlinear mapping into some high-, or even infinite-, dimensional feature space F and then apply linear DEM in the feature space F. To avoid working with the mapped data explicitly (which is impossible if F is of infinite dimension), we adopt the well-known kernel trick (Schölkopf & Smola, 2002).


The kernel functions k(x, z) compute a dot product in the feature space F: k(x, z) = φ(x)^T φ(z). By formulating the algorithms in F using only dot products, we can replace any occurrence of a dot product by the kernel function k, which amounts to performing the same linear algorithm as before, but implicitly in the kernel feature space F. The kernel principle has quickly gained attention in image classification in recent years (e.g., Zhou & Huang, 2001; Wang et al., 2003; Wu et al., 2001; Tian et al., 2004; Schölkopf et al., 2002; Wolf & Shashua, 2003).
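A tiny sketch of the kernel trick for a degree-2 polynomial kernel: the kernel value equals the dot product of explicit quadratic feature maps, so the mapping never needs to be formed. The feature map shown is for 2-D inputs only and is purely illustrative.

```python
import numpy as np

# Sketch of the kernel trick: a degree-2 polynomial kernel k(x, z) = (x.z + 1)^2
# equals the dot product of explicit quadratic feature maps phi(x), so any
# algorithm written purely in dot products can work in F implicitly.

def poly2_kernel(X, Z):
    return (X @ Z.T + 1.0) ** 2

def phi(x):
    """Explicit quadratic feature map for a 2-D input (illustration only)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])
print(poly2_kernel(x[None], z[None])[0, 0], phi(x) @ phi(z))  # equal up to rounding
```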

Linear Features and Multiple Discriminant Analysis


It is common practice to preprocess data by extracting linear and nonlinear features. In many feature extraction techniques, one has a criterion assessing the quality of a single feature which ought to be optimized. Often, one has prior information available that can be used to formulate quality criteria or, probably even more commonly, the features are extracted for a certain purpose, for example, for subsequently training some classifier. What one would like to obtain is a feature that is as invariant as possible while still covering as much of the information necessary for describing the data's properties of interest. A classical and well-known technique that solves this type of problem, considering only one linear feature, is the maximization of the so-called Rayleigh coefficient (Mika et al., 2003; Duda et al., 2001):

$$ J(W) = \frac{|W^T S_1 W|}{|W^T S_2 W|} \qquad (1) $$

Here, W denotes the weight vector of a linear feature extractor (i.e., for an example x, the feature is given by the projection W^T x), and S_1 and S_2 are symmetric matrices designed such that they measure the desired information and the undesired noise along the direction W. The ratio in Equation (1) is maximized when one covers as much as possible of the desired information while avoiding the undesired noise. If we look for discriminating directions for classification, we can choose the between-class scatter S_B, which measures the separability of the class centers, as S_1 in Equation (1), and the within-class scatter S_W as S_2 in Equation (1). In this case, we recover the well-known Fisher discriminant (Fisher, 1936), where S_B and S_W are given by

$$ S_B = \sum_{j=1}^{C} N_j\, (m_j - m)(m_j - m)^T \qquad (2) $$

$$ S_W = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (x_i^{(j)} - m_j)(x_i^{(j)} - m_j)^T \qquad (3) $$

We use {x_i^{(j)}, i = 1, ..., N_j}, j = 1, ..., C (C = 2 for Fisher discriminant analysis (FDA)) to denote the feature vectors of the training samples, where C is the number of classes and N_j is the number of samples in the jth class.


samples of the jth class, x_i^{(j)} is the ith sample from the jth class, m_j is the mean vector of the jth class, and m is the grand mean of all examples. If S_1 in Equation (1) is the covariance matrix

S_1 = \frac{1}{C} \sum_{j=1}^{C} \frac{1}{N_j} \sum_{i=1}^{N_j} (x_i^{(j)} - m)(x_i^{(j)} - m)^T     (4)

and S_2 the identity matrix, we recover standard principal component analysis (PCA) (Diamantaras & Kung, 1996). If S_1 is the data covariance and S_2 the noise covariance (which can be estimated analogously to Equation (4), but over examples sampled from the assumed noise distribution), we obtain oriented PCA (Diamantaras & Kung, 1996), which aims at finding a direction that describes most of the variance in the data while avoiding known noise as much as possible.

PCA and FDA, that is, linear discriminant analysis (LDA), are both common techniques for feature dimension reduction. LDA constructs the most discriminative features, while PCA constructs the most descriptive features in the sense of packing the most energy. There has been a tendency to prefer LDA over PCA because, as intuition would suggest, the former deals directly with discrimination between classes, whereas the latter pays no particular attention to the underlying class structure. An interesting result reported by Martinez and Kak (2001) is that this is not always true: in their study on face recognition, PCA might outperform LDA when the number of samples per class is small or when the training data nonuniformly sample the underlying distribution. When the number of training samples is large and the training data are representative of each class, LDA will outperform PCA.

Multiple discriminant analysis (MDA) is a natural generalization of Fisher's linear discriminant analysis (FDA) to multiple classes (Duda et al., 2001). The goal is to maximize the ratio in Equation (1). The advantage of using this ratio is that it has been proven (Fisher, 1938) that if S_W is a nonsingular matrix, then the ratio is maximized when
the column vectors of the projection matrix W are the eigenvectors of S_W^{-1} S_B. It should be noted that W maps the original d_1-dimensional data space X to a d_2-dimensional space (d_2 ≤ C − 1, where C is the number of classes). For both FDA and MDA, the columns of the optimal W are the generalized eigenvector(s) w_i associated with the largest eigenvalue(s); W_opt = [w_1, w_2, ..., w_{C−1}] will contain in its columns the C − 1 eigenvectors corresponding to C − 1 eigenvalues, that is, S_B w_i = λ_i S_W w_i (Duda et al., 2001).
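As a concrete (and unofficial) illustration of the MDA computation described above, the following Python sketch builds S_B and S_W from Equations (2)-(3) and solves the generalized eigenproblem S_B w = λ S_W w; the small ridge term is our addition to keep S_W nonsingular for small samples.

```python
import numpy as np
from scipy.linalg import eigh

def mda_projection(X, labels, reg=1e-6):
    """Return W (d x (C-1)): leading generalized eigenvectors of S_B w = lambda S_W w."""
    classes = np.unique(labels)
    d = X.shape[1]
    m = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_B += len(Xc) * np.outer(mc - m, mc - m)   # Equation (2)
        S_W += (Xc - mc).T @ (Xc - mc)              # Equation (3)
    S_W += reg * np.trace(S_W) / d * np.eye(d)      # keep S_W positive definite
    evals, evecs = eigh(S_B, S_W)                   # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:len(classes) - 1]]
```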

Kernel Discriminant Analysis


To take into account nonlinearity in the data, we propose a kernel-based approach. The original MDA algorithm is applied in a feature space F which is related to the original space by a nonlinear mapping Φ: x → Φ(x). Since in general the number of components in Φ(x) can be very large or even infinite, this mapping is too expensive and cannot be carried out explicitly, but only through the evaluation of a kernel k, with elements k(x_i, x_j) = Φ(x_i)^T Φ(x_j).


This is the same idea adopted by the support vector machine (Vapnik, 2000), kernel PCA (Schölkopf et al., 1998), and invariant feature extraction (Mika et al., 1999; Roth & Steinhage, 1999). The trick is to rewrite the MDA formulae using only dot products of the form Φ_i^T Φ_j, so that the reproducing kernel matrix can be substituted into the formulation and the solution, thus eliminating the need for the direct nonlinear transformation. Using the superscript Φ to denote quantities in the new space and using S_B^Φ and S_W^Φ for the between-class and within-class scatter matrices, we have the objective function in the following form:

W_{opt}^{\Phi} = \arg\max_{W} \frac{|W^T S_B^{\Phi} W|}{|W^T S_W^{\Phi} W|}     (5)

and

S_B^{\Phi} = \sum_{j=1}^{C} N_j (m_j^{\Phi} - m^{\Phi})(m_j^{\Phi} - m^{\Phi})^T     (6)

S_W^{\Phi} = \sum_{j=1}^{C} \sum_{i=1}^{N_j} (\Phi(x_i^{(j)}) - m_j^{\Phi})(\Phi(x_i^{(j)}) - m_j^{\Phi})^T     (7)

with m^{\Phi} = \frac{1}{N} \sum_{k=1}^{N} \Phi(x_k) and m_j^{\Phi} = \frac{1}{N_j} \sum_{k=1}^{N_j} \Phi(x_k^{(j)}), where j = 1, ..., C, and N is the total number of samples.

In general, there is no other way to express the solution W_opt^Φ ∈ F, either because F is of too high or even infinite dimension, or because we do not even know the actual feature space connected to a certain kernel. Schölkopf and Smola (2002) and Mika et al. (2003) showed that any column w_i of the solution W_opt^Φ must lie in the span of all training samples in F, that is, w_i ∈ F. Thus, for some expansion coefficients α = [α_1, ..., α_N]^T,
w_i = \sum_{k=1}^{N} \alpha_k \Phi(x_k) = \Phi \alpha, \quad i = 1, \ldots, N     (8)

where Φ = [Φ(x_1), ..., Φ(x_N)]. We can therefore project a data point x_k onto one coordinate of the linear subspace of F as follows (we drop the subscript on w_i in the ensuing equations):

w^T \Phi(x_k) = \alpha^T \Phi^T \Phi(x_k)     (9)



= \alpha^T [k(x_1, x_k), \ldots, k(x_N, x_k)]^T = \alpha^T \xi_k     (10)

\xi_k = [k(x_1, x_k), \ldots, k(x_N, x_k)]^T     (11)

where we have rewritten the dot products Φ(x)^T Φ(y) with the kernel notation k(x, y). Similarly, we can project each of the class means onto an axis of the subspace of the feature space F using only dot products:

w^T m_j^{\Phi} = \alpha^T \frac{1}{N_j} \sum_{k=1}^{N_j} \big[\Phi(x_1)^T \Phi(x_k), \ldots, \Phi(x_N)^T \Phi(x_k)\big]^T     (12)

= \alpha^T \Big[\frac{1}{N_j} \sum_{k=1}^{N_j} k(x_1, x_k), \ldots, \frac{1}{N_j} \sum_{k=1}^{N_j} k(x_N, x_k)\Big]^T     (13)

= \alpha^T \mu_j     (14)

It follows that

w^T S_B^{\Phi} w = \alpha^T K_B \alpha     (15)

where K_B = \sum_{j=1}^{C} N_j (\mu_j - \mu)(\mu_j - \mu)^T, and

w^T S_W^{\Phi} w = \alpha^T K_W \alpha     (16)

where K_W = \sum_{j=1}^{C} \sum_{k=1}^{N_j} (\xi_k^{(j)} - \mu_j)(\xi_k^{(j)} - \mu_j)^T. The goal of kernel multiple discriminant analysis

(KMDA) is to find



A_{opt} = \arg\max_{A} \frac{|A^T K_B A|}{|A^T K_W A|}     (17)

where A = [α_1, ..., α_{C−1}], C is the total number of classes, N is the number of training samples, and K_B and K_W are N × N matrices which require only kernel computations on the training samples (Schölkopf & Smola, 2002). Now we can solve for the α's; the projection of a new pattern z onto w is given by Equations (9) and (10). Similarly, algorithms using different matrices for S_1 and S_2 in Equation (1) are easily obtained along the same lines.
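A rough NumPy sketch of KMDA as formulated above (ours, not the authors' implementation; the grand mean of the ξ vectors and the small diagonal regularizer are assumptions). It builds the kernel vectors ξ_k, forms K_B and K_W, solves Equation (17) as a generalized eigenproblem, and projects new patterns via α^T ξ(z).

```python
import numpy as np
from scipy.linalg import eigh

def rbf(X, Z, c=10.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0) / c)

def kmda(X, labels, c=10.0, reg=1e-3):
    """Kernel MDA: returns (A, project), where A maximizes Equation (17)."""
    N = len(X)
    Xi = rbf(X, X, c)                          # column k is xi_k (Equation (11))
    mu = Xi.mean(axis=1)                       # grand mean of the xi vectors
    classes = np.unique(labels)
    K_B = np.zeros((N, N))
    K_W = np.zeros((N, N))
    for cl in classes:
        idx = np.where(labels == cl)[0]
        mu_j = Xi[:, idx].mean(axis=1)         # Equations (13)-(14)
        K_B += len(idx) * np.outer(mu_j - mu, mu_j - mu)
        diff = Xi[:, idx] - mu_j[:, None]
        K_W += diff @ diff.T
    K_W += reg * np.trace(K_W) / N * np.eye(N)  # diagonal regularization
    evals, A = eigh(K_B, K_W)
    A = A[:, np.argsort(evals)[::-1][:len(classes) - 1]]
    project = lambda Z: rbf(X, Z, c).T @ A      # projections of new patterns z
    return A, project
```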

Biased Discriminant Analysis


Biased discriminant analysis (BDA) (Zhou & Huang, 2001) differs from the traditional MDA defined in Equations (1)-(3) and (5)-(7) in the computation of the between-class scatter matrix S_B and the within-class scatter matrix S_W. They are replaced by S_{N→P} and S_P, respectively.

S_{N \to P} = \sum_{i=1}^{N_y} (y_i - m_x)(y_i - m_x)^T     (18)

S_P = \sum_{i=1}^{N_x} (x_i - m_x)(x_i - m_x)^T     (19)

where {x_i, i = 1, ..., N_x} denotes the positive examples, {y_i, i = 1, ..., N_y} denotes the negative examples, and m_x is the mean vector of the set {x_i}. S_{N→P} is the scatter matrix between the negative examples and the centroid of the positive examples, and S_P is the scatter matrix within the positive examples. The subscript N→P indicates the asymmetric property of this approach, that is, the user's biased opinion toward the positive class, hence the name biased discriminant analysis (BDA) (Zhou & Huang, 2001).
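For concreteness, a small sketch (ours) of the two biased scatter matrices of Equations (18)-(19); the biased discriminating directions are then the leading generalized eigenvectors of S_{N→P} w = λ S_P w, computed here with a tiny ridge term added for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def bda_scatter(positives, negatives):
    """S_NP: scatter of the negatives around the positive centroid (Eq. 18);
    S_P : scatter within the positive examples (Eq. 19)."""
    m_x = positives.mean(axis=0)
    dn = negatives - m_x
    dp = positives - m_x
    return dn.T @ dn, dp.T @ dp

# Toy usage: one biased discriminating direction from 2-D examples.
pos = np.random.randn(10, 2) * 0.3
neg = np.random.randn(15, 2) + np.array([2.0, 0.0])
S_NP, S_P = bda_scatter(pos, neg)
evals, W = eigh(S_NP, S_P + 1e-6 * np.eye(2))
w = W[:, np.argmax(evals)]
```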

Regularization and Discounting Factors


It is well known that sample-based plug-in estimates of the scatter matrices based on Equations (2), (3), (6), (7), (18), and (19) will be severely biased for a small number of training samples; that is, the large eigenvalues become larger, while the small ones become smaller. If the number of feature dimensions is large compared to the number of training examples, the problem becomes ill-posed. Especially in the case of kernel algorithms, we effectively work in the space spanned by all N mapped training examples Φ(x), which are, in practice, often linearly dependent. For instance, for KMDA, a solution with zero within-class scatter (i.e., A^T K_W A = 0) is very likely due to overfitting. A compensation or regularization can be achieved by adding small quantities to the diagonal of the scatter matrices (Friedman, 1989).



TRAINING ON A SUBSET
We still have one problem: Although we can avoid working explicitly in the extremely high or infinite dimensional space F, we are now facing a problem in N variables, a number which in many practical applications does not allow us to store or manipulate N × N matrices on a computer anymore. Furthermore, solving, for example, an eigenproblem or a quadratic program of this size is very time consuming (O(N^3)). To maximize Equation (17), we need to solve an N × N eigen- or mathematical programming problem, which might be intractable for a large N. Approximate solutions can be obtained by sampling a representative subset of the training data {x_k | k = 1, ..., M}, M ≪ N, and using

ξ̃_k = [k(x_1, x_k), ..., k(x_M, x_k)]^T to take the place of ξ_k. Two data-sampling schemes are proposed.

PCA-Based Kernel Vector Selection

The first scheme is blind to the class labeling. We select representatives, or kernel vectors, by identifying those training samples which are likely to play a key role in

Ξ = [ξ_1, ..., ξ_N]. Ξ is an N × N matrix, but rank(Ξ) ≪ N when the size of the training dataset is very large. This fact suggests that some training samples can be ignored in calculating the kernel features ξ. We first compute the principal components of Ξ. Denote the N × N matrix of concatenated eigenvectors by P. Thresholding the elements of abs(P) by some fraction of its largest element allows us to identify salient PCA coefficients. For each column corresponding to a nonzero eigenvalue, we choose the training samples which correspond to a salient PCA coefficient, that is, the training samples corresponding to rows that survive the thresholding. Doing so for every nonzero eigenvalue, we arrive at a decimated training set, which represents data at the periphery of each data cluster. Figure 1 shows an example of KMDA with a 2D (two-dimensional) two-class nonlinearly separable sample set.
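A rough sketch of the PCA-based selection just described (ours; the thresholding fraction and the eigenvalue tolerance are assumed values, not taken from the chapter).

```python
import numpy as np

def pca_kernel_vectors(Xi, frac=0.5, eig_tol=1e-8):
    """Select kernel-vector indices from the N x N matrix Xi of kernel features
    (column k is xi_k), by thresholding |PCA coefficients| at frac * max."""
    Xc = Xi - Xi.mean(axis=1, keepdims=True)      # center the columns of Xi
    evals, P = np.linalg.eigh(Xc @ Xc.T)          # columns of P: eigenvectors
    keep = evals > eig_tol * evals.max()          # columns with "nonzero" eigenvalues
    A = np.abs(P[:, keep])
    salient = A > frac * A.max()                  # threshold abs(P)
    return np.where(salient.any(axis=1))[0]       # rows that survive
```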

Evolutionary Kernel Vector Selection


The second scheme is to take advantage of class labels in the data. We maintain a set of kernel vectors at every iteration which are meant to be the key pieces of data for training. M initial kernel vectors, KV (0), are chosen at random. At iteration k, we have a set of kernel vectors, KV (k), which are used to perform KMDA such that the nonlinear
projection y_i^{(k)} = w^{(k)T} Φ(x_i) = A_opt^{(k)T} ξ_i^{(k)} of the original data x_i can be obtained. We assume a Gaussian distribution θ^{(k)} for each class in the nonlinear discriminating space Δ, and the parameters θ^{(k)} can be estimated from {y^{(k)}}, such that the labeling and the training error e^{(k)} can be obtained by l_i^{(k)} = \arg\max_j p(l_j \mid y_i, \theta^{(k)}).

If e^{(k)} < e^{(k−1)}, we randomly select M training samples from the correctly classified training samples as the kernel vectors KV^{(k+1)} for iteration k + 1. Another possibility is that, if any current kernel vector is correctly classified, we randomly select a sample in its



Figure 1. KMDA with a 2D two-class nonlinearly separable example: (a) the original data; (b) the kernel features ξ of the data; (c) the normalized coefficients of PCA on Ξ, in which only a small number are large (in black); (d) the nonlinear mapping of the data


topological neighborhood to replace this kernel vector in the next iteration. Otherwise, that is, if e^{(k)} ≥ e^{(k−1)}, we terminate. The evolutionary kernel vector selection algorithm is summarized below.

Evolutionary Kernel Vector Selection: Given a set of training data D = (X, L) = {(x_i, l_i), i = 1, ..., N}, identify a set of M kernel vectors KV = {v_i, i = 1, ..., M}.

k = 0; e = ∞; KV^(0) = random_pick(X);          // Init
do {
    A_opt^(k) = KMDA(X, KV^(k));                // Perform KMDA
    Y^(k) = Proj(X, A_opt^(k));                 // Project X to Delta
    theta^(k) = Bayes(Y^(k), L);                // Bayesian classifier
    L^(k) = Labeling(Y^(k), theta^(k));         // Classification
    e^(k) = Error(L^(k), L);                    // Calculate error
    if (e^(k) < e) {
        e = e^(k); KV = KV^(k); k++;
        KV^(k) = random_pick({x_i : l_i^(k) = l_i});
    } else {
        KV = KV^(k-1); break;
    }
}
return KV;

KERNEL DEM ALGORITHM

In this chapter, pattern classification is formulated as a transductive problem, which is to generalize the mapping function learned from the labeled training data set L to a specific unlabeled data set U. We make the assumption here that L and U are drawn from the same distribution. This assumption is reasonable because, for example, in content-based image retrieval, the query images are drawn from the same image database. In short, pattern classification is to classify the images or objects in the database by

y_i = \arg\max_{j=1,\ldots,C} p(y_j \mid x_i, L, U), \quad x_i \in U     (20)

where C is the number of classes and y_i is the class label for x_i. The expectation-maximization (EM) (Duda et al., 2001) approach can be applied to this transductive learning problem, since the labels of the unlabeled data can be treated as missing values. We assume that the hybrid data set is drawn from a mixture density distribution of C components {c_j, j = 1, ..., C}, which are parameterized by θ = {θ_j, j = 1, ..., C}. The mixture model can be represented as

p(x \mid \theta) = \sum_{j=1}^{C} p(x \mid c_j; \theta_j)\, p(c_j \mid \theta_j)     (21)

where x is a sample drawn from the hybrid data set D = L ∪ U. We make the further assumption that each component in the mixture model corresponds to one class, that is, {y_j = c_j, j = 1, ..., C}. Since the training data set D is the union of the labeled data set L and the unlabeled data set U, the joint probability density of the hybrid data set can be written as:

p(D \mid \theta) = \prod_{x_i \in U} \sum_{j=1}^{C} p(c_j \mid \theta)\, p(x_i \mid c_j; \theta) \; \prod_{x_i \in L} p(y_i = c_i \mid \theta)\, p(x_i \mid y_i = c_i; \theta)     (22)


Equation (22) holds when we assume that each sample is independent of the others. The first part of Equation (22) is for the unlabeled data set, and the second part is for the labeled data set. The parameters θ can be estimated by maximizing the a posteriori probability p(θ | D) or, equivalently, by maximizing log(p(θ | D)). Let

l(\theta \mid D) = \log(p(\theta)\, p(D \mid \theta)); then we have

l(\theta \mid D) = \log(p(\theta)) + \sum_{x_i \in U} \log\Big(\sum_{j=1}^{C} p(c_j \mid \theta)\, p(x_i \mid c_j; \theta)\Big) + \sum_{x_i \in L} \log\big(p(y_i = c_i \mid \theta)\, p(x_i \mid y_i = c_i; \theta)\big)     (23)

Since the log of a sum is hard to deal with, a binary indicator z_i = (z_{i1}, ..., z_{iC}) is introduced, with class c_j denoted as observation O_j: z_{ij} = 1 if and only if y_i = c_j, and z_{ij} = 0 otherwise, so that

l(\theta \mid D, Z) = \log(p(\theta)) + \sum_{x_i \in D} \sum_{j=1}^{C} z_{ij} \log\big(p(O_j \mid \theta)\, p(x_i \mid O_j; \theta)\big)     (24)

The EM algorithm can be used to estimate the probability parameters by an iterative hill-climbing procedure, which alternately calculates E(Z), the expected values of the indicators for all unlabeled data, and estimates the parameters θ given E(Z). The EM algorithm generally reaches a local maximum of l(θ | D). As an extension to the EM algorithm, Wu et al. (2000) proposed a three-step algorithm, called Discriminant-EM (DEM), which loops between an expectation step, a discriminant step (via MDA), and a maximization step. DEM estimates the parameters of a generative model in a discriminating space. As discussed earlier, Kernel DEM (KDEM) is a generalization of DEM in which, instead of a simple linear transformation to project the data into discriminant subspaces, the data are first projected nonlinearly into a high dimensional feature space F where the data are better linearly separated. The nonlinear mapping Φ(·) is implicitly determined by the kernel function, which must be chosen in advance. The transformation from the original data space X to the discriminating space Δ, which is a linear subspace of the feature space F, is given by w^T Φ(·) implicitly or A^T ξ explicitly. A low-dimensional generative model is used to capture the transformed data in Δ:

p(y \mid \theta) = \sum_{j=1}^{C} p(w^T \Phi(x) \mid c_j; \theta_j)\, p(c_j \mid \theta_j)     (25)

Empirical observations suggest that the transformed data y approximate a Gaussian in Δ, and so in our current implementation, we use low-order Gaussian mixtures to model



the transformed data in Δ. Kernel DEM can be initialized by selecting all labeled data as kernel vectors and training a weak classifier based only on the labeled samples. Then, the three steps of Kernel DEM are iterated until some appropriate convergence criterion is met:

E-step: set Z^{(k+1)} = E[Z \mid D; \theta^{(k)}]

D-step: set A_{opt}^{(k+1)} = \arg\max_A \frac{|A^T K_B A|}{|A^T K_W A|}, and project the data points into the linear subspace Δ of the feature space F.

M-step: set \theta^{(k+1)} = \arg\max_{\theta} p(\theta \mid D; Z^{(k+1)})

The E-step gives probabilistic labels to unlabeled data, which are then used by the D-step to separate the data. As mentioned above, this assumes that the class distribution is moderately smooth.
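The following self-contained Python sketch illustrates one way the three-step loop could be realized (it is our own simplified illustration, not the authors' implementation): soft labels for the unlabeled data, soft-weighted kernel scatter matrices in the D-step, and one Gaussian per class in the projected space Δ.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.stats import multivariate_normal

def rbf(X, Z, c=10.0):
    sq = (X**2).sum(1)[:, None] + (Z**2).sum(1)[None, :] - 2 * X @ Z.T
    return np.exp(-np.maximum(sq, 0) / c)

def kdem(X_lab, y_lab, X_unl, n_iter=5, c=10.0, reg=1e-3):
    """Kernel DEM sketch: labeled data bootstraps the model; the E-step gives
    probabilistic labels to the unlabeled data."""
    X = np.vstack([X_lab, X_unl])
    classes = np.unique(y_lab)
    C, N, n_lab = len(classes), len(X), len(X_lab)
    Z = np.full((N, C), 1.0 / C)                       # soft labels
    Z[:n_lab] = np.eye(C)[np.searchsorted(classes, y_lab)]
    Xi = rbf(X, X, c)                                  # kernel features
    for _ in range(n_iter):
        # D-step: soft-weighted K_B, K_W and projection A (Equation (17)).
        mu = Xi.mean(axis=1)
        K_B, K_W = np.zeros((N, N)), np.zeros((N, N))
        for j in range(C):
            w = Z[:, j]
            mu_j = Xi @ w / w.sum()
            K_B += w.sum() * np.outer(mu_j - mu, mu_j - mu)
            d = Xi - mu_j[:, None]
            K_W += (d * w) @ d.T
        K_W += reg * np.trace(K_W) / N * np.eye(N)
        evals, A = eigh(K_B, K_W)
        A = A[:, np.argsort(evals)[::-1][:C - 1]]
        Y = Xi.T @ A                                   # data in the space Delta
        # M-step: one Gaussian per class in Delta, weighted by the soft labels.
        priors = Z.sum(axis=0) / N
        models = []
        for j in range(C):
            w = Z[:, j]
            m = (Y * w[:, None]).sum(0) / w.sum()
            cov = ((Y - m) * w[:, None]).T @ (Y - m) / w.sum() + 1e-6 * np.eye(C - 1)
            models.append(multivariate_normal(m, cov))
        # E-step: probabilistic labels for the unlabeled part only.
        p = np.column_stack([priors[j] * models[j].pdf(Y) for j in range(C)])
        Z[n_lab:] = p[n_lab:] / p[n_lab:].sum(axis=1, keepdims=True)
    return Z[n_lab:], A
```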

EXPERIMENTS AND ANALYSIS


In this section, we compare KMDA and KDEM with other supervised learning techniques on various benchmark datasets and synthetic data for image classification, hand posture recognition, and invariant fingertip tracking tasks. The various datasets include the benchmark datasets2, the MIT facial image database3 (CBCL Face Database), the Corel database4, our raw dataset of 14,000 unlabeled hand images together with 560 labeled images, and 1,000 images including both fingertips and nonfingertips.

Benchmark Test for KMDA


In the first experiment, we verify the ability of KMDA with our data sampling algorithms. Several benchmark datasets2 are used in the experiments. For comparison, KMDA is compared with a single RBF classifier (RBF), a support vector machine (SVM), AdaBoost, the kernel Fisher discriminant (KFD) on the benchmark datasets (Mika et al., 1999), and linear MDA. Kernel functions that have proven useful are, for example, the Gaussian RBF, k(x, z) = exp(−||x − z||^2 / c), or polynomial kernels, k(x, z) = (x · z)^d, for some positive constants c ∈ R and d ∈ N, respectively (Schölkopf & Smola, 2002). RBF kernels are used in all kernel-based algorithms. In Table 1, KMDA-random is KMDA with kernel vectors randomly selected from the training samples, KMDA-pca is KMDA with kernel vectors selected from the training samples based on PCA, and KMDA-evolutionary is KMDA with kernel vectors selected from the training samples based on the evolutionary scheme. The benchmark test shows that Kernel MDA achieves performance comparable to other state-of-the-art techniques over the different training datasets, in spite of the use of a decimated training set. Comparing the three schemes of selecting kernel vectors, it is clear that both the PCA-based and the evolutionary-based schemes work slightly better than the random selection scheme, having a smaller error rate and/or a smaller standard deviation. Finally, Table 1 clearly shows the superior performance of KMDA over linear MDA.


Table 1. Benchmark test: Average test error and standard deviation in percentage
Classification Method     Banana          Breast-Cancer    Heart
RBF                       10.8 ± 0.06     27.6 ± 0.47      17.6 ± 0.33
AdaBoost                  12.3 ± 0.07     30.4 ± 0.47      20.3 ± 0.34
SVM                       11.5 ± 0.07     26.0 ± 0.47      16.0 ± 0.33
KFD                       10.8 ± 0.05     25.8 ± 0.48      16.1 ± 0.34
MDA                       38.43 ± 2.5     28.57 ± 1.37     20.1 ± 1.43
KMDA-random               11.03 ± 0.26    27.4 ± 1.53      16.5 ± 0.85
KMDA-pca                  10.7 ± 0.25     27.5 ± 0.47      16.5 ± 0.32
KMDA-evolutionary         10.8 ± 0.56     26.3 ± 0.48      16.1 ± 0.33
(# Kernel Vectors)        120             40               20

Kernel Setting
There are two parameters that need to be determined for kernel algorithms using the RBF (radial basis function) kernel: the first is the degree c, and the second is the number of kernel vectors used. Kernel-based approaches are sensitive to the parameters selected; for example, the Gaussian (or radial basis function) kernel k(x, z) = exp(−||x − z||^2 / c) is highly sensitive to the scale parameter c. The performance of kernel-based approaches varies not only across different kernels but also within the same kernel when applied to different image databases, for example. To date there is no general guideline on how to set the parameters beforehand other than setting them empirically. In the second experiment, we determine the degree c and the number of kernel vectors empirically, using the Gaussian RBF kernel as an example. The same benchmark datasets as in the previous section are used. Figure 2 shows the average error rate in percentage of KDEM with the RBF kernel under different degrees c and a varying number of kernel vectors on the heart data. By empirical observation, we find that c = 10 and 20 kernel vectors give nearly the best performance at a relatively low computational cost. Similar results are obtained for tests on the breast-cancer data and the banana data. Therefore this kernel setting is used in the rest of our experiments.
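Since c and the number of kernel vectors are set empirically, the tuning amounts to a simple grid search over candidate values, as in the sketch below (ours; error_rate is a stand-in for training KDEM with a given setting and returning the validation error, and the demo function is purely illustrative).

```python
import numpy as np

def tune_kernel(error_rate, cs=(1, 5, 10, 20, 40, 60, 80, 100),
                n_kvs=(1, 5, 10, 20, 40, 60, 80, 100)):
    """Pick the (c, number of kernel vectors) pair with the lowest error."""
    grid = [(error_rate(c, m), c, m) for c in cs for m in n_kvs]
    best_err, best_c, best_m = min(grid)
    return best_c, best_m, best_err

# Stand-in error function for illustration only (replace with real training).
demo = lambda c, m: abs(np.log10(c) - 1) + abs(m - 20) / 100
print(tune_kernel(demo))       # -> (10, 20, 0.0)
```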

KDEM versus KBDA for Image Classification


As mentioned above, biased discriminant analysis (BDA) (Zhou & Huang, 2001) has achieved satisfactory results in content-based image retrieval when the number of training samples is small (<20). BDA differs from traditional MDA in that it tends to cluster all the positive samples and scatter all the negative samples from the centroid of the



Figure 2. Average error rate for KDEM with RBF kernel under varying degree c and number of kernel vectors on heart data

Figure 3. Comparison of KDEM and KBDA for face and non-face classification

positive examples. This works very well with a relatively small training set. However, BDA is biased towards the centroid of the positive examples. It will be effective only if these positive examples are the most-informative images (Cox et al., 2000; Tong & Wang, 2001), for example, images close to the classification boundary. If the positive examples are most-positive images (Cox et al., 2000; Tong & Wang, 2001), for example, images far away from the classification boundary, the optimal transformation found based on them will not help the classification of images on the boundary. Moreover, BDA ignores the unlabeled data and uses only the labeled data in learning. In the third experiment, Kernel DEM (KDEM) is compared with Kernel BDA (KBDA) on both an image database and synthetic data. Figure 3 shows the average classification error rate in percentage for KDEM and KBDA with the same RBF kernel for face and nonface classification. The face images are from the MIT facial image database3 (CBCL Face Database) and the nonface images are from a Corel database4. There are 2,429 face images from the MIT database and 1,385 nonface images (14 categories with about 99 images per category), a subset of the Corel database, in the experiment. Some


Figure 4. Examples of (a) face images from MIT facial database and (b) non-face images from Corel database


examples of face and nonface images are shown in Figure 4. For the training sets, the face images are randomly selected from the MIT database with a fixed size of 100, and the nonface images are randomly selected from the Corel database with sizes varying from five to 200. The testing set consists of 200 random images (100 faces and 100 nonfaces) from the two databases. The images are resized to 16 × 16 and converted to column-wise concatenated feature vectors. In Figure 3, when the number of negative examples is small (< 20), KBDA outperforms KDEM, and KDEM performs better when more negative examples are provided. This agrees with our expectation. In this experiment, the number of negative examples is increased from five to 200. There is a possibility that most of the negative examples are from the same class. To further test the capability of KDEM and KBDA in classifying negative examples with a varying number of classes, we perform experiments on synthetic data, for which we have more control over the data distribution. A series of synthetic datasets is generated based on Gaussian or Gaussian mixture models with feature dimensions of 2, 5, and 10 and a varying number of negative classes from 1 to 9. In the feature space, the centroid of the positive samples is set at the origin and the centroids of the negative classes are set randomly at distance 1 from the origin. The variance of each class is a random number between 0.1 and 0.3. The features are independent of each other. We include the 2D synthetic data for visualization purposes. Both the training and testing sets have a fixed size of 200 samples, with 100 positive samples and 100 negative samples from a varying number of classes. Figure 5 shows the comparison of the KDEM, KBDA and DEM algorithms on 2D, 5D and 10D synthetic data. In all cases, with the number of negative classes increasing from 1



Figure 5. Comparison of KDEM, KBDA and DEM algorithms on (a) 2-D (b) 5-D (c) 10D synthetic data with varying number of negative classes

to 9, KDEM always performs better than KBDA and DEM and thus shows its superior capability for multiclass classification. Linear DEM has comparable performance with KBDA on the 2D synthetic data and outperforms KBDA on the 10D synthetic data. One possible reason is that learning is performed on the hybrid data in both DEM and KDEM, while only labeled data



is used in KBDA. This indicates that proper incorporation of unlabeled data in semisupervised learning does improve classification to some extent. Moreover, Zhou and Huang (2001) used two parameters μ, γ ∈ [0, 1] to control the regularization. The regularized versions of S_P and S_{N→P}, with n being the dimension of the original space and I_n the identity matrix, are

\hat{S}_P = (1 - \mu) S_P + \frac{\mu}{n} \mathrm{tr}[S_P]\, I_n     (26)

\hat{S}_{N \to P} = (1 - \gamma) S_{N \to P} + \frac{\gamma}{n} \mathrm{tr}[S_{N \to P}]\, I_n     (27)

The parameter μ controls shrinkage toward a multiple of the identity matrix, tr[·] denotes the trace operation for a matrix, and γ is the discounting factor. With different combinations of the (μ, γ) values, the regularized and/or discounted BDA provides a rich set of alternatives: (μ = 0, γ = 1) gives a subspace that is mainly defined by minimizing the scatter among the positive examples, resembling the effect of a whitening transform (the whitening transform is the special case when only positive examples are considered); (μ = 1, γ = 0) gives a subspace that mainly separates the negative examples from the positive centroid, with minimal effort on clustering the positive examples; (μ = 0, γ = 0) is the full BDA; and (μ = 1, γ = 1) represents the extreme of discounting all configurations of the training examples and keeping the original feature space unchanged. However, this set of (μ, γ) values was proposed without further testing; Zhou and Huang (2001) analyzed only the full BDA (μ = 0, γ = 0). To take a step further, we also investigate the effect of various combinations of (μ, γ) values on the performance of BDA. We test on cropped face images consisting of 94 facial images (48 male and 48 female). We feed BDA a small number of training samples with different values of (μ, γ). We find that the full BDA (μ = 0, γ = 0) can be further improved by 41.4% in terms of average error rate with a different setting (μ = 0.1, γ = 0.4). This is a promising result, and we will further investigate the regularization issue for all discriminant-based approaches in future work.
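A short sketch of the regularized and discounted scatter matrices (ours; the exact placement of μ and γ follows the reconstruction of Equations (26)-(27) above and should be checked against the BiasMap paper).

```python
import numpy as np
from scipy.linalg import eigh

def regularized_bda(positives, negatives, mu=0.1, gamma=0.4):
    """Biased discriminant with shrinkage (mu) and discounting (gamma) in [0, 1];
    (mu, gamma) = (0, 0) recovers full BDA."""
    n = positives.shape[1]
    m_x = positives.mean(axis=0)
    S_P = (positives - m_x).T @ (positives - m_x)
    S_NP = (negatives - m_x).T @ (negatives - m_x)
    S_P_r = (1 - mu) * S_P + (mu / n) * np.trace(S_P) * np.eye(n)           # Eq. (26)
    S_NP_r = (1 - gamma) * S_NP + (gamma / n) * np.trace(S_NP) * np.eye(n)  # Eq. (27)
    evals, W = eigh(S_NP_r, S_P_r + 1e-8 * np.eye(n))
    return W[:, np.argsort(evals)[::-1]]    # biased discriminating directions
```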

Hand Posture Recognition


In the fourth experiment, we examine KDEM on a hand gesture recognition task. The task is to classify among 14 different hand postures, each of which represents a gesture command model, such as navigating, pointing, grasping, etc. Our raw dataset consists of 14,000 unlabeled hand images together with 560 labeled images (approximately 40 labeled images per hand posture), most from video of subjects making each of the hand postures. These 560 labeled images are used to test the classifiers by calculating the classification errors. Hands are localized in video sequences by adaptive color segmentation, and hand regions are cropped and converted to gray-level images. Gabor wavelet (Jain & Farroknia, 1991) filters with three levels and four orientations are used to extract 12 texture features. Ten coefficients from the Fourier descriptor of the occluding contour are used to represent hand shape. We also use area, contour length, total edge length, density, and second moments of edge distribution, for a total of 28 low-level image features (I-feature). For comparison, we represent images by coefficients of 22 largest principal components


Table 2. View-independent hand posture recognition: Comparison among multi-layer perceptron (MLP), Nearest Neighbor (NN), Nearest Neighbor with growing templates (NN-G), EM, linear DEM (LDEM) and KDEM. The average error rate in percentage on 560 labeled and 14,000 unlabeled hand images with 14 different hand postures.
Algorithm     MLP     NN      NN-G    EM      LDEM    KDEM
I-Feature     33.3    30.2    15.8    21.4    9.2     5.3
E-Feature     39.6    35.7    20.3    20.8    7.6     4.9

of the dataset resized to 20 × 20 pixels (these are eigenimages, or E-features). In our experiments, we use 140 labeled images (10 for each hand posture) and 10,000 unlabeled images (randomly selected from the whole database) for training both EM and DEM. Table 2 shows the comparison. Six classification algorithms are compared in this experiment. The multilayer perceptron (Haykin, 1999) used in this experiment has one hidden layer of 25 nodes. We experiment with two schemes of the nearest neighbor classifier: one uses just the 140 labeled samples, and the other uses the 140 labeled samples to bootstrap the classifier by a growing scheme, in which newly labeled samples are added to the classifier according to their labels. The labeled and unlabeled data for both EM and DEM are 140 and 10,000, respectively. We observe that multilayer perceptrons are often trapped in local minima, and nearest neighbors suffers from the sparsity of the labeled templates. The poor performance of pure EM is due to the fact that the generative model does not capture the

Figure 6. Data distribution in the projected subspace (a) Linear MDA (b) Kernel MDA. Different postures are more separated and clustered in the nonlinear subspace by KMDA.




Figure 7. (a) Some images correctly classified by both LDEM and KDEM; (b) images that are mislabeled by LDEM but correctly labeled by KDEM; (c) images that neither LDEM nor KDEM can correctly label.


ground-truth distribution well, since the underlying data distribution is highly complex. It is not surprising that Linear DEM (LDEM) and KDEM outperform the other methods, since the D-step optimizes the separability of the classes. Comparing KDEM with LDEM, we find that KDEM often projects classes to approximately Gaussian clusters in the transformed space, which facilitates modeling them with Gaussians. Figure 6 shows typical transformed data sets for linear and nonlinear discriminant analysis, in projected 2D subspaces of three different hand postures. The different postures are more separated and clustered in the nonlinear subspace obtained by KMDA. Figure 7 shows some examples of correctly classified and mislabeled hand postures for KDEM and Linear DEM.

Fingertip Tracking
In some vision-based gesture interface systems, fingers can be used as accurate pointing input devices. Also, fingertip detection and tracking play an important role in recovering hand articulations. A difficulty of the task is that fingertip motion often undergoes arbitrary rotations, which makes it hard to characterize fingertips invariantly. In the last experiment, the proposed Kernel DEM algorithm is employed to discriminate fingertips and nonfingertips. We collected 1,000 training samples including both fingertips and nonfingertips. Nonfingertip samples are collected from the background of the working space. Some training samples are shown in Figure 8. Fifty samples for each of the two classes are manually labeled. Training images are resized to 20 × 20 and converted to gray-level images. Each training sample is represented by its coefficients of the 22 largest principal components. The Kernel DEM algorithm is performed on this training dataset to obtain a kernel transformation and a Bayesian classifier. Assume that at time t−1, the fingertip location is



Figure 8. (a) Fingertip samples and (b) non-fingertip samples


X_{t−1} in the image. At time t, the predicted location of the fingertip is X̃_t according to the Kalman prediction. For simplicity, the size of the search window is fixed at 10 × 10, centered at X̃_t. For each location in the search window, a fingertip candidate is constructed from the 20 × 20 image patch centered at that location. Thus, 100 candidates are tested. A probabilistic label for each fingertip candidate is obtained by classification, and the one with the largest probability is taken as the tracked location at time t. We run the tracking algorithm on sequences containing a large amount of fingertip rotation and complex backgrounds. The tracking results are fairly accurate.
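A schematic sketch of the per-frame search just described (our own illustration; classify_prob stands in for the KDEM-plus-Bayesian classifier, and the Kalman prediction is assumed to be supplied as the predicted coordinates).

```python
import numpy as np

def track_fingertip(frame, x_pred, y_pred, classify_prob, search=10, patch=20):
    """Scan a search x search window around the predicted location; classify each
    patch x patch candidate and return the most probable fingertip location."""
    h, w = frame.shape
    half = patch // 2
    best_p, best_xy = -1.0, (x_pred, y_pred)
    for dy in range(-search // 2, search // 2):
        for dx in range(-search // 2, search // 2):
            y, x = y_pred + dy, x_pred + dx
            if half <= y < h - half and half <= x < w - half:
                cand = frame[y - half:y + half, x - half:x + half]
                p = classify_prob(cand)
                if p > best_p:
                    best_p, best_xy = p, (x, y)
    return best_xy, best_p

# Toy usage with a stand-in classifier that just scores patch brightness.
frame = np.random.rand(120, 160)
loc, prob = track_fingertip(frame, 80, 60, classify_prob=lambda c: c.mean())
```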

CONCLUSION
Two sampling schemes are proposed for efficient, kernel-based, nonlinear, multiple discriminant analysis. These algorithms identify a representative subset of the training samples for the purpose of classification. Benchmark tests show that KMDA with these adaptations not only outperforms linear MDA but also performs comparably with the best known supervised learning algorithms. We also present a self-supervised discriminant analysis technique, Kernel DEM (KDEM), which employs both labeled and unlabeled data in training. On synthetic data and real image databases for several applications, such as image classification, hand posture recognition, and fingertip tracking, KDEM shows superior performance over biased discriminant analysis (BDA), naive supervised learning, and some other existing semisupervised learning algorithms. Our future work includes several aspects: (1) We will look further into the regularization factor issue for the discriminant-based approaches on a large database; (2) We will intelligently integrate biased discriminant analysis for small numbers of training samples with traditional multiple discriminant analysis for large numbers of training samples and varying numbers of classes; (3) To avoid heavy computation over the whole database, we will investigate schemes for selecting a representative subset of



unlabeled data whenever unlabeled data helps, and perform parametric or nonparametric tests on the conditions under which it does not help; (4) Gaussian or Gaussian mixture models are assumed for the data distribution in the projected optimal subspace, even when the initial data distribution is highly non-Gaussian. We will examine the data modeling issue more closely with Gaussian (or Gaussian mixture) and non-Gaussian distributions.

ACKNOWLEDGMENT
This work was supported in part by the National Science Foundation (NSF) under EIA-99-75019 in the University of Illinois at Urbana-Champaign, and by the University of Texas at San Antonio.

REFERENCES

Basri, R., Roth, D., & Jacobs, D. (1998). Clustering appearances of 3D objects. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA.
Bellhumeur, P., Hespanha, J., & Kriegman, D. (1996). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. Proceedings of European Conference on Computer Vision, Cambridge, UK.
CBCL Face Database #1, MIT Center for Biological and Computation Learning. Online: http://www.ai.mit.edu/projects/cbcl
Cohen, I., Sebe, N., Cozman, F. G., Cirelo, M. C., & Huang, T. S. (2003). Learning Bayesian network classifiers for facial expression recognition with both labeled and unlabeled data. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI.
Cox, I. J., Miller, M. L., Minka, T. P., & Papathomas, T. V. (2000). The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing, 9(1), 20-37.
Cozman, F. G., & Cohen, I. (2002). Unlabeled data can degrade classification performance of generative classifiers. Proceedings of the 15th International Florida Artificial Intelligence Society Conference, Pensacola, FL (pp. 327-331).
Cui, Y., & Weng, J. (1996). Hand sign recognition from intensity image sequence with complex background. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, San Francisco (pp. 88-93).
Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks. New York: John Wiley & Sons.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: John Wiley & Sons.
Fisher, R. A. (1936). The use of multiple measurement in taxonomic problems. Annals of Eugenics, 7, 179-188.
Fisher, R. A. (1938). The statistical utilization of multiple measurements. Annals of Eugenics, 8, 376-386.
Friedman, J. (1989). Regularized discriminant analysis. Journal of American Statistical Association, 84(405), 165-175.

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). NJ: Prentice Hall.
Jain, A. K., & Farroknia, F. (1991). Unsupervised texture segmentation using Gabor filters. Pattern Recognition, 24(12), 1167-1186.
Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 228-233.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. (1999a). Fisher discriminant analysis with kernels. Proceedings of IEEE Workshop on Neural Networks for Signal Processing.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. R. (1999b). Invariant feature extraction and classification in kernel spaces. Proceedings of Neural Information Processing Systems, Denver, CO.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., Smola, A., & Müller, K. R. (2003). Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5).
Mitchell, T. (1999). The role of unlabeled data in supervised learning. Proceedings of the Sixth International Colloquium on Cognitive Science, Spain.
Nigram, K., McCallum, A. K., Thrun, S., & Mitchell, T. M. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.
Roth, V., & Steinhage, V. (1999). Nonlinear discriminant analysis using kernel functions. Proceedings of Neural Information Processing Systems, Denver, CO.
Rui, Y., Huang, T. S., Ortega, M., & Mehrotra, S. (1998). Relevance feedback: A power tool in interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5), 644-655.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Boston: MIT Press.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299-1319.
Smeulders, A., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.
Tian, Q., Hong, P., & Huang, T. S. (2000). Update relevant image weights for content-based image retrieval using support vector machines. Proceedings of IEEE International Conference on Multimedia and Expo, New York (vol. 2, pp. 1199-1202).
Tian, Q., Yu, J., Wu, Y., & Huang, T. S. (2004). Learning based on kernel discriminant-EM algorithm for image classification. IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Quebec, Canada.
Tieu, K., & Viola, P. (2000). Boosting image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Hilton Head, SC.
Tong, S., & Wang, E. (2001). Support vector machine active learning for image retrieval. Proceedings of ACM International Conference on Multimedia, Ottawa, Canada (pp. 107-118).
Vapnik, V. (2000). The nature of statistical learning theory (2nd ed.). Springer-Verlag.
Wang, L., Chan, K. L., & Zhang, Z. (2003). Bootstrapping SVM active learning by incorporating unlabelled images for image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI.


Weber, M., Welling, M., & Perona, P. (2000). Towards automatic discovery of object categories. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC.
Wolf, L., & Shashua, A. (2003). Kernel principle for classification machines with applications to image sequence interpretation. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI.
Wu, Y., & Huang, T. S. (2001). Self-supervised learning for object recognition based on kernel discriminant-EM algorithm. Proceedings of IEEE International Conference on Computer Vision, Vancouver, Canada.
Wu, Y., Tian, Q., & Huang, T. S. (2000). Discriminant EM algorithm with application to image retrieval. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC.
Zhou, X., & Huang, T. S. (2001). Small sample learning during multimedia retrieval using BiasMap. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Hawaii.

ENDNOTES
1. A term used in the kernel machine literature to denote the new space after the nonlinear transform; not to be confused with the feature space concept used in content-based image retrieval to denote the space of features or descriptors extracted from the media data.
2. The benchmark data sets are obtained from http://mlg.anu.edu.au/~raetsch/.
3. The MIT facial database can be downloaded from http://www.ai.mit.edu/projects/cbcl/software-datasets/FaceData2.html.
4. The Corel database is widely used as a benchmark in content-based image retrieval.



Section 2 Audio and Video Semantics: Models and Standards



Chapter 4

Context-Based Interpretation and Indexing of Video Data


Ankush Mittal, IIT Roorkee, India Cheong Loong Fah, The National University of Singapore, Singapore Ashraf Kassim, The National University of Singapore, Singapore Krishnan V. Pagalthivarthi, IIT Delhi, India

ABSTRACT

Most video retrieval systems work with a single shot without considering the temporal context in which the shot appears. However, the meaning of a shot depends on the context in which it is situated, and a change in the order of the shots within a scene changes the meaning of the shot. Recently, it has been shown that to find higher-level interpretations of a collection of shots (i.e., a sequence), intershot analysis is at least as important as intrashot analysis. Several such interpretations would be impossible without a context. Contextual characterization of video data involves extracting patterns in the temporal behavior of features of video and mapping these patterns to a high-level interpretation. A Dynamic Bayesian Network (DBN) framework is designed in which the temporal context of a segment of a video is considered at different granularities depending on the desired application. The novel applications of the system include classifying a group of shots called a sequence and parsing a video program into individual segments by building a model of the video program.


INTRODUCTION
Many pattern recognition problems cannot be handled satisfactorily in the absence of contextual information, as the observed values under-constrain the recognition problem, leading to ambiguous interpretations. Context is here loosely defined as the local domain from which observations are taken, and it often includes spatially or temporally related measurements (Yu & Fu, 1983; Olson & Chun, 2001), though our focus is on the temporal aspect, that is, measurements and the formation of relationships over larger timelines. Note that our definition does not address contextual meaning arising from culturally determined connotations, such as a rose as a symbol of love.

A landmark in the understanding of film perception was the Kuleshov experiments (Kuleshov, 1974). He showed that the juxtaposition of two unrelated images forces the viewer to find a connection between the two, and that the meaning of a shot depends on the context in which it is situated. Experiments concerning contextual details performed by Frith and Robson (1975) showed that a film sequence has a structure that can be described through selection rules.

In video data, each shot contains only a small amount of semantic information. A shot is similar to a sentence in a piece of text; it carries some semantic meaning which may not be comprehensible in the absence of sufficient context. Actions have to be developed sequentially; simultaneous or parallel processes are shown one after the other in a concatenation of shots. Specific domains contain rich temporal transitional structures that help in the classification process. In sports, the events that unfold are governed by the rules of the sport and therefore contain a recurring temporal structure. The rules of production of videos for such applications have also been standardized. For example, in baseball videos, there are only a few recurrent views, such as pitching, close-up, home plate, crowd, and so forth (Chang & Sundaram, 2000). Similarly, for medical videos, there is a fixed clinical procedure for capturing different video views, and thus temporal structures are exhibited. The sequential order of events creates a temporal context or structure. Temporal context helps create expectancies about what may come next, and when it will happen. In other words, temporal context may direct attention to important events as they unfold over time. With the assumption that there is inherent structure in most video classes, especially in the temporal domain, we can design a suitable framework for automatic recognition of video classes.

Typically, in a Content Based Retrieval (CBR) system, there are several elements which determine the nature of the content and its meaning. The problem can thus be stated as extracting patterns in the temporal behavior of each variable and also in the dynamics of the relationships between the variables, and mapping these patterns to a high-level interpretation. We tackle the problem in a Dynamic Bayesian Network framework that can learn the temporal structure through the fusion of all the features (for a tutorial, please refer to Ghahramani (1997)).

The chapter is organized as follows. A brief review of related work is presented first. Next we describe the descriptors that we use in this work to characterize the video. The algorithms for contextual information extraction are then presented, along with a strategy for building larger video models. Then we present an overview of the DBN framework and the structure of the DBN.
A discussion of what needs to be learned, and of the problems in using a conventional DBN learning approach, is also presented in that section. Experiments and results are then presented, followed by discussion and conclusions.

RELATED WORK
Extracting information from the spatial context has found its use in many applications, primarily in remote sensing (Jeon & Landgrebe, 1990; Kittler & Foglein, 1984), character recognition (Kittler & Foglein, 1984), and the detection of faults and cracks (Bryson et al., 1994). Extracting temporal information is, however, a more complicated task, but it has been shown to be important in many applications like discrete monitoring (Nicholson, 1994) and plan recognition tasks, such as tracking football players in a video (Intille & Bobick, 1995) and traffic monitoring (Pynadath & Wellman, 1995). Contextual information extraction has also been studied for problems such as activity recognition using graphical models (Hamid & Huang, 2003), visual intrusion detection (Kettnaker, 2003) and face recognition.

In an interesting analysis by Nack and Parkes (1997), it is shown how the editing process can be used to automatically generate short sequences of video that realize a particular theme, say humor. Thus, for the extraction of indices like humor, climax, and so forth, context information is very important. A use of context for finding an important shot in a sequence is highlighted by the work of Aigrain and Joly (1996). They detect editing rhythm changes through second-order regressive modeling of shot duration: the duration of the nth shot is predicted by a quantity PRED(n), whose coefficients a and b are estimated in a 10-shot sliding window. The rule employed therein is that if (T_n > 2 · PRED(n)) or (T_n < PRED(n)/2), then it is likely that the nth shot is an important (distinguished) shot in the sequence. The above model was based solely on tapping the rhythm information through shot durations.

Dorai and Venkatesh (2001) have recently proposed an algorithmic framework called computational media aesthetics for understanding the dynamic structure of the narrative via analysis of the integration and sequencing of audio/video elements. They consider expressive elements such as tempo, rhythm, and tone. Tempo or pace is the rate of performance or delivery; it is a reflection of the speed and time of the underlying events being portrayed and affects the overall sense of time of a movie. They define P(n), a continuous-valued pace function, as

P(n) = W(s(n)) + \frac{m(n) - \mu_m}{\sigma_m}

where s refers to the shot length in frames, m to the motion magnitude, μ_m and σ_m to the mean and standard deviation of motion, respectively, and n to the shot number. W(s(n)) is an overall two-part shot-length normalizing scheme, having the property of being more sensitive near the median shot length, but slowing in gradient as shot length increases into the longer range.

We would like to view contextual information extraction from a more generic perspective and consider the extraction of temporal patterns in the behavior of the variables.
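As a small illustration (ours, not from the cited work), the pace function can be sketched as follows; the two-part normalization W(s(n)) is only a stand-in, implemented here as a logistic curve centred at the median shot length.

```python
import numpy as np

def pace(shot_lengths, motion):
    """P(n) = W(s(n)) + (m(n) - mean(m)) / std(m), per shot n."""
    s = np.asarray(shot_lengths, dtype=float)   # shot lengths in frames
    m = np.asarray(motion, dtype=float)         # motion magnitude per shot
    med = np.median(s)
    # Stand-in for W(s): sensitive near the median shot length, flattening
    # for long shots; short shots give high pace, long shots low pace.
    W = 1.0 / (1.0 + np.exp((s - med) / med))
    return W + (m - m.mean()) / m.std()

# Example: pace values for five shots.
print(pace([12, 30, 18, 90, 10], [4.1, 1.2, 2.5, 0.4, 5.0]))
```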



THE DESCRIPTORS
The perceptual-level features considered in this work are Time-to-Collision (TTC), shot editing, and temporal motion activity. Since our emphasis is on presenting algorithms for contextual information, they are only briefly discussed here. The interested reader can refer to Mittal and Cheong (2001) and Mittal and Altman (2003) for more details.

TTC is the time needed for the observer to reach the object if the instantaneous relative velocity along the optical axis is kept unchanged (Meyer, 1994). There exists a specific mechanism in the human visual system designed to cause one to blink or to avoid a looming object approaching too quickly. A video shot with a small TTC evokes fear because it indicates a scene of impending collision. Thus, TTC can serve as a potent cue for the characterization of an accident or violence.

Although complex editing cues are employed by cameramen for making a coherent sequence, the two most significant ones are shot transitions and shot pacing. The use of a particular shot transition, like a dissolve, wipe, fade-in, and so forth, can be associated with the possible intentions of the movie producer. For example, dissolves have typically been used to bring about smoothness in the passage of time or place. A shot depicting a close-up of a young woman followed by a dissolve to a shot containing an old woman suggests that the young woman has become old. Similarly, shot pacing can be adjusted for creating the desired effects, like building up tension by using a fast cutting rate. The third feature, that is, the temporal motion feature, characterizes motion via several measures such as total motion activity, distribution of motion, local motion/global motion, and so forth.

Figure 1. Shots leading to the climax in a movie

(Panels: Chase 1, Collision Alarm, Climax scene)



Figure 2. Feature values for the shots in Figure 1. Shots in which TTC length is not shown correspond to a case when TTC is infinite; that is, there is no possibility of collision.

Figure 1 shows the effectiveness of TTC in content characterization with the example of a climax. Before the climax of a movie, there are generally some chase scenes leading to the meeting of the bad protagonists of the movie with the good ones. One example of such a movie is depicted in this figure, where the camera shows the perspective of both the prey and the predator, leading to several long stretches of impending collision, as depicted in Figure 2. During the climax, there is direct contact, and therefore the collision lengths are small and the collisions frequent. Combined with large motion and small shot length (both relative to the context), TTC could be used to extract the climax sequences.

Many works in the past have focused on shot classification based on single-shot features like color, intensity variation, and so forth. We believe that since each video class represents structured events unfolding in time, the appropriate class signatures are

Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

82 Mittal, Fah, Kassim & Pagalthivarthi

Figure 3. A hierarchy of descriptors


Semantic Structure Media Descriptor Large-timeline Descriptors Context over large timeline Feelings Activity Scene Characteristics Sequence-level Descriptors Context over shots/ Mapping Collision Detection Shot Transition Motion Descriptors Shot-level Descriptors Context Within Shot Optical Flow Color Statistics Frame Difference Frame-level Descriptors

also present in the temporal domain. The features such as shot-length, motion, TTC, and so forth, were chosen to illustrate the importance of temporal information, as they are perceptual in nature, as opposed to low-level features such as color, and so forth.

CONTEXT INFORMATION EXTRACTION


Hierarchy of Context
Depending on the level of abstraction of the descriptors, the context information can be coarsely (i.e., large neighborhood) or finely integrated. Figure 3 shows a hierarchy of descriptors in a bottom-up fashion where each representational level derives its properties from the lower levels. At the lowest level of the hierarchy are properties like color, optical flow, and so forth, which correspond only to the individual frames. They might employ the spatial context methods that are found in the image processing literature. At the second lowest level are descriptors such as TTC, shot transition details, and so forth, which derive their characteristics from a number of frames. For example, patterns in the mean and variance of color features computed over a number of frames (i.e., context) are used in identifying shot-transition (Yeo & Liu, 1995). Similarly, the context over frames is used for collision detection by identifying a monotonically decreasing TTC, and for the extraction of motion descriptors (like local-motion/global-motion). At the next higher level are the sequence level descriptors (or indices), which might require the information of several shots. Some examples of these descriptors could be in terms of feelings (like interesting, horror, excitement, sad, etc.), activity (like accident, chase, etc.) and scene characteristics (like climax, violence, newscaster, commercial,
etc.). For example, a large number of collisions typically depicts a violent scene, whereas one or two collision shots followed by a fade to a long shot typically depicts a melancholy scene. If the context information over the neighboring shots is not considered, then these and many other distinctions would not be possible. Finally, these indices are integrated over a large timeline to obtain semantic structures (as for news and sports) or media descriptors. An action movie has many scenes where violence is shown, a thriller has many climax scenes, and an emotional movie has many sentimental scenes (with close-ups, special effects, etc.). These make it possible to perform automatic labeling as well as efficient screening of the media (for example, for violence or profanity). Descriptors at different levels of the hierarchy need to be handled with different strategies. One basic framework is presented in the following sections.

The Algorithm
Figure 4 depicts the steps in context information extraction (mainly the first pass over the shots). The digitized media is kept in the database and the shot transition module segments the raw video into shots. The sequence composer groups a number of shots based on the similarity values of the shots (Jain, Vailaya, & Wei, 1999), depending on the selected application; for applications like climax detection, the sequences consist of a much larger number of shots than for the sports identifier. Feature extraction on a sequence yields an observation sequence, which consists of feature vectors corresponding to each shot in the sequence. An appropriate DBN model is selected based on the application, which determines the complexity of the mapping required. Thus, each application has its own model (although the DBN could be built so that a few applications share a common model, the performance would not be optimal). During the training phase, only sequences corresponding to positive examples of the domain are used for learning. During the querying or labeling phase, the DBN evaluates the likelihood that the input observation sequence belongs to the domain it represents. The sequence labeling module and the application selector communicate with each other to define the task that needs to be performed, with the output being a set of likelihoods. If the application involves classifying into one of several exclusive genres, the label corresponding to the DBN model with the maximum likelihood is assigned. In general, however, a threshold is chosen (automatically during the training phase) over the likelihoods of a DBN model; if the likelihood exceeds the threshold for a DBN model, the corresponding label is assigned. Thus, a sequence can have zero, one, or multiple labels (such as interesting, soccer, and violent) after the first pass. Domain rules aid further classification in the subsequent passes, the details of which are presented in the next section.
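A minimal sketch of this first-pass labeling loop follows. The model and threshold interfaces (a `loglik` method per trained class model, one threshold per model) are assumptions made for illustration; toy stand-in models are used so the example runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

@dataclass
class TrainedModel:
    """Stand-in for a trained per-class DBN: just a stored scoring
    function here, so that the sketch is runnable end to end."""
    loglik_fn: Callable[[Sequence], float]

    def loglik(self, obs: Sequence) -> float:
        return self.loglik_fn(obs)

def first_pass_labels(observation_sequences: List[List[List[float]]],
                      models: Dict[str, TrainedModel],
                      thresholds: Dict[str, float]) -> List[List[str]]:
    """Assign zero, one or several labels to each observation sequence by
    thresholding each per-class model's likelihood (first pass of Figure 4)."""
    all_labels = []
    for obs in observation_sequences:
        labels = [name for name, m in models.items()
                  if m.loglik(obs) > thresholds[name]]
        all_labels.append(labels)
    return all_labels

if __name__ == "__main__":
    # Toy models: score a sequence by its average "motion" feature (index 1).
    models = {
        "soccer": TrainedModel(lambda o: sum(f[1] for f in o) / len(o)),
        "news":   TrainedModel(lambda o: -sum(f[1] for f in o) / len(o)),
    }
    thresholds = {"soccer": 0.5, "news": -0.2}
    seqs = [[[0.3, 0.9], [0.2, 0.8]],   # high motion: labeled soccer
            [[0.5, 0.1], [0.6, 0.0]]]   # low motion: labeled news
    print(first_pass_labels(seqs, models, thresholds))
```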

Context Over Sequences: The Subsequent Passes


The algorithm in the subsequent passes is dependent on the application (or the goal). The fundamental philosophy, however, is to map the sequences with their labels onto a one-dimensional domain and apply the algorithms relevant in the spatial context. A similar example can be found in the character recognition task, where the classification accuracy of a character can be improved by looking at the letters both preceding and following it (Kittler & Foglein, 1984). In this case, common arrangements of the letters (e.g., in English: qu, ee, tion) are used to provide a context within which the character may be interpreted. Three examples of algorithms at this level are briefly presented as follows:
1. Probabilistic relaxation (Hummel & Zucker, 1983) is used to characterize a domain like the identification of climax, which has a lot of variation amongst the training samples. In probabilistic relaxation, each shot within the contextual neighborhood is assigned a label with a given probability (or likelihood) by the DBN models. A sliding window is chosen around the center shot. Iterations of the relaxation process update each of the labels around the central shot with respect to a compatibility function between the labels in the contextual neighborhood. In this manner, successive iterations propagate context throughout the timeline. The compatibility function encodes constraint relationships, such as: it is more likely to find a climax scene at a small distance from another climax scene. In other words, the probabilities of both climax scenes are increased after each pass. The relaxation process is stopped when the number of passes exceeds a fixed value or when further passes do not bring about significant changes to the probabilities.
2. Context cues can also aid in improving the classification accuracy of shots. Consider a CBR system that considers an individual shot or a group of shots (say, 4 to 10), that is, a sequence seq_i. A sequence seq_i is classified into one of the classes (like MTV, soccer, etc.) with label_i on the basis of its low-level and high-level properties. Figure 5 shows the time-layout of sequences with their labels; effect_i is the transition effect between seq_i and seq_{i+1}. Consider the 2M neighborhood sequences (in one dimension) of seq_i. The nature of the video domain puts constraints on the label relationships within the neighborhood: for example, it is generally not possible to find a tennis shot in between several soccer shots, whereas a few commercial shots may certainly be present in a soccer match. Our strategy to reduce misclassifications is to slide a window of size 2M+1 over each sequence and check the labels and effects in the neighborhood against these constraints (a sketch of this check follows Figure 5). If label_i and label_{i-1} are to be different (i.e., there is a change of video class), effect_i should be a special effect (and not a flat cut), and label_i should match at least a few labels ahead, that is, label_{i+1} and so on. In the second pass, consecutive items with the same labels are clustered together, and a reallotment of the labels is done on the basis of rules, such as requiring the length of each clustered sequence to be greater than a minimum value. The second pass can also be used to combine unclassified clustered sequences with appropriate neighboring items. This strategy can also be used to retrieve an entire program. By classifying only individual shots, retrieval of the entire program is a difficult task, since the model can only be built for certain sections of the program; for example, it is not easy to model outdoor shots in news, or in-between audience shots in sports. Through this pass scheme, however, parts of the same program can be appropriately combined to form one unit.
3. The goal of the subsequent pass could be to extract information about the media (such as the media being an action, thriller or emotional movie), or to construct semantic timelines. For media descriptors like action, the pass involves counting the number of violent scenes, with due consideration to their degrees of violence, as estimated by the likelihood generated by the DBN model. The pass also considers motion, shot length, and so forth, in the scenes classified as nonviolent. Constraint conditions are also enhanced during the pass (for instance, the shot length is on average smaller in an action movie than in a thriller). The application of constructing semantic timelines is considered in detail in the next section.

Figure 4. Steps in context information extraction (details of the first pass): the media database feeds a shot transition detector; the sequence composer groups shots; the feature extraction module (TTC, motion, color, ...) produces an observation sequence; DBN models 1 to k score the sequence for labeling, with an application selector providing information and feedback and selecting the mode and input for the second pass

Figure 5. Subsequent passes perform analysis on the context line of sequences: a window of 2M+1 sequences Seq_{i-M}, ..., Seq_i, ..., Seq_{i+M}, each with its label and the transition effect between neighbors, centered on Seq_i along the timeline
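The following is a minimal sketch of the neighborhood check in item 2, simplified to a backward/forward label test rather than a full 2M+1 window; the rule parameters and effect names are illustrative, not taken from the chapter.

```python
def smooth_labels(labels, effects, min_run=3):
    """Second-pass smoothing over the context line of sequences.

    labels[i]  : first-pass label of sequence i (or None if unlabeled)
    effects[i] : transition effect leading into sequence i, e.g. "cut",
                 "wipe", "dissolve", "fade" (effects[0] is unused)

    A change of video class is accepted only if it coincides with a
    special effect (not a flat cut) and the new label persists for at
    least `min_run` sequences; otherwise the change is treated as a
    misclassification and reverted.
    """
    out = list(labels)
    n = len(out)
    for i in range(1, n):
        if out[i] == out[i - 1]:
            continue
        change_ok = effects[i] != "cut"
        run = 1
        while i + run < n and out[i + run] == out[i]:
            run += 1
        if not change_ok or run < min_run:
            out[i] = out[i - 1]
    return out

if __name__ == "__main__":
    labels  = ["soccer", "soccer", "tennis", "soccer", "soccer", "ad", "ad", "ad"]
    effects = ["cut", "cut", "cut", "cut", "cut", "fade", "cut", "cut"]
    print(smooth_labels(labels, effects))   # the stray "tennis" label is reverted
```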

Building Larger Models


How DBNs can be made to learn larger models of video is illustrated here. Consider, for example, parsing broadcast news into different sections, and the user is provided with
a facility to browse any one of them. Examples of such queries could be "Show me the sport clip which came in the news" and "Go to the weather report." If a video can be segmented into its scene units, the user can browse through that video more conveniently on a scene basis rather than on a shot-by-shot basis, as is commonly done in practice. This allows a significant reduction of the information to be conveyed or presented to the user.

Zweig (1998) has shown that a Finite State Automaton (FSA) can be modeled using DBNs. The same idea is explored here in the domain of a CBR system. Video programs evolve through a series of distinct processes, each of which is best represented by a separate model. When modeling these processes, it is convenient to create submodels for each stage, and to model the entire process as a composition of atomic parts. By factoring a complex model into a combination of simpler ones, we achieve a combinatorial reduction in the number of models that need to be learned. Thus, a probabilistic nondeterministic FSA can be constructed as shown in Mittal and Altman (2003). In the FSA, each state represents a stage of development and can be constructed either manually or automatically from a few hours of news programs. Since most video programs begin and end with a specific video sequence (which can be recognized), modeling the entire structure through an FSA, which has explicitly defined start and end states, is justified. The probabilistic FSA of news has transitions on the type of shot cut (i.e., dissolve, wipe, etc.), each with a probability.

The modeling of the FSA by a DBN can be done as follows. The position in the FSA at a specific time is represented by a state variable in the DBN. The DBN transition variable encodes which arc is taken out of the FSA state at any particular time; the number of values the transition variable assumes is equal to the maximum out-degree of any of the states in the FSA. The transition probabilities associated with the arcs in the automaton are reflected in the class probability tables associated with the transition variables in the DBN.
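A small sketch of how the arcs of a probabilistic FSA can be turned into the tables a DBN time slice needs, following the mapping described above. The toy news automaton and all state names are hypothetical; they are not the FSA used by the authors.

```python
from collections import defaultdict

def fsa_to_dbn_tables(fsa_arcs):
    """fsa_arcs[state] is a list of (next_state, probability) pairs.

    The DBN state variable tracks the FSA position; a transition variable
    with as many values as the maximum out-degree selects which arc is
    taken, and the arc probabilities fill its conditional probability
    table.  The successor map makes the next state deterministic given
    (state, chosen arc).
    """
    max_out = max(len(arcs) for arcs in fsa_arcs.values())
    transition_cpt = defaultdict(dict)      # P(transition = k | state)
    next_state = {}                         # (state, k) -> successor state
    for state, arcs in fsa_arcs.items():
        for k, (succ, prob) in enumerate(arcs):
            transition_cpt[state][k] = prob
            next_state[(state, k)] = succ
    return max_out, dict(transition_cpt), next_state

if __name__ == "__main__":
    news_fsa = {                                        # hypothetical example
        "intro":     [("headlines", 1.0)],
        "headlines": [("report", 0.7), ("weather", 0.3)],
        "report":    [("report", 0.5), ("sport", 0.3), ("weather", 0.2)],
        "sport":     [("weather", 1.0)],
        "weather":   [("end", 1.0)],
        "end":       [("end", 1.0)],
    }
    arity, cpt, succ = fsa_to_dbn_tables(news_fsa)
    print("transition variable arity:", arity)
    print("P(arc | state=report):", cpt["report"])
```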

DYNAMIC BAYESIAN NETWORKS


DBNs (Ghahramani, 1997; Nicholson, 1994; Pavlovic, Frey, & Huang, 1999) are a class of graphical, probabilistic models that encode dependencies among sets of random variables evolving in time. They generalize the Hidden Markov Models (HMM) and the Linear dynamical systems by adopting a wider range of topologies and inference algorithms. DBN has been used to temporally fuse heterogeneous features like face, skin and silence detectors for tackling the problem of speaker detection (Garg et al., 2000; Pavlovic et al., 2000). An important application which demonstrates the potential of DBN is multivariate classification of business cycles in phases (Sondhauss & Weihs, 1999). Modeling a DBN with enough hidden states allows the learning of the patterns of variation shown by each feature in individual video classes (or high-level indices). It can also establish correlations and associations between the features leading to the learning of conditional relationships. Temporal contextual information from temporal neighbors is conveyed to the current classification process via the class transition probabilities. Besides the fact that DBNs were explicitly developed to model temporal domain, the other reason for preferring DBN over time-delay neural networks or modified versions of other
learning tools like SVM, and so forth, is that DBN offers the interpretation in terms of probability that makes it suitable to be part of a larger model.

Structuring the CBR Network

Consider an observation sequence seq_T consisting of feature vectors F_0, . . . , F_T for T + 1 shots (typically seven to 30 shots). Since multiple-label assignment should be allowed in the domain of multimedia indexing (for example, interesting + soccer), each video class or index is represented by its own DBN model. A DBN model of a class is trained with preclassified sequences to extract the characteristic patterns in the features. During the inference phase, each DBN model gives a likelihood measure, and if this exceeds the threshold for the model, the label is assigned to the sequence.

Let the set of n CBR features at time t be represented by F_t^1, . . . , F_t^n, where the feature vector F_t is a subset of Z_t, the set of all the observed and hidden variables. The system is modeled as evolving in discrete time steps and is a compact representation of the two time-slice conditional probability distribution P(Z_{t+1} | Z_t). Both the state evolution model and the observation model form part of the system, such that the Markov assumption and the time-invariance assumption hold. The Markov assumption simply states that the future is independent of the past given the present. The time-invariance assumption means that the process is stationary, that is, P(Z_{t+1} | Z_t) is the same for all t, which simplifies the learning process. A DBN can be expressed in terms of two Bayesian networks (BN): a prior network BN_0, which specifies a distribution over the initial states, and a transition network BN_→, which represents the transition probability from state Z_t to state Z_{t+1}. Although a DBN defines a distribution over infinite trajectories of states, in practice reasoning is carried out on a finite time interval 0, . . . , T by unrolling the DBN structure into a long sequence of BNs over Z_0, . . . , Z_T. In time slice 0, the parents of Z_0 and its conditional probability distributions (CPDs) are those specified in the prior network BN_0. In time slice t + 1, the parents of Z_{t+1} and its CPDs are specified in BN_→. Thus, the joint distribution over Z_0, . . . , Z_T is

P(z_0, \ldots, z_T) = P_{BN_0}(z_0) \prod_{t=0}^{T-1} P_{BN_\rightarrow}(z_{t+1} \mid z_t)

DBN Computational Tasks

Consider a DBN model λ and an observation sequence seq_T consisting of feature vectors F_1, . . . , F_m. There are three basic problems, namely inference, learning and decoding of a model, which are useful in a CBR task.

The inference problem can be stated as computing the probability P(λ | seq_i) that the observation sequence is produced by the model. The classical solution to DBN inference is based on the same theory as the forward-backward propagation for HMMs (Rabiner, 1989). The algorithm propagates forward messages α_t from the start of the sequence, gathering evidence along the way. A similar process is used to propagate the backward messages β_t in the reverse direction. The posterior distribution γ_t over the states at time t is simply α_t(z)β_t(z) (with suitable re-normalization). The joint posterior over the states at t and t + 1 is proportional to α_t(z_t) P_{BN_→}(z_{t+1} | z_t) β_{t+1}(z_{t+1}) P_{BN_→}(F_{t+1} | z_{t+1}). The learning algorithm for dynamic Bayesian networks follows from the EM algorithm. The goal of sequence decoding in a DBN is to find the most likely state sequence of hidden variables given the observations, that is, X_T^* = \arg\max_{X_T} P(X_T | seq_T). This task can be achieved by using the Viterbi algorithm (Viterbi, 1967), based on dynamic programming. Decoding attempts to uncover the hidden part of the model and outputs the state sequence that best explains the observations. The previous section presented an application to parse a video program where each state corresponds to a segment of the program.
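For concreteness, here is the Viterbi decoding step for the simplest case, a plain discrete HMM parameterization; the full DBN case factors the state, but the dynamic-programming recursion is the same in spirit. The toy parameters are illustrative.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state path arg max_X P(X | observations).

    log_pi[i]   : log prior of state i
    log_A[i, j] : log transition probability i -> j
    log_B[i, o] : log probability of emitting symbol o from state i
    """
    T, S = len(obs), len(log_pi)
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A      # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    log = np.log
    pi = log(np.array([0.6, 0.4]))
    A = log(np.array([[0.8, 0.2], [0.3, 0.7]]))
    B = log(np.array([[0.9, 0.1], [0.2, 0.8]]))
    print(viterbi(pi, A, B, [0, 0, 1, 1, 1]))
```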

What Can Be Learned?


An important question that can be raised is this: Which patterns can be learned and which cannot? Extracting temporal information is complicated. Some of the aspects of the temporal information that we attempt to model are:
1. Temporal ordering. Many scene events have precedent-antecedent relationships. For example, as discussed in the sports domain, shots depicting interesting events are generally followed by a replay. These replays can be detected in a domain-independent manner and the shots preceding them can be retrieved, which would typically be the highlights of the sport.
2. Logical constraints. These could be of the form that event A occurs either before event B or event C, or later than event D.
3. Time duration. The time an observation sequence lasts is also learnt. For example, a climax scene is not expected to last over hundreds of shots. Thus, a very long sequence having feature values similar to a climax (like a large number of TTC shots and small shot lengths) should be classified as nonclimax.
4. Association between variables. For instance, during the building up of a climax, the TTC measure should continuously decrease while the motion should increase.

The Problems in Learning


Although a DBN can simultaneously work with continuous and discrete variables, some characteristic effects which do not occur often might not be properly modeled by DBN learning. A case in point is the presence of a black frame at the end of every commercial. Since it is just a spike in a time-frame, the DBN associates more meaning with its absence and treats its presence at the end of a sequence as noise. Therefore, we included a binary variable, presence of a black frame, which takes the value 1 for every frame of a sequence that ends with a black frame. While the DBN discovers the temporal correlations of the variables very well, it does not support the modeling of long-range dependencies and aggregate influences from variables evolving at different speeds, which can occur frequently in the video domain. A suggested solution to this problem is to explicitly search for violations of the Markov property at widely varying time granularities (Boyen et al., 1999). Another crucial point to be considered in the design of the DBN is the number of hidden states: an insufficient number of hidden states would not model all the variables, while an excessive number would generally lead to overlearning.

Since a separate DBN is employed for each video class, the optimization of the number of hidden states is done by trial and test so as to achieve the best performance on the training data. Since feature extraction on video classes is very inefficient (three frames/second!), we do not at present have enough data to learn the structure of the DBN for our application; we assumed the simple DBN structure shown in Figure 6. Another characteristic of DBN learning is that features which are not relevant to the class have their estimated probability density functions spread out, which can easily be noticed, and these features can be removed. For example, the shot-transition length has no significance in cricket sequences, and thus this feature can be removed.

Like most learning tools, the DBN also requires a few representative training sequences of the model to help extract the rules. For example, just by training on the interesting sequences of cricket, the interesting sequences of badminton cannot be extracted. This is because the parameter learning is not generalized enough: the DBN is specific in learning the shot length specific to cricket, along with the length of the special effect (i.e., wipe) used to indicate a replay. A human expert, on the other hand, would probably be in a position to generalize these rules. Of course, this learning strategy of the DBN has its own advantages. Consider, for example, a sequence of cricket shots (which is noninteresting) having two wipe transitions, where the transitions are separated by many flat-cut shots. Since the DBN is trained to expect only one to three shots in between two wipe transitions, it would not recognize this as an interesting sequence.

Another practical problem that we faced was deciding the optimum length of shots for DBN learning. If some redundant shots are included, the DBN fails to learn the pattern. For example, in the climax application, if we train the DBN only with the climax scene, it can perform the classification with good accuracy; however, if we increase the number of shots, the classification accuracy drops. This implies that the training samples should have just enough shots to model or characterize the pattern (which at present requires human input). Modeling video programs is usually tough, as they cannot be mapped to templates except in exceptional cases. The number of factors which contribute toward variations in the parameters is large, and thus the stochastic learning in the DBN is highly suitable. We therefore rely on subsequent passes, which take the initial probabilities assigned by the DBN and try to improve the classification performance based on the task at hand.

Figure 6. DBN architecture for the CBR system: a state evolution model over the states at times t-1, t and t+1, and an observation model in which the feature vector for each time slice comprises black frame, cut type, presence of TTC, shot length, length of transition, length of TTC, and frame difference. The black frame feature has value 1 if the shot ends with a black frame and 0 otherwise; it is especially relevant in the detection of commercials.

EXPERIMENTS AND APPLICATIONS


For the purpose of experimentation, we recorded sequences from TV using a VCR and grabbed shots in MPEG format. The size of the database was around four hours 30 minutes (details are given in Table 1), drawn from video sequences of different categories. The frame dimension was 352×288. The principal objective of these experiments is not to demonstrate the computational feasibility of some of the algorithms (for example, TTC or shot detection); rather, we want to demonstrate that the constraints/laws derived from the structure or patterns of interaction between producers and viewers are valid. A few applications are considered to demonstrate the working and effectiveness of the present work. In order to highlight the contribution of the perceptual-level features and the contextual information discussed in this chapter, other features like color, texture, shape, and so forth, are not employed. The set of features employed is shown in Figure 6.

Many works in the past (including ours (Mittal & Cheong, 2003)) have focused on shot classification based on single-shot features like color, intensity variation, and so forth. We believe that since each video class represents structured events unfolding in time, the appropriate class signatures are also present in the temporal domain, in features such as shot length, motion, TTC, and so forth. For example, though the plots of log shot length for news, cricket and soccer were very similar, the frequent presence of wipes and high motion distinguishes cricket and soccer from news.

Sequence Classifier
As discussed before, the problem of sequence classification is to assign labels from one or more of the classes to the observation sequence, seq T consisting of feature vectors F0 , . . . , FT . Table 2 shows the classification performance of the DBN models for six video classes. The experiment was conducted by training each DBN model with preclassified sequences and testing with unclassified 30 sequences of the same class and 50

Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Context-Based Interpretation and Indexing of Video Data 91

Table 1. Video database used for experimentation


Type Sports MTV Commercial Movie BBC News News from Singapore TCS Channel 5 ((consist of 2 commercial breaks) Total Duration (min sec) 6350 1210 1713 2725 6454 8750 4 hr 33 min 22 sec

sequences of other classes. The DBN models for different classes had a different value of T, which was based on optimization performed during the training phase. The number of shots for DBN model of news was 5 shots, of soccer 8 shots, and of commercials 20 shots. The news class consists of all the segments, that is, newscaster shots, outdoor shots, and so forth. The commercial sequences all have a black frame at the end. There are two paradigms of training the DBN: (i) Through a fixed number of shots (around seven to 10), and (ii) through a fixed number of frames (400 to 700 frames). The first paradigm works better than the second one as is clear from the recall and precision rates. This could be due to two reasons: first, having a fixed number of frames does not yield proper models for classes, and second, the DBN output is in terms of likelihood, which reduces as there are a larger number of shots in a sequence. Large numbers of shots implies more state transitions, and since each transition has probability of less than one, the overall likelihood decreases. Thus, classes with longer shotlengths are favored in the second paradigm, leading to misclassifications. In general, DBN modeling gave good results for all the classes except soccer. It is interesting to note that commercials and news were detected with very high precision because of the presence of the characteristic black frame and the absence of high-motion, respectively. A large number of fade-in and fade-out effects are present in the MTV and commercial classes, and dissolves and wipes are frequently present in sports. The black frame feature also prevents the MTV class being classified as a commercial, though both of them have similar shot lengths and shot transitions. The poor performance of the soccer class could be explained by the fact that the standard deviations of the DBN parameters corresponding to motion or shot length features were large, signifying that the degree of characterization of the soccer class was poor by features such as shot length. On the other hand, cricket was much more structured, especially with a large number of wipes.

Highlight Extraction
In replays, the following conventions are typically used:
1. Replays are bounded by a pair of identical gradual transitions, which can either be a pair of wipes or a pair of dissolves.
2. Cuts and dissolves are the only transition types allowed between two successive shots during a replay.

Figure 7 shows two interesting scenes from the soccer and cricket videos. Figure 7(a) shows a soccer match in which a player touched the football with his arm. A cut to a close-up of the referee showing a yellow card is used. A wipe is employed at the beginning of the replay, showing how the player touched the ball. A dissolve is used to show the expression of the team's manager, followed by a wipe to indicate the end of the replay. Finally, a close-up view of the player who received the yellow card is shown.

For the purpose of experiments, only two sports, cricket and soccer, were considered, although the same idea can be extended to most of the sports, if not all. This is because the interesting scene analysis is based on detecting the replays, the structure of which is the same for most sports. Below is a typical format of the transition effects around an interesting shot in cricket and soccer.

E^α . C . G1 . E^β . G2 . E^γ, where E ∈ {Cut, Dissolve}, C ∈ {Cut}, G1, G2 ∈ {Wipe, Dissolve} (G1 and G2 are the gradual transitions before and after the replay), D ∈ {Dissolve}, α, β, γ ∈ {natural numbers}, and "." is a "followed by" operator.
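One way to operationalize such a pattern is a regular expression over one-letter transition codes, as sketched below. The specific regex is an illustrative reading of the format above (a pair of gradual transitions bounding a run of cuts and dissolves), not the authors' detector.

```python
import re

# One-letter codes for shot transitions: C = cut, D = dissolve, W = wipe.
REPLAY_PATTERN = re.compile(r"[CD]+C[WD][CD]+[WD][CD]+")

def find_replay_spans(transitions):
    """Return (start, end) spans of transition subsequences matching the
    replay convention: a pair of gradual transitions (wipe or dissolve)
    bounding a run in which only cuts and dissolves occur."""
    coded = "".join(transitions)
    return [(m.start(), m.end()) for m in REPLAY_PATTERN.finditer(coded)]

if __name__ == "__main__":
    # cuts ... wipe (replay starts) ... dissolve ... wipe (replay ends) ... cuts
    seq = ["C", "C", "C", "W", "C", "D", "W", "C", "C"]
    print(find_replay_spans(seq))
```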

Figure 7. Typical formats of the transition effects used in the replay sequences of (a) soccer match (b) cricket match

This special structure of replay is used for identifying interesting shots. The training sequence for both sports, which was typically 5 to 7 shots, consisted of interesting shots followed by the replays. Figure 8 shows the classification performance of DBN for highlight extraction. A threshold could be chosen on the likelihood returned by DBN based on training data (such that the threshold is less than the likelihood of most of the training sequences). All sequences possessing a likelihood more than the threshold of a DBN model are assigned an interesting class label. Figure 8 shows that only two misclassifications result with the testing. One recommendation could be to lower the threshold such that no interesting sequences are missed; although a few uninteresting sequences may also be labeled and retrieved to the user. Once an interesting scene is identified, the replays are removed before presenting it to the user. An interesting detail in this application is that one might want to show the scoreboard during the extraction of highlights (scoreboards are generally preceded and followed by many dissolves). DBN can encode in its learning both the previous model of the shot followed by a replay and the scoreboard sequences with their characteristics.

Climax Characterization and Censoring Violence


Climax is the culmination of a gradual building up of tempo of events, such as chase scenes, which end up typically in violence or in passive events. Generally, a movie with a lot of climax scenes is classified as a thriller while a movie which has more violence scenes is classified as action. Sequences from thriller or action movies could be used to train a DBN model, which can learn about the climax or violence structures easily. This application can be used to present a trailer to the user, or to classify the movie into different media categories and restrict the presentation of media to only acceptable scenes.

Figure 8. Classification performance for highlight extraction (log likelihood per cricket test sequence, with the decision threshold and the interesting, uninteresting and misclassified sequences marked)

Over the last two decades, a large body of literature has linked exposure to violent television with increased physical aggressiveness among children and violent criminal behavior (Kopel, 1995, p. 17; Centerwall, 1989). The censoring process can be modeled to restrict access to media containing violence. Motion alone is insufficient to characterize violence, as many acceptable classes (especially sports like car racing) also possess high motion. On the other hand, shot length and especially TTC are highly relevant, because the camera is generally at a short distance during violence (thus the movements of the actors yield the impression of impending collisions). The cutting rate is also high, as many perspectives are generally covered. The set of features used, though, remains the same as in the sequence classifier application. Figure 10 shows a few images from a violent scene of a movie.

Figure 9 shows the classification performance for a censoring application. For each sequence in the training and test database, the opinions of three judges were sought in terms of one of three categories: violent, nonviolent and cannot say. Majority voting was used to decide if a sequence is violent. The test samples were from sports, MTV, commercials and action movies. Figure 9 shows that while the violent scenes were correctly detected, two MTV sequences were misclassified as violent. For one of these MTV sequences, the opinion of two judges was cannot say, while the third classified it as violent (Figure 11). The other sequence contained too many objects close to the camera and a high cutting rate, although it should be classified as nonviolent.

Figure 9. Classification performance for violence detection (log likelihood per test sequence, with the decision threshold and misclassified sequences marked)

Figure 10. A violent scene that needs to be censored

DISCUSSION AND CONCLUSION

In this chapter, the integration of temporal context information and the modeling of this information through the DBN framework were considered.

Modeling through a DBN removes the cumbersome task of manually designing a rule-based system. The design of such a rule-based system would have to be based on low-level details, such as thresholding; besides, many temporal structures are difficult to observe but can be extracted by automatic learning approaches. The DBN's assignment of initial labels to the data prepares it for the subsequent passes, where expert knowledge can be used without much difficulty.

The experiments conducted in this chapter employed a few perceptual-level features, whose temporal properties are more readily understood than those of low-level features. Though the inclusion of low-level features could enhance the characterization of the categories, it would raise the important issue of dealing with the high dimensionality of the feature space in the temporal domain. Thus, the extraction of information, which involves learning, decoding and inference, would require stronger and more efficient algorithms.

Figure 11. An MTV scene which was misclassified as violent: there are many moving objects near the camera and the cutting rate is high.

REFERENCES
Aigrain, P., & Joly, P. (1996). Medium knowledge-based macro-segmentation of video into sequences. Intelligent Multimedia Information Retrieval.
Boyen, X., Friedman, N., & Koller, D. (1999). Discovering the hidden structure of complex dynamic systems. Proceedings of Uncertainty in Artificial Intelligence (pp. 91-100).
Bryson, N., Dixon, R. N., Hunter, J. J., & Taylor, C. (1994). Contextual classification of cracks. Image and Vision Computing, 12, 149-154.
Centerwall, B. (1989). Exposure to television as a risk factor for violence. Journal of Epidemiology, 643-652.
Chang, S. F., & Sundaram, H. (2000). Structural and semantic analysis of video. IEEE International Conference on Multimedia and Expo (pp. 687-690).
Dorai, C., & Venkatesh, S. (2001). Bridging the semantic gap in content management systems: Computational media aesthetics. International Conference on Computational Semiotics in Games and New Media (pp. 94-99).
Frith, U., & Robson, J. E. (1975). Perceiving the language of film. Perception, 4, 97-103.
Garg, A., Pavlovic, V., & Rehg, J. M. (2000). Audio-visual speaker detection using dynamic Bayesian networks. IEEE Conference on Automatic Face and Gesture Recognition (pp. 384-390).
Ghahramani, Z. (1997). Learning dynamic Bayesian networks. Adaptive Processing of Temporal Information. Lecture Notes in AI. Springer-Verlag.
Hamid, I. E., & Huang, Y. (2003). Argmode activity recognition using graphical models. IEEE CVPR Workshop on Event Mining: Detection and Recognition of Events in Video (pp. 1-7).
Hummel, R. A., & Zucker, S. W. (1983). On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 267-287.
Intille, S. S., & Bobick, A. F. (1995). Closed-world tracking. IEEE International Conference on Computer Vision (pp. 672-678).
Jain, A. K., Vailaya, A., & Wei, X. (1999). Query by video clip. Multimedia Systems, 369-384.
Jeon, B., & Landgrebe, D. A. (1990). Spatio-temporal contextual classification of remotely sensed multispectral data. IEEE International Conference on Systems, Man and Cybernetics (pp. 342-344).
Kettnaker, V. M. (2003). Time-dependent HMMs for visual intrusion detection. IEEE CVPR Workshop on Event Mining: Detection and Recognition of Events in Video.
Kittler, J., & Foglein, J. (1984). Contextual classification of multispectral pixel data. Image and Vision Computing, 2, 13-29.
Kopel, D. B. (1995). Massaging the medium: Analyzing and responding to media violence without harming the first. Kansas Journal of Law and Public Policy, 4, 17.
Kuleshov, L. (1974). Kuleshov on film: Writing of Lev Kuleshov. Berkeley, CA: University of California Press.
Meyer, F. G. (1994). Time-to-collision from first-order models of the motion fields. IEEE Transactions on Robotics and Automation (pp. 792-798).
Mittal, A., & Altman, E. (2003). Contextual information extraction for video data. The 9th International Conference on Multimedia Modeling (MMM), Taiwan (pp. 209-223).
Mittal, A., & Cheong, L.-F. (2001). Dynamic Bayesian framework for extracting temporal structure in video. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2, 110-115.
Mittal, A., & Cheong, L.-F. (2003). Framework for synthesizing semantic-level indices. Journal of Multimedia Tools and Applications, 135-158.
Nack, F., & Parkes, A. (1997). The application of video semantics and theme representation in automated video editing. Multimedia Tools and Applications, 57-83.
Nicholson, A. (1994). Dynamic belief networks for discrete monitoring. IEEE Transactions on Systems, Man, and Cybernetics, 24(11), 1593-1610.
Olson, I. R., & Chun, M. M. (2001). Temporal contextual cueing of visual attention. Journal of Experimental Psychology: Learning, Memory, and Cognition.
Pavlovic, V., Frey, B., & Huang, T. (1999). Time-series classification using mixed-state dynamic Bayesian networks. IEEE Conference on Computer Vision and Pattern Recognition (pp. 609-615).
Pavlovic, V., Garg, A., Rehg, J., & Huang, T. (2000). Multimodal speaker detection using error feedback dynamic Bayesian networks. IEEE Conference on Computer Vision and Pattern Recognition.
Pynadath, D. V., & Wellman, M. P. (1995). Accounting for context in plan recognition with application to traffic monitoring. International Conference on Artificial Intelligence, 11.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE (vol. 77, pp. 257-286).
Sondhauss, U., & Weihs, C. (1999). Dynamic Bayesian networks for classification of business cycles. SFB Technical Report No. 17. Online at http://www.statistik.unidortmund.de/
Viterbi, A. J. (1967). Error bounds for convolutional codes and an asymptotically optimal decoding algorithm. IEEE Transactions on Information Theory (pp. 260-269).
Yeo, B. L., & Liu, B. (1995). Rapid scene analysis on compressed video. IEEE Transactions on Circuits, Systems, and Video Technology (pp. 533-544).
Yu, T. S., & Fu, K. S. (1983). Recursive contextual classification using a spatial stochastic model. Pattern Recognition, 16, 89-108.
Zweig, G. G. (1998). Speech recognition with dynamic Bayesian networks. PhD thesis, Dept. of Computer Science, University of California, Berkeley.


Chapter 5

Content-Based Music Summarization and Classification


Changsheng Xu, Institute for Infocomm Research, Singapore Xi Shao, Institute for Infocomm Research, Singapore Namunu C. Maddage, Institute for Infocomm Research, Singapore Jesse S. Jin, The University of Newcastle, Australia Qi Tian, Institute for Infocomm Research, Singapore

ABSTRACT

This chapter aims to provide a comprehensive survey of the technical achievements in the area of content-based music summarization and classification and to present our recent achievements. In order to give a full picture of the current status, the chapter covers the aspects of music summarization in the compressed domain and the uncompressed domain, music video summarization, music genre classification, and semantic region detection in acoustical music signals. By reviewing the current technologies and the demands from practical applications in music summarization and classification, the chapter identifies the directions for future research.


INTRODUCTION
Recent advances in computing, networking, and multimedia technologies have resulted in a tremendous growth of music-related data and accelerated the need to analyse and understand music content. Music representation is multidimensional and time-dependent. How to effectively organize and process such a large variety and quantity of music information to allow efficient browsing, searching and retrieving has been an active research area in recent years. Audio content analysis, especially music content understanding, poses a big challenge for those who need to organize and structure music data. The difficulty arises in converting featureless collections of raw music data into suitable forms that would allow tools to automatically segment, classify, summarize, search and retrieve large databases. The research community is now at the point where the limitations and properties of developed methods are well understood and used to provide and create more advanced techniques tailored to user needs and able to better bridge the semantic gap between current audio/music technologies and the semantic needs of interactive media applications.

The aim of this chapter is to provide a comprehensive survey of the technical achievements in the area of content-based music summarization and classification and to present our recent achievements. The next section introduces music representation and feature extraction. Music summarization and music genre classification are then presented in detail in the two sections that follow. Semantic region detection in acoustical music signals is described in the fifth section. Finally, the last section gives concluding remarks and discusses future research directions.

MUSIC REPRESENTATION AND FEATURE EXTRACTION


Feature extraction is the first step of content-based music analysis. There are many features that can be used to characterize the music signal. Generally speaking, these features can be divided into three categories: timbral textural features, rhythmic content features and pitch content features.

Timbral Textural Features


Timbral textural features are used to differentiate mixture of sounds that may have the same or similar rhythmic and pitch contents. The use of these features originates from music-speech discrimination and speech recognition. The calculated features are based on the short-time Fourier transform (STFT) and are calculated for each frame.

Amplitude Envelope
The amplitude envelope describes the energy change of the signal in the time domain and is generally equivalent to the so-called ADSR (attack, decay, sustain and release) of a song. The envelope of the signal is computed with a frame-by-frame root mean square (RMS) and a third-order Butterworth low-pass filter (Ellis, 1994). RMS is a perceptually relevant measure and has been shown to correspond closely to the way we hear loudness. The length of the RMS frame determines the time resolution of the
envelope. A large frame length yields low transient information and a small frame length yields greater transient energy.
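A minimal sketch of the envelope computation just described, frame-by-frame RMS smoothed by a third-order Butterworth low-pass filter; the frame size, hop size and cutoff frequency are illustrative choices, not values from the chapter.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def amplitude_envelope(x, sr, frame_len=1024, hop=512, cutoff_hz=10.0):
    """RMS per frame followed by a 3rd-order Butterworth low-pass filter."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    frame_rate = sr / hop                          # sampling rate of the RMS curve
    b, a = butter(3, cutoff_hz / (frame_rate / 2.0), btype="low")
    return filtfilt(b, a, rms)

if __name__ == "__main__":
    sr = 22050
    t = np.arange(sr * 2) / sr
    x = np.sin(2 * np.pi * 440 * t) * np.linspace(0.0, 1.0, t.size)   # swelling tone
    env = amplitude_envelope(x, sr)
    print(env[:3].round(3), env[-3:].round(3))     # envelope rises over the clip
```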

Spectral Power

For a music signal s(n), each frame is weighted with a Hanning window h(n):
h(n) = \sqrt{8/3} \cdot \frac{1}{2}\left[1 - \cos\left(\frac{2\pi n}{N}\right)\right]    (1)

where N is the number of samples of each frame. The spectral power of the signal s(n) is calculated as

S(k) = 10 \log_{10}\left|\frac{1}{N}\sum_{n=0}^{N-1} s(n)\,h(n)\,\exp\!\left(-j\frac{2\pi nk}{N}\right)\right|^2    (2)

Spectral Centroid
The spectral centroid (Tzanetakis & Cook, 2002) is defined as the centre of gravity of spectrum magnitude in STFT.

C_t = \frac{\sum_{n=1}^{N} M_t(n)\cdot n}{\sum_{n=1}^{N} M_t(n)}    (3)

where M_t(n) is the magnitude of the Fast Fourier Transform (FFT) at the t-th frame and frequency bin n. The spectral centroid is a measure of spectral shape; higher centroid values correspond to brighter textures with more high frequencies.

Spectrum Rolloff
Spectrum Rolloff (Tzanetakis & Cook, 2002) is the frequency below which 85% of spectrum distribution is concentrated. It is also a measure of the spectral shape.

Spectrum Flux
Spectrum flux (Tzanetakis & Cook, 2002) is defined as the variation value of spectrum between two adjacent frames.

SF = \left\lVert N_t(f) - N_{t-1}(f) \right\rVert^2    (4)

where N_t(f) and N_{t-1}(f) are the normalized magnitudes of the FFT at the current frame t and the previous frame t-1, respectively. Spectrum flux is a measure of the amount of local spectral change.
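The three spectral-shape measures above can be computed together from the STFT magnitudes; the sketch below follows equations (3)-(4) and the 85% rolloff definition, but is an illustrative implementation rather than the authors' code.

```python
import numpy as np

def spectral_shape_features(frames, rolloff_pct=0.85):
    """Per-frame spectral centroid, rolloff bin and flux.

    `frames` is a 2-D array of windowed time-domain frames, one per row.
    """
    mags = np.abs(np.fft.rfft(frames, axis=1))                  # M_t(n)
    bins = np.arange(1, mags.shape[1] + 1)
    centroid = (mags * bins).sum(axis=1) / (mags.sum(axis=1) + 1e-12)
    cumulative = np.cumsum(mags, axis=1)
    rolloff = (cumulative < rolloff_pct * cumulative[:, -1:]).sum(axis=1)
    norm = mags / (np.linalg.norm(mags, axis=1, keepdims=True) + 1e-12)
    flux = np.r_[0.0, ((norm[1:] - norm[:-1]) ** 2).sum(axis=1)]
    return centroid, rolloff, flux

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((4, 1024)) * np.hanning(1024)
    c, r, f = spectral_shape_features(frames)
    print(c.round(1), r, f.round(4))
```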

Cepstrum
The mel-frequency cepstrum has proven to be highly effective in automatic speech recognition and in modeling the subjective pitch and frequency content of audio signals. Psychophysical studies have found the phenomena of the mel pitch scale and the critical band, and the frequency scale-warping to the mel scale has led to the cepstrum-domain representation. The cepstrum can be expressed by means of the Mel-Frequency Cepstral Coefficients (MFCCs). These are computed from the FFT power coefficients (Logan & Chu, 2000). The power coefficients are filtered by a triangular bandpass filter bank consisting of K triangular bands, which have a constant mel-frequency interval and cover the frequency range of 0-4000 Hz. Denoting the output of the filter bank by S_k (k = 1, 2, ..., K), where K is typically set to 19 for speech recognition while a value higher than 19 is used for music signals (because music signals have a wider spectrum than speech signals), the MFCCs are calculated as

c_n = \sqrt{\frac{2}{K}} \sum_{k=1}^{K} (\log S_k)\cos\!\left[n\left(k - 0.5\right)\frac{\pi}{K}\right], \quad n = 1, 2, \ldots, L    (5)

where L is the order of the cepstrum. Figure 1 illustrates the estimation procedure of MFCC.
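Given the mel filter-bank outputs S_k, the DCT of equation (5) is a few lines of code; computing the filter-bank energies themselves is assumed to have been done already. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def mfcc_from_filterbank(filterbank_energies, L=13):
    """MFCCs c_1..c_L from mel filter-bank energies S_1..S_K via eq. (5)."""
    S = np.asarray(filterbank_energies, dtype=float)
    K = S.size
    n = np.arange(1, L + 1)[:, None]            # cepstral coefficient index
    k = np.arange(1, K + 1)[None, :]            # filter-bank channel index
    basis = np.cos(n * (k - 0.5) * np.pi / K)   # shape (L, K)
    return np.sqrt(2.0 / K) * (np.log(S + 1e-12) @ basis.T)

if __name__ == "__main__":
    energies = np.linspace(1.0, 0.1, 26)        # toy filter-bank output, K = 26
    print(mfcc_from_filterbank(energies).round(3))
```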

Zero Crossing Rates


Zero crossing rates are usually suitable for narrowband signals (Deller et al., 2000), but music signals include both narrowband and broadband components. Therefore, the short time zero crossing rates can be used to characterize music signals. The N-length short time zero crossing rates are defined as

Z_s(m) = \frac{1}{N}\sum_{n=m-N+1}^{m}\frac{\left|\operatorname{sgn}\{s(n)\} - \operatorname{sgn}\{s(n-1)\}\right|}{2}\,w(m-n)    (6)

where w(m) is a rectangular window.
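A direct implementation of equation (6) with a rectangular window is shown below; the frame length and hop size are illustrative.

```python
import numpy as np

def short_time_zcr(x, N=1024, hop=512):
    """Short-time zero crossing rate per frame, eq. (6), rectangular window."""
    signs = np.sign(x)
    signs[signs == 0] = 1                       # avoid spurious crossings at exact zeros
    crossings = np.abs(np.diff(signs)) / 2.0    # 1 at each sign change
    return np.array([crossings[m:m + N].sum() / N
                     for m in range(0, len(crossings) - N + 1, hop)])

if __name__ == "__main__":
    sr = 8000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 100 * t)          # 100 Hz: about 200 crossings per second
    print(short_time_zcr(tone)[:3].round(4))    # roughly 200 / 8000 = 0.025
```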

Low Energy Component


Low energy component (LEC) (Wold et al., 1996) is the percentage of frames that have energy less than the average energy over the whole signal. It measures amplitude distribution of the signal.
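The low energy component reduces to a comparison of per-frame energies against their mean; a short sketch with illustrative frame sizes follows.

```python
import numpy as np

def low_energy_component(x, frame_len=1024, hop=512):
    """Percentage of frames whose energy is below the signal's average frame energy."""
    frames = [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]
    energies = np.array([np.mean(f ** 2) for f in frames])
    return 100.0 * np.mean(energies < energies.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    quiet, loud = rng.standard_normal(34100), 5 * rng.standard_normal(10000)
    print(round(low_energy_component(np.r_[quiet, loud]), 1))   # most frames are quiet
```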

Figure 1. Estimation procedure of MFCC: digital signal → FFT → mel-scale filter bank → sum → log → DCT → MFCC

Figure 2. Estimation procedure of octave-based spectral contrast: digital signal → FFT → octave-scale filter bank → peak/valley selection → log → Karhunen-Loève transform → spectral contrast

Spectral Contrast Feature


In Jiang et al. (2002), an octave-based spectral contrast feature was proposed to represent the spectral characteristics of a music clip. These features are computed from the FFT power coefficients. The power coefficients are filtered by a triangular bandpass filter bank, that is, an octave-scale filter. Then the spectral peaks, valleys, and their differences in each subband are extracted. The final step is to apply a Karhunen-Loève transform to eliminate the correlation among the raw features. Figure 2 illustrates the estimation procedure of the octave-based spectral contrast.

Rhythmic Content Features


Rhythmic content features characterize the movement of music signals over time and contain information such as the regularity of the rhythm, beat, tempo, and time signature. The feature set representing rhythm structure is usually extracted from a beat histogram. Tzanetakis et al. (2001) used a beat histogram built from the autocorrelation function of the signal to extract rhythmic content features. The time-domain amplitude envelope of each band is extracted by decomposing the music signal into a number of octave frequency bands. The envelopes of the bands are then summed together, followed by autocorrelation of the resulting sum envelope. The dominant peaks of the autocorrelation function, corresponding to the various periodicities of the signal's envelope, are accumulated over the whole music source into a beat histogram where each bin corresponds to a peak lag. Figure 3 illustrates the procedure of constructing a beat histogram from music sources. After the beat histogram is created, the rhythmic content features are extracted from it; generally, they include the amplitudes of the first and second histogram peaks, the ratio of the amplitude of the second peak over the first peak, the periods of the first and second peaks, the overall sum of the histogram, and so forth.
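A simplified sketch of the beat histogram construction follows: it skips the octave-band decomposition and works directly on a single amplitude envelope, accumulating the positive autocorrelation peaks into tempo bins. The parameter values and the synthetic test envelope are illustrative, not from the cited work.

```python
import numpy as np

def beat_histogram(envelope, frame_rate, min_bpm=40, max_bpm=200, n_bins=40):
    """Accumulate autocorrelation peaks of an amplitude envelope into a
    tempo (BPM) histogram, in the spirit of Tzanetakis et al. (2001)."""
    env = envelope - envelope.mean()
    ac = np.correlate(env, env, mode="full")[env.size - 1:]
    ac = ac / (ac[0] + 1e-12)
    # Local maxima of the autocorrelation are the candidate beat periodicities.
    peaks = [l for l in range(1, ac.size - 1) if ac[l - 1] < ac[l] >= ac[l + 1]]
    hist = np.zeros(n_bins)
    edges = np.linspace(min_bpm, max_bpm, n_bins + 1)
    for lag in peaks:
        bpm = 60.0 * frame_rate / lag
        if min_bpm <= bpm < max_bpm and ac[lag] > 0:
            hist[np.searchsorted(edges, bpm, side="right") - 1] += ac[lag]
    return hist, edges

if __name__ == "__main__":
    frame_rate = 100.0                                        # envelope samples per second
    t = np.arange(0, 30, 1.0 / frame_rate)
    env = np.maximum(0.0, np.sin(2 * np.pi * 2.0 * t)) ** 4   # pulse train at 120 BPM
    hist, edges = beat_histogram(env, frame_rate)
    peak = int(hist.argmax())
    print("dominant tempo bin:", edges[peak], "-", edges[peak + 1], "BPM")
```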

Figure 3. Procedure to construct a beat histogram from music signals

Pitch Content Features

Pitch is a perceptual term which can be approximated by the fundamental frequency. The pitch content features describe the melody and harmony information of music signals, and a pitch content feature set is extracted based on various multipitch detection techniques. More specifically, the multipitch detection algorithm described in Tolonen and Karjalainen (2000) can be used to estimate the pitch. In this algorithm, the signal is decomposed into two frequency bands and an amplitude envelope is extracted for each frequency band. The envelopes are summed and an enhanced autocorrelation function is computed so that the effect of integer multiples of the peak frequencies on multiple pitch detection is reduced. The prominent peaks of this summary enhanced autocorrelation function correspond to the main pitches for that short segment of sound and are accumulated into pitch histograms. The pitch content features can then be extracted from the pitch histograms.

MUSIC SUMMARIZATION
The creation of a concise and informative extraction that accurately summarizes original digital content is extremely important in a large-scale information repository. Currently, the majority of summaries used commercially are manually produced from the original content. For example, a movie clip may provide a good preview of the movie. However, as a large volume of digital content has become publicly available on the Internet and in other physical storage media during recent years, automatic summarization has become increasingly important and necessary. There are a number of techniques being proposed and developed to automatically generate summaries from text (Mani & Maybury, 1999), speech (Hori & Furui, 2000) and video (Gong et al., 2001). Similar to text, speech and video summarization, music summarization refers to determining the most common and salient themes of a given music piece that may be used to represent the music and is readily recognizable by a listener. Automatic music summarization can be applied to music indexing, content-based music retrieval and web-based music distribution. A summarization system for MIDI data has been developed (Kraft et al., 2001). It uses the repetition nature of MIDI compositions to automatically recognize the main melody theme segment for a given piece of music. A detection engine converts melody recognition and music summarization to string processing and provides efficient ways of retrieval and manipulation. The system recognizes maximal length segments that have
nontrivial repetitions in each track of the MIDI data of music pieces. These segments are treated as basic units in music composition and are the candidates for the melody of a music piece. However, the MIDI format is not sampled audio data (i.e., actual audio sounds); instead, it contains synthesizer instructions, or MIDI notes, to reproduce audio. Compared with actual audio sounds, MIDI data cannot provide a real playback experience or an unlimited sound palette for instruments and sound effects. On the other hand, MIDI data is a structured format, so it is easy to create a summary according to its structure. Therefore, MIDI summarization has little practical significance. In this section, we focus on music summarization for sound recordings from the real world, both in the uncompressed domain, such as the WAV format, and in the compressed domain, such as the MP3 format.

Music Summarization in Uncompressed Domain


Although an exact definition of what makes part of a song memorable or distinguishable is still unclear, a general consensus assumes the most repeated section plays an important role. Approaches aimed at automatic music summarization include two stages. The first stage is feature extraction. The music signal is segmented into frames and each frame is characterized by features. Features related to instrumentation, texture, dynamics, rhythmic characteristics, melodic gestures and harmonic content are used. Unfortunately, some of these features are difficult to extract, and it is not always clear which features are most relevant. As a result, the first challenge in music summarization is to determine the relevant features and find a way to extract them. In the second stage (music structure analysis stage), the most repeated sections are identified based on similarity analysis using various methods discussed below.

Feature Extraction
The commonly used features for music summarization are timbral texture features, including:

• Amplitude envelope, used in Xu et al. (2002)
• MFCC, used in Logan and Chu (2000), Xu et al. (2002), Foote et al. (2002) and Lu and Zhang (2003)
• Octave-based spectral contrast, used in Lu and Zhang (2003). It considers the spectral peak, spectral valley and their difference in each subband. Therefore, it can roughly reflect the relative distribution of harmonic and nonharmonic components in the spectrum, which complements the weak point of MFCC, namely that MFCC averages the spectral distribution in each subband and thus loses the relative spectral information.
• Pitch, used in Chai and Vercoe (2003). The pitch can be estimated using the autocorrelation of each frame. Although all the test data in their experiment are polyphonic, the authors believe that this feature is able to capture much information for music signals with a leading vocal.

Music Structure Analysis


All approaches to music structure analysis are based on detecting the most repeated section of a song. As a result, the critical issue in this phase is how to measure the similarity of different frames or different sections. More precisely, these approaches can be classified into two main categories: machine learning approaches and pattern matching approaches. Machine learning approaches attempt to categorize each frame of a song into a certain cluster based on the similarity distance between that frame and the other frames in the same song. The number of frames in each cluster is then used to measure the occurrence frequency, and the final summary is generated from the cluster that contains the largest number of frames. Pattern matching approaches aim at matching candidate excerpts, each consisting of a fixed number of continuous frames, against the whole song. The final summary is generated from the best matching excerpt.

Machine Learning Approach

Since the music structure can be determined without prior knowledge, the requirement for unsupervised learning can be naturally met. Clustering is the most widely used approach in this category, and there are several music structure analysis methods based on clustering. Logan (Logan & Chu, 2000) used clustering techniques to find the most salient part of a song, called the key phrase, in selections of popular music. They proposed a modified cross-entropy or Kullback-Leibler (KL) distance to measure the similarity between different frames. In addition, they proposed a Hidden Markov Model (HMM)-based summarization method in which each state of the HMM corresponds to a group of similar frames in the song. Xu (Xu et al., 2002) proposed a clustering-based method to group segmented frames into different clusters to structure the music content; they used the Mahalanobis distance as the similarity measure. Lu and Zhang (2003) divided their music structure analysis into two steps. In the first step, they used a clustering method similar to Xu et al. (2002) to group the frames. In the second step, they used the estimated phrase length and the phrase boundary confidence of each frame to detect the phrase boundaries. In this way, the final music summary will not include broken music phrases.

Pattern Matching Approach

The pattern matching approach aims at matching a candidate excerpt against the whole song to find the most salient part. The best matching excerpt can be the one that is most similar to the whole song or the one that is repeated most often in the whole song. Foote et al. (2002) and Cooper and Foote (2002) proposed a representation called the similarity matrix for visualizing and analyzing the structure of music. One use of this representation was to locate points of significant change in the music, which they called audio novelty. The audio novelty score is based on the similarity matrix, which compares frames of the music signal based on the features extracted from the audio. The resulting summary was selected to maximize quantitative measures of the similarity between candidate excerpts and the source audio as a whole. In their method, a simple Euclidean distance or cosine distance was used to measure the similarity between different frames.

Bartsch and Wakefield (2001) used chroma-based features and the similarity matrix proposed by Foote for music summarization. Chai and Vercoe (2003) proposed a Dynamic Programming method to detect the repetition of a fixed length excerpt in a song one by one. Firstly, they segmented the whole song into frames, and grouped the fixed number of continuous frames into excerpts. Then they computed the repetition property of each excerpt in the song using Dynamic Programming. The consecutive excerpts that have the same repetitive property were merged into sections and each section was labelled according to the repetitive relation (i.e., they gave each section a symbol such as A,B, etc.). The final summary was generated based on the most frequently repeated section.
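To make the pattern-matching idea concrete, the Python sketch below scores fixed-length candidate excerpts by the average similarity of their frames to the whole piece, using a cosine similarity matrix. This is a simplified illustration in the spirit of the Cooper and Foote approach, not their implementation; the toy feature sequence and all function names are ours.

import numpy as np

def cosine_similarity_matrix(feats):
    """Frame-by-frame cosine similarity matrix S[i, j] from feature vectors."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    return unit @ unit.T

def best_excerpt(feats, excerpt_frames):
    """Starting frame of the candidate excerpt whose frames are, on average,
    most similar to the whole piece."""
    S = cosine_similarity_matrix(feats)
    col_mean = S.mean(axis=1)          # similarity of each frame to the whole song
    kernel = np.ones(excerpt_frames) / excerpt_frames
    scores = np.convolve(col_mean, kernel, mode="valid")
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=12)         # recurring "chorus" feature pattern
    def chorus(n):
        return base + 0.2 * rng.normal(size=(n, 12))
    feats = np.vstack([rng.normal(size=(30, 12)), chorus(20),
                       rng.normal(size=(30, 12)), chorus(20),
                       rng.normal(size=(20, 12))])
    print("best 20-frame excerpt starts at frame", best_excerpt(feats, 20))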

Evaluation of Summary Result


It is difficult to objectively evaluate music summarization results because there is no absolute measure of the quality of a generated music summary. The only possible validation is thus through well designed user tests. Basically, the current subjective evaluation methods can be divided into two categories: the attributes ranking method and the ground truth based method.

The Attributes Ranking Method

The attributes ranking method uses appropriate attributes to assess the users' perception of the systems. Logan and Chu (2000) and Lu and Zhang (2003) used the general perception of the subjects as the evaluation attribute for the music summary. The rating has three levels: 3, 2 and 1, representing good, acceptable and unacceptable. By comparing the average scores of different summarization methods, the highest-scoring method emerges as the best one. The drawback of this evaluation standard is obvious: it is extremely coarse, not only in the number of rating levels, but also in the number of evaluation attributes. Xu et al. (2002) and Shao (Shao et al., 2004) proposed a more detailed evaluation standard. They divided the general perception of a summary into three attributes, namely clarity, conciseness and coherence. In addition, the rating scale was extended to five levels. Chai and Vercoe (2003) considered four novel attributes to evaluate a music summary: the percentage of the summary that contains a vocal portion, the percentage of the summary that contains the song's title, the percentage of the summary that starts at the beginning of a section and the percentage of the summary that starts at the beginning of a phrase.

The Ground Truth Based Method

The ground truth based method compares the summarization method against a predefined ground truth, which is generally a summary generated manually by music experts from the original music. Lu and Zhang (2003) measured the overlap between the automatically extracted music summary and the summary generated manually by a music expert. Cooper and Foote (2002) proposed to judge whether a summarization is good by generating summaries of different lengths for the same song and validating whether the longer summary contains the shorter one. Their basic assumption is that an ideal longer summary should contain the corresponding shorter one.

Music Summarization in Compressed Domain


So far, the music summarization methods mentioned above are all performed on uncompressed domain. Due to the huge size of music data and the limited bandwidth, audio/music analysis in compressed domain is in great demand. Research in this field is still in its infancy and there are many open questions that need to be solved. There are a number of approaches proposed for compressed domain audio processing. Most of the work focuses on compressed domain audio segmentation (Tzanetakis & Cook, 2000; Patel & Sethi, 1996). Compared with compressed domain speech processing, compressed domain music processing is much more difficult, because music consists of many types of sounds and instrument effects. Wang and Vilermo (2001) used the window type information encoded in MPEG-1 Layer 3 side information header to detect beats. The short windows were used for short but intensive sounds to avoid pre-echo. They found that the window-switching pattern of pop-music beats for their specific encoder at bit-rates of 64-96 kbps gives (long, long-to-short, short, short, short-to-long, long) window sequences in 99% of the beats. However, there is no music summarization method available for compressed domains. Due to the large amount of compressed domain music (e.g., MP3) available nowadays, automatic music summarization in compressed domain is in high demand. We have proposed an automatic music summarization method in the compressed domain (MP3 format) (Shao et al., 2004). Considering the features that have been used to characterize music content for summarization in uncompressed domain, we have developed similar features in the compressed domain to simulate those features. Compressed domain feature selection includes amplitude envelope, spectral centroid and mel-frequency cepstrum. They are extracted frame by frame over a segmentation window which includes 30 MP3 granules. We have analyzed and illustrated that these features approximated well the corresponding features in uncompressed domain (PCM) (Shao et al., 2004). However, there are two major differences between compressed domain and uncompressed domain feature extraction. Firstly, the time resolution for PCM and MP3 is different. For PCM samples, we can arbitrarily adjust the window size; but for MP3 samples, the resolution unit is granule, which means we can only increase or decrease the window size by a granule (corresponding to 576 PCM samples). Secondly, to conceal side effects, we segment PCM samples into a fixed length and overlap windows to generate summary. But for MP3 samples, we group 30 MP3 granules into a bigger window which is not overlapping. Based on calculated features of each frame, all the Machine Learning Approaches mentioned in the uncompressed domain can be used to find the most salient part of the music. We use clustering method in Xu et al. (2002) to group the music frames and get the structure of the music content. After clustering, the structure of the music content can be obtained. Each cluster contains frames with similar features. Summary can be generated in terms of this structure and music domain knowledge. According to music theory, the most distinctive or representative music themes occur repetitively in an entire music work. The scheme of summary generation can be found in Xu et al. (2002).
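As an illustration of the clustering step used in both the uncompressed- and compressed-domain methods, the Python sketch below groups fixed-length frames by feature similarity and takes an excerpt from the cluster with the most frames. The RMS/zero-crossing features stand in for the MFCC- or MP3-derived features described above, k-means replaces the specific clustering and summary-generation scheme of Xu et al. (2002), and all names are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def frame_features(x, sr, frame_s=0.5):
    """Toy per-frame features (RMS energy and zero-crossing rate); a real
    system would use MFCCs or similar timbral features."""
    n = int(frame_s * sr)
    frames = x[: len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return np.column_stack([rms, zcr])

def key_phrase(x, sr, summary_s=10.0, frame_s=0.5, k=2):
    """Return (start, end) sample indices of a summary excerpt drawn from the
    cluster that contains the largest number of frames."""
    feats = frame_features(x, sr, frame_s)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
    biggest = np.bincount(labels).argmax()
    first_frame = int(np.flatnonzero(labels == biggest)[0])   # start of excerpt
    n = int(frame_s * sr)
    start = first_frame * n
    return start, min(start + int(summary_s * sr), len(x))

if __name__ == "__main__":
    sr = 8000
    t = np.arange(0, 60, 1 / sr)
    x = 0.2 * np.sin(2 * np.pi * 220 * t)      # quiet "verse" material
    x[10 * sr:30 * sr] *= 4.0                  # loud, repeated "chorus" sections
    x[40 * sr:60 * sr] *= 4.0
    s, e = key_phrase(x, sr)
    print(f"summary excerpt: {s / sr:.1f}s - {e / sr:.1f}s")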


Evaluation

We adopted the attributes ranking method to evaluate the summarization results. Three attributes, namely clarity, conciseness and coherence, are introduced. The experiments show that the summarization conducted on MP3 samples is comparable with the summarization conducted on PCM samples for all genres of music in the test set. The purpose of testing with music of different genres is to determine the effectiveness of the proposed method in creating summaries across genres. A complete description of the results can be found in Shao et al. (2004).

Music Video Summarization


Nowadays, many music companies are putting their music video (MTV) products on the Web and customers can purchase them online. From the customer point of view, they would prefer to watch the highlight of an MTV before they make a decision on whether to purchase or not. On the other hand, from the music company point of view, they would be glad to invoke the buying interests of the music fans by showing the highlights of a music video rather than showing all of the video. Although there are summaries in some Web sites, they are currently generated manually, which needs expensive manpower and is time-consuming. Therefore, it is crucial to come up with an automatic summarization approach for music videos. There are a number of approaches proposed for automatically creating video summaries. The existing video summarization methods can be classified into two categories: key-frame extraction and highlight creation. Using a set of key frames to create a video summary is the most common approach. A great number of key frame extraction methods (DeMenthon et al. 1998; Gunsel & Tekalp, 1998) have been proposed. Key frames can help the user identify the desired shots of video, but they are insufficient to help the user obtain a general idea of whether the created summary is relevant or not. To make the created summary more relevant and representative to the video content, video highlight creation methods (Sundaram et al., 2002; Assfalg et al., 2002) are proposed to reduce a long video into a short sequence and help the user determine whether a video is worth viewing in its entirety. It can provide an impression of the entire video content or only contain the most interesting video sequences. MTV is a special kind of video. It is an extension of music and widely welcomed by music fans. Nowadays, automatic video summarization has been applied to sports video (Yow et al., 1995), news video (Nakamura & Kanade, 1997), home video (Gong et al., 2001) and movies (Pfeiffer et al., 1996). However, there is no widely accepted summarization technique used for music video. We have proposed an automatic music video summarization approach (Shao et al., 2003), which is described in the following subsections.

Structure of Music Video


Video programs such as movies, dramas, talk shows, and so forth, have a strong synchronization between the audio and visual contents. What we hear from the audio track is highly correlated with what we see on the screen, and vice versa. For this type of video program, since synchronization between audio and image is critical, the summarization has to be either audio-centric or image-centric. The audio-centric summarization can be accomplished by first selecting important audio segments of the original
video based on certain criteria and then concatenating them to compose an audio summary. To enforce the synchronization, the visual summary has to be generated by selecting the image segments corresponding to those audio segments which form the audio summary. Similarly, the image-centric summarization can be created by selecting representative image segments from the original video to form a visual summary, and then taking the corresponding audio segments to form the associated audio summary. For these types of summarizations, either audio or visual contents of the original video will be sacrificed in the summaries. However, music video programs do not have a strong synchronization between their audio and visual contents. Considering a music video program in which an audio segment presents a song sung by a singer, the corresponding image segment could be a close-up shot of the singer sitting in front of a piano, or shots of some related interesting scenes. The audio content does not directly refer to the corresponding visual content. Since music video programs do not have strong synchronization between the associated audio and visual contents, we propose to first create an audio and a visual summary separately, and then integrate the two summaries with partial alignment. With this approach, we can maximize the coverage for both audio and visual contents without sacrificing either of them.

Our Proposed Method


Figure 4 is the block diagram of the proposed music video summarization system. The music video is separated into the audio track and the visual track. For the audio track, a music summary is created by analyzing the music content based on music features, an adaptive clustering algorithm and music domain knowledge. For the visual track, shots are detected and clustered using visual content analysis. Finally, the music video summary is created by specially aligning the music summary with the clustered visual shots. Assuming that the music summarization for the audio track has been settled in the previous section, we focus on the process of shot detection and on aligning the music summary with the clustered visual shots. The music summarization scheme can be found in Xu et al. (2002).

Figure 4. Block diagram of the proposed summarization system

Shot Detection

In general, to create a video summary, the original video sequence must first be structured into a shot cluster set S. Any pair in the set must be visually different, and all shots belonging to the same cluster must be visually similar. We choose the first frame appearing after each detected shot boundary as a key frame. Therefore, for each shot, we have a key frame related to it. When comparing the similarities of two different shots, we calculate the difference between their two key frames using color histograms:

Dv(i, j) = Σ_{e=Y,U,V} Σ_{k=1,...,n} |h_i^e(k) - h_j^e(k)|    (7)

where h_i^e(k) and h_j^e(k) are the YUV-level histograms of key frames i and j, respectively.
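A small Python sketch of this key-frame distance, assuming key frames are given as RGB arrays in [0, 1]: the BT.601 RGB-to-YUV conversion, the 32-bin quantization and the absolute (L1) bin-wise difference are our assumptions where the chapter does not specify them.

import numpy as np

def rgb_to_yuv(img):
    """BT.601 RGB -> YUV conversion for a float image in [0, 1]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.14713 * r - 0.28886 * g + 0.436 * b
    v = 0.615 * r - 0.51499 * g - 0.10001 * b
    return np.stack([y, u, v], axis=-1)

def yuv_histograms(img, n_bins=32):
    """Per-channel YUV histograms h^e(k), e in {Y, U, V}."""
    yuv = rgb_to_yuv(img)
    ranges = [(0.0, 1.0), (-0.62, 0.62), (-0.62, 0.62)]
    return [np.histogram(yuv[..., c], bins=n_bins, range=ranges[c])[0]
            for c in range(3)]

def dv(frame_i, frame_j, n_bins=32):
    """Equation (7): sum over channels and bins of |h_i^e(k) - h_j^e(k)|."""
    hi = yuv_histograms(frame_i, n_bins)
    hj = yuv_histograms(frame_j, n_bins)
    return sum(np.abs(a - b).sum() for a, b in zip(hi, hj))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.random((120, 160, 3))
    b = np.clip(a + 0.3, 0, 1)     # a brighter version of the same frame
    print("Dv(a, a) =", dv(a, a))
    print("Dv(a, b) =", dv(a, b))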
The total number of clusters in S varies depending on the internal structure of the original video. Given a shot cluster set S, the video sequence with the minimum redundancy measure is the one in which all the shot clusters have a uniform occurrence probability and an equal time length of 1.5 seconds (Gong et al., 2001). Based on these criteria, the video summaries were created using the following major steps (a code sketch of steps 2-5 follows the list):

1. Segment the video into individual camera shots.
2. Group the camera shots into clusters based on their visual similarities. After the clustering process, each resultant cluster consists of the camera shots whose similarity distance to the centre of the cluster is below a threshold D.
3. For each cluster, find the shot with the longest length and use it as the representative shot for the cluster.
4. Discard the clusters whose representative shot is shorter than 1.5 seconds. For those clusters with a representative shot longer than 1.5 seconds, we cut the shot to 1.5 seconds.
5. Sort the representative shots of all clusters by time code, resulting in the representative shot set U = {u_1, u_2, ..., u_m}, m ≤ n, where n is the total number of clusters in the shot set S.
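The Python sketch below (referred to in the list above) walks through steps 2-5 under simple assumptions: shots are dictionaries holding start/end times and a key-frame histogram, a greedy threshold-based clustering stands in for whatever clustering the authors used, and a histogram distance (such as the dv sketch above) is passed in as a function.

import numpy as np

def cluster_shots(shots, dist, threshold):
    """Greedy clustering: assign each shot to the first cluster whose first
    member's key frame is within `threshold`, otherwise start a new cluster."""
    clusters = []                       # each cluster is a list of shot indices
    for i, shot in enumerate(shots):
        for members in clusters:
            if dist(shots[members[0]]["hist"], shot["hist"]) < threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def representative_shots(shots, dist, threshold, min_len=1.5):
    """Steps 2-5: cluster shots, keep the longest shot per cluster, discard
    clusters whose representative is shorter than 1.5 s, trim the rest to
    1.5 s, and sort by time code."""
    reps = []
    for members in cluster_shots(shots, dist, threshold):
        best = max(members, key=lambda i: shots[i]["end"] - shots[i]["start"])
        length = shots[best]["end"] - shots[best]["start"]
        if length >= min_len:
            reps.append({"start": shots[best]["start"],
                         "end": shots[best]["start"] + min_len,
                         "hist": shots[best]["hist"]})
    return sorted(reps, key=lambda s: s["start"])

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    l1 = lambda a, b: np.abs(a - b).sum()
    shots = [{"start": t, "end": t + d, "hist": rng.random(32)}
             for t, d in [(0.0, 2.0), (2.0, 1.0), (3.0, 4.0), (7.0, 2.5)]]
    for rep in representative_shots(shots, l1, threshold=5.0):
        print(f"representative shot {rep['start']:.1f}-{rep['end']:.1f}s")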

Music Video Alignment

The final task for creating a music video summary is the alignment operation that partially aligns the image segments in the video summary with the associated music segments. Our goal for the alignment is that the summary should be smooth and natural,
and should maximize the coverage for both the music and visual contents of the original music video without sacrificing audio or visual parts. Assume that the whole time span L_sum of the video summary is divided by the alignment into P partitions (required clusters), and that the time length of partition i is T_i. Because each image segment forming the visual summary must be at least L_min seconds long (a time slot equals one L_min duration), as shown in Figure 5, partition i will provide

N_i = T_i / L_min    (8)

and hence the total number of available time slots becomes

N_total = Σ_{i=1}^{P} N_i    (9)

For each partition, the time length of the music subsummary lasts for three to five seconds, and the time length of a shot is 1.5 seconds. Therefore, the alignment problem can be formally described as follows.

Given:
1. An ordered set of representative shots U = {u_1, u_2, ..., u_m}, m ≤ n, where n is the total number of clusters in cluster set S.
2. P partitions and N_total time slots.

To extract: P sets of output shots R = {R_1, R_2, ..., R_P} which are the best matches between the shot set U and the N_total time slots, where P is the number of partitions,

R_i = {r_i1, ..., r_ij, ..., r_iNi} ⊆ U,   i = 1, 2, ..., P,   N_i = T_i / L_min

and r_i1, ..., r_ij, ..., r_iNi are the optimal shots selected from the shot set U for the i-th partition. By proper reformulation, this problem can be converted into the Minimum Spanning Tree (MST) problem (Dale, 2003).

Figure 5. Alignment operations on image and music: the audio summary is divided into partitions of length T_1, ..., T_P, and each partition is divided into time slots of length L_min that are filled with image segments
Let G = (V, E) represent an undirected graph with a finite set of vertices V and a weighted edge set E. The MST of a graph defines the lowest-weight subset of edges that spans the graph in one connected component. To apply the MST to our alignment problem, we use each vertex to represent a representative shot u_i, and an edge e_ij = (u_i, u_j) to represent the similarity between shots u_i and u_j. The similarity here is defined as a combination of time similarity and visual similarity, and we give time similarity a higher weight. The similarity is defined as follows:

e_ij = (1 - α) · T(i, j) + α · D̄(i, j)    (10)

where α (0 ≤ α ≤ 1) is a weight coefficient, and D̄(i, j) and T(i, j) represent the normalized visual similarity and the time similarity, respectively.

D̄(i, j) is defined as follows:

D̄(i, j) = Dv(i, j) / max(Dv(i, j))    (11)

where Dv(i, j) is the visual similarity calculated from Equation (7). After normalization, D̄(i, j) has a value ranging from 0 to 1. T(i, j) is defined as follows:

T(i, j) = 1/(F_j - L_i) if L_i < F_j, and T(i, j) = 0 otherwise    (12)

where L_i is the index of the last frame in the i-th shot, and F_j is the index of the first frame in the j-th shot. With this definition, the closer two shots are in the time domain, the higher the time similarity value they get. T(i, j) varies from 0 to 1 and equals 1 when shot j immediately follows shot i, that is, when there are no other frames between the two shots. In order to give the time similarity high priority, we set α to less than 0.5. Thus, we can create a similarity matrix for all shots in the representative shot set U, whose (i, j)-th element is e_ij. For every partition R_i, we generate an MST based on this similarity matrix. In summary, for creating a content-rich audio-visual extraction, we propose the following alignment operations (a code sketch of the MST-based slot filling follows the list):

1. Summarize the music track of the music video. The music summary consists of several partitions, each of which lasts for three to five seconds. The total duration of the summary is about 30 seconds.
2. Divide each music partition into several time slots, each of which lasts for 1.5 seconds.
3. For each music partition, find the corresponding image segment as follows:

• In the first time slot of the partition, find the corresponding image segment in the time domain. If it exists in the representative shot set U, assign it to the first slot and delete it from U; if not, identify it in the shot set S and find the most similar shot in U using the similarity measure defined in Equation (7). After finding the shot, take it as the root, apply the MST algorithm, find other shots in U, and fill them into the subsequent time slots of this partition.
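A rough Python sketch of the MST-based slot filling for one partition, using scipy's minimum_spanning_tree. Whether e_ij should be treated as a similarity or an edge cost is not fully specified above, so the similarity-to-cost inversion, the breadth-first walk of the tree and the toy shot data are our assumptions; function names are illustrative.

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, breadth_first_order

def edge_weights(shots, dv_matrix, alpha=0.3):
    """Equation (10): e_ij = (1 - alpha) * T(i, j) + alpha * normalized Dv."""
    n = len(shots)
    d_norm = dv_matrix / (dv_matrix.max() + 1e-12)            # Equation (11)
    t = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            li, fj = shots[i]["last_frame"], shots[j]["first_frame"]
            t[i, j] = 1.0 / (fj - li) if li < fj else 0.0     # Equation (12)
    return (1 - alpha) * t + alpha * d_norm

def fill_partition(shots, dv_matrix, root, n_slots, alpha=0.3):
    """Assign shots to a partition's time slots by walking an MST built on
    the similarity matrix, starting from the root shot."""
    sim = edge_weights(shots, dv_matrix, alpha)
    # csgraph finds a *minimum* spanning tree, so invert the similarities
    cost = sim.max() - sim + 1e-6
    np.fill_diagonal(cost, 0.0)
    mst = minimum_spanning_tree(cost)
    order, _ = breadth_first_order(mst, i_start=root, directed=False)
    return [int(i) for i in order[:n_slots]]

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    shots = [{"first_frame": 30 * k, "last_frame": 30 * k + 20} for k in range(6)]
    dv_matrix = rng.random((6, 6))
    dv_matrix = (dv_matrix + dv_matrix.T) / 2
    np.fill_diagonal(dv_matrix, 0.0)
    print("shots for partition:", fill_partition(shots, dv_matrix, root=0, n_slots=3))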

Future Work
We believe that there is still a long way to go before transcriptions can be generated automatically from acoustic music signals; current techniques are not robust and efficient enough. Thus, analyzing acoustic music signals directly, without transcription, is of practical importance in music summarization. Future work will focus on making summarization more accurate. To achieve this, on the one hand, we need to explore more music features that can be used to characterize the music content; on the other hand, we need to investigate more sophisticated music structure analysis methods to create more accurate and acceptable music summaries. In addition, we will investigate human perception of music more deeply; for example, what makes part of a piece of music sound like a complete phrase, and what makes it memorable or distinguishable. For music video, in addition to improving the summarization of the audio part, more sophisticated music/video alignment methods will be developed. Furthermore, other information in a music video can be integrated into the generation of the summary. For example, some Karaoke music videos have lyric captions, which can be detected and recognized. These captions, together with visual shots and vocal information, can be used to make a better music video summary.

MUSIC GENRE CLASSIFICATION


The ever-increasing wealth of digitized music on the Internet calls for an automated organization of music materials. Music genre is an important description that can be used to classify and characterize music from different sources such as music shops, broadcasts and Internet. It is very useful for music indexing and content-based music retrieval. For humans, it is not difficult to classify music into different genres. Although to make computers understand and classify music genre is a challenging task, there are still perceptual criteria related to the melody, tempo, texture, instrumentation and rhythmic structure that can be used to characterize and discriminate different music genres. A music genre is characterized by common features related to instrumentation, texture, dynamics, rhythmic characteristics, melodic gestures and harmonic content. Similar to music summarization, the challenge of genre classification is to determine the relevant features and find a way to extract them. Once the features have been extracted, it is then necessary to find an appropriate pattern recognition method for classification. Fortunately, there are a variety of existing machine learning and heuristic-based techniques that can be adapted to this task. Aucouturier and Pachet (2003) presented an overview of the various approaches for automatic genre classification and categorized them into two categories: prescriptive approaches and emergent approaches.


Prescriptive Approach
Aucouturier and Pachet (2003) defined the prescriptive approach as an automatic process that involves two steps: frame-based feature extraction followed by machine learning. Tzanetakis et al. (2001) cited a study indicating that humans are able to classify genre after hearing only 250 ms of a music signal. The authors concluded from this that it should be possible to make classification systems that do not consider music form or structure. This implied that real-time analysis of genre could be easier to implement than thought. The ideas were further developed in Tzanetakis and Cook (2002), where a fully functional system was described in detail. The authors proposed to use features related to timbral texture, rhythmic content and pitch content to classify pieces, and the statistical values (such as the mean and the variance) of these features were then computed. Several types of statistical pattern recognition (SPR) classifiers are used to identify genre based on feature data. SPR classifiers attempt to estimate the probability density function for the feature vectors of each genre. The Gaussian Mixture Model (GMM) classifier and K-Nearest Neighbor (KNN) classifier were, respectively, trained to distinguish between 20 music genres and three speech genres by feeding them with feature sets of a number of representative samples of each genre. Pye (2000) used MFCCs as the feature vector. Two statistical classifier, GMM and Tree-based Vector Quantization scheme, were used separately to classify music into six types: blues, easy listening, classic, opera, dance and rock. Grimaldi (Grimalidi et al., 2003) built a system using a discrete wavelet transform to extract time and frequency features, for a total of 64 time features and 79 frequency features. This is a greater number of features than Tzanetakis and Cook (2002) used, although few details were given about the specifics of these features. This work used an ensemble of binary classifiers to perform the classification operation with each trained on a pair of genres. The final classification is obtained through a vote of the classifiers. Tzanetakis, in contrast, used single classifiers that processed all features for all genres. Xu (Xu et al., 2003) proposed a multilayer classifier based on support vector machines (SVM) to classify music into four genres of pop, classic, rock and jazz. In order to discriminate different music genres, a set of music features was developed to characterize music content of different genres and an SVM learning approach was applied to build a multilayer classifier. For different layers, different features and support vectors were employed. In the first layer, the music was classified into pop/classic and rock/jazz using an SVM to obtain the optimal class boundaries. In the second layer, pop/classic music was further classified into pop and classic music and rock/jazz music was classified into rock and jazz music. This multilayer classification method can provide a better classification result than existing methods.
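The Python sketch below shows the two-step prescriptive recipe end to end on synthetic data: per-frame features are summarized by their mean and variance per clip, and a classifier is trained on the clip-level vectors. The toy features and the SVM-with-scaling pipeline are stand-ins for the richer feature sets and the GMM/KNN/SVM classifiers discussed above; all names and parameter values are illustrative.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def clip_features(x, sr, frame_s=0.025):
    """Mean and variance of simple frame-level features (RMS energy, ZCR,
    spectral centroid) over a whole clip."""
    n = int(frame_s * sr)
    frames = x[: len(x) // n * n].reshape(-1, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n, 1 / sr)
    centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-12)
    per_frame = np.column_stack([rms, zcr, centroid])
    return np.concatenate([per_frame.mean(axis=0), per_frame.var(axis=0)])

def toy_clip(genre, sr, seconds=3, rng=None):
    """Synthetic stand-ins for two 'genres' with different spectral content."""
    t = np.arange(0, seconds, 1 / sr)
    if genre == 0:                       # smooth, low-frequency material
        return np.sin(2 * np.pi * 220 * t) + 0.1 * rng.normal(size=t.size)
    return np.sign(np.sin(2 * np.pi * 880 * t)) + 0.5 * rng.normal(size=t.size)

if __name__ == "__main__":
    sr, rng = 8000, np.random.default_rng(4)
    X = np.array([clip_features(toy_clip(g, sr, rng=rng), sr) for g in [0, 1] * 30])
    y = np.array([0, 1] * 30)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(Xtr, ytr)
    print("toy genre classification accuracy:", clf.score(Xte, yte))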

Classification Results and Evaluation


It is impossible to give an exhaustive comparison of these approaches because they use different target taxonomies and different training sets. However, we can still make some interesting observations.


Tzanetakis et al. (2001) achieved 61% accuracy using 50 songs belonging to 10 genres. Pye (2000) reported 90% accuracy on a total set of 175 songs over five genres. Grimalidi et al. (2003) achieved a success rate of 82%, although only four categories are used. Xu et al. (2003) reported accuracy of 90% over four categories. A common remark is that some types of music have proven to be more difficult to classify than others. In particular, Classic and Techno are easy to classify, while Rock and Pop are not. A possible explanation for this is that the global frequency distribution of Classic and Techno is very different from other music types, whereas many Pop and Rock music versions use the same instrumentation.

Emergent Approach
There are two challenges in the prescriptive method (see the preceding section): how to determine features to characterize the music and how to find an appropriate pattern recognition method to perform classifications. The more fundamental problem, however, is to determine the structure of the taxonomy in which music pieces will be classified. Unfortunately, this is not a trivial problem. Different people may classify the same piece differently. They may also select genres from entirely different domains or emphasize different features. There is often an overlap between different genres, and the boundaries of each genre are not clearly defined. The lack of universally agreed upon definitions of genres and relationships between them makes it difficult to find appropriate taxonomies for automatic classification systems. Pachet and Cazaly (2000) attempted to solve this problem. They observed that the taxonomies currently used by the music industry were inconsistent and therefore inappropriate for the purpose of developing a global music database. They suggested building an entirely new classification system. They emphasized the goals of producing a taxonomy that was objective, consistent, and independent from other metadata descriptors and that supported searches by similarity. They suggested a tree-based system organized by genealogical relationships as an implementation, where only leaves would contain music examples. Each node would contain its parent genre and the differences between its own genre and that of its parent. Although merits exist, the proposed solution has problems of its own. To begin with, defining an objective classification system is much easier to say than do, and getting everyone to agree on a standardized system would be far from an easy task, especially when it is considered that new genres are constantly emerging. Furthermore, this system did not solve the problem of fuzzy boundaries between genres, nor did it deal with the problem of multiple parents that could compromise the tree structure. Since, up to now, no good solution for the ambiguity and inconsistence of music genre definition, Pachet et al. (2001) presented the emergent approach as the best approach to take to achieve automatic genre classification. Rather than using existing taxonomies, as done in prescriptive systems, emergent systems attempted to emerge classifications according to certain measure of similarity. The authors suggested some similarity measurements based on audio signals as well as on cultural similarity gleaned from the application of data mining techniques to text documents. They proposed the use of both collaborative filtering to search for similarities in the taste profiles of different individuals and co-occurrence analysis on the play lists of different radio programs and
track listings of CD compilation albums. Although this emergent system has not been successfully applied to music, the idea of automatically exploiting text documents to generate genre profiles is an interesting one.

Future Work
There are two directions for prescriptive approaches that need to be investigated in the future. First, more music features are needed to be explored because the better feature set can improve the performance dramatically. For example, some music genres use the same instrumentation, which implies that the timbre features are not good enough to separate them. Thus we can use rhythm features in the future. Existing beat-tracking systems are useful in acquiring rhythmic features. However, many existing beat-tracking systems provide only an estimate of the main beat and its strength. For the purpose of genre classification, more detailed information of features such as overall meter, syncopation, use of rubato, recurring rhythmic gestures and the relative strengths of beats and subbeats are all significant. Furthermore, we can consider segmenting the music clip according to its intrinsic rhythmic structure. It captures the natural structure of music genres better than the traditional fixed length window framing segmentation. The second direction is to scale-up unsupervised classification to music genre classification. Since the supervised machine learning method is limited by the inconsistency of built-in taxonomy, we will explore an unsupervised machine learning method which tries to emerge a classification from the database. We will also investigate the possibility of combining an unsupervised classification method with a supervised classification method for music genre classification. For example, the unsupervised method could be employed to initially classify music into broad and strongly different categories, and the supervised method could then be employed to classify finely narrowed subcategories. This would partially solve the problem of fuzzy boundaries between genres and could lead to better overall results. The emergent approach is able to extract high-level similarity between titles and artists and is therefore suitable for unsupervised clustering of songs into meaningful genre-like categories. These techniques suffer from technical problems, such as labelling clusters. These issues are currently under investigation.

SEMANTIC REGION DETECTION IN ACOUSTICAL MUSIC SIGNAL


Semantic region detection in music signals is a new direction in music content analysis. Compared with speech signals, music signals are heterogeneous because they contain different source signal mixtures in different regions. Thus, detecting these regions in music signals can bring down the complexity of both analysis of music content and information extraction. For example, in order to build a content-based music retrieval system, the sung vocal line is one of the intrinsic properties in a given music signal. In automatic music transcription, instrumental mixed vocal line should be analyzed in order to extract music note information such as the type of the instrument and note characteristics (attack, sustain, release, decay). In automatic music summarization, semantic structure of the song (i.e., intro, chorus, verses and bridge, outro) should be identified
accurately. In singing voice identification, it is required to detect and extract the vocal regions in the music. In music source separation, it is required to identify the vocal and instrumental section. To remove the voice from the music for applications such as Karaoke and for automatic lyrics generator, it is also required to detect the vocal sections in the music signal. The continuous raw music data can be divided into four preliminary classes: pure instrumental (PI), pure vocal (PV), instrumental mixed vocal (IMV), and silence (S). The pure instrumental regions contain signal mixture of many types of musical instruments such as string type, bowing type, blowing type, percussion type, and so forth. The pure vocal regions are the vocal lines sung without instrumental music. The IMV regions contain the mixture of both vocals and instrumental music. Although the silence is not a common section in popular music, it can be found at the beginning, ending and between chorus verse transitions in the songs. The singing voice is the oldest musical instrument and the human auditory physiology and perceptual apparatus have evolved to a high level of sensitivity to the human voice. After over three decades of extensive research on speech recognition, the technology has matured to the level of practical applications. However, speech recognition techniques have limitations when applied to singing voice identification because speech and singing voice differ significantly in terms of their production and perception by the human ear (Sundberg, 1987). A singing voice has more dynamic and complicated characteristics than speech (Saitou et al., 2002). The dynamic range of the fundamental frequency (F0) contours in a singing voice is wider than that in speech, and F0 fluctuations in singing voices are larger and more rapid than those in speech. The instrumental signals are broadband and harmonically rich signals compared with singing voice. The harmonic structures of instrumental music are in the frequency range up to 15 kHz, whereas singing voice is in the range of below 5 kHz. Thus it is important to revise the speech processing techniques according to structural musical knowledge so that these techniques can be applied to music content analysis such as semantic region detection where the signal complexity defers for different regions. The semantic region detection is a hot topic in content-based music analysis. In the following subsections, we summarize related work in this area and introduce our new approach for semantic region detection.

Related Work
Many of the existing approaches for speech, instrument and singing voice detection and identification are based on speech processing techniques.

Boundary Detection in Speech


Speech can be considered as a homogeneous acoustic signal because it contains the signal of a single source only. Regions in the speech can be classified into voiced, unvoiced, voiced/unvoiced mixture or silence depending on how the speech model is excited. The efficiency of extracting the meaning of the continuous speech signal depends on how accurately the regions are detected in the automatic speech recognition systems. Many researches on speech analysis have been done over three decades and methodologies are well established (Rabiner & Juang, 1993; Rabiner & Schafer, 1978). All
vowels and some consonants [m], [n], [l] are voiced while other consonants [f], [s], [t] are unvoiced. For unvoiced, the source is no longer the phonation of the vocal folds, but the turbulence caused by air is impeded by the vocal tract. Some consonants ([v], [z]) are mixed sounds (mixture of voiced and unvoiced) that use both phonation and turbulence to produce the overall sound (Kim, 1999). Speech is a narrow band (<10 kHz) signal and voiced and unvoiced regions are distinctive in the spectrogram. Voiced fricatives produce quasi-periodic pulses. Thus harmonically spaced strong frequencies in the lower frequency band (<1 kHz) can be noticed in the spectrogram. Since unvoiced fricatives are produced by exciting the vocal tract with broadband noise, it appears as a broadband frequency beam in the spectrum. The analysis of formant structures which are the resonant frequencies of the vocal tract tube has been one of the key techniques for detecting the voiced/unvoiced regions. Pitch contour and time domain speech modeling using signal energy and average zero crossing are some of the other speech features inspected for detecting the speech boundaries. Basic steps for detecting the boundaries in the speech signals are shown in Figure 6. The signal is first segmented into 30~40 ms with 50% overlapping short windows, then features are extracted. The shorter window smooths the shape of lower frequencies in the spectrum and highlights the lower frequency resonant in the vocal tract (formants). Another reason is that a shorter window is capable of detecting dynamic changes of the speech and with reasonable window overlap can detect these temporal properties in the signal. The linear predictive coding coefficients (LPC) calculated from the speech model, stationary signal spectrum representation using Cepstral coefficients and dominant pitch sensitive Mel-scaled Cepstral coefficients are some of the features extracted from the short time windowed signals and they are modeled with statistical learning methods. Most of the speech recognition systems have employed Hidden Markov Model (HMM) to detect these boundaries and it is found that HMM is efficient in modeling the dynamically changing speech properties in different regions.
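A minimal Python sketch of this pipeline (the steps summarized in Figure 6): 30 ms windows with 50% overlap, simple short-time energy and zero-crossing-rate features, and a naive threshold rule standing in for the LPC/cepstral features and HMM classifiers used in practice. The threshold values are arbitrary illustrative choices.

import numpy as np

def frame_signal(x, sr, win_s=0.030, overlap=0.5):
    """Segment a signal into 30 ms windows with 50% overlap."""
    win, hop = int(win_s * sr), int(win_s * sr * (1 - overlap))
    starts = list(range(0, len(x) - win + 1, hop))
    return np.array([x[s:s + win] for s in starts]), np.array(starts)

def voiced_unvoiced(x, sr):
    """Naive voiced/unvoiced/silence labelling per frame from short-time
    energy and zero-crossing rate (a stand-in for LPC/MFCC features + HMM)."""
    frames, starts = frame_signal(x, sr)
    energy = (frames ** 2).mean(axis=1)
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    labels = np.full(len(frames), "silence", dtype=object)
    labels[(energy > 0.01) & (zcr < 0.2)] = "voiced"      # periodic, low ZCR
    labels[(energy > 0.001) & (zcr >= 0.2)] = "unvoiced"  # noisy, high ZCR
    return starts, labels

if __name__ == "__main__":
    sr = 8000
    t = np.arange(0, 0.3, 1 / sr)
    rng = np.random.default_rng(5)
    x = np.concatenate([0.5 * np.sin(2 * np.pi * 150 * t),   # voiced vowel-like
                        0.05 * rng.normal(size=t.size),      # unvoiced fricative-like
                        np.zeros(t.size)])                   # silence
    starts, labels = voiced_unvoiced(x, sr)
    print(list(labels))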

Instrument Detection and Identification


Figure 6. Steps for boundary detection in speech signals: speech → signal segmentation/windowing → feature extraction → classification/learning → boundary detection

Research on instrumental music content analysis focuses on identifying musical instruments, timbre and musical notes from isolated tones recorded in a studio environment. The approaches are similar to the steps shown in Figure 6, where statistical signal processing and neural networks have been employed for feature extraction and pattern classification. Fujinaga (1998) trained a K-Nearest Neighbor (K-NN) classifier with spectral domain features extracted from 1,338 spectral slices representing 23 instruments playing a range of pitches. The extracted features included the mass or integral of the curve (zeroth-order moment), the centroid (first-order moment), the standard deviation (square root of the second-order central moment), the skewness (third-order central moment), the kurtosis (fourth-order central moment), higher-order central moments (up to the 10th), the
fundamental frequency, and the amplitudes of the harmonic partials. For the best recognition score, genetic algorithm (GA) was used to find the optimized subset of 352 main feature set. In his experiment, it was found that the accuracy varied from 8% to 84%. Cosi (Cosi et al., 1994) trained a Self-Organizing Map (SOM) with MFCCs extracted from isolated musical tones of the 40 different musical instruments for timbre classification. Martin (1999) trained a Bayesian network with different types of features such as spectral features, pitch, vibrato, tremolo features, and note characteristic features to recognize the nonpercussive musical instruments. Eronen and Klapuri (2000) proposed a system for musical instrument recognition using a wide set of features to model the temporal and spectral characteristics of sounds. Kashino and Murase (1997) compared the classification abilities of a feed-forward neural network with a K-Nearest Neighbor classifier, both trained with features of the amplitude envelopes for isolated instrument tones. Brown (1999) trained a GMM with constant-Q Cepstral coefficients for each instrument (i.e., oboe, saxophone, flute and clarinet), using approximately one minute of music data each. Maddage et al. (2002) extracted spectral power coefficients, ZCR, MFCCs and LPC derived Cepstral coefficients from eight pitch class electrical guitar notes (C4 to C5 shown in Table 1) and employ nonparametric learning technique (i.e., the nearest neighbour rule) to classify the musical notes. Over 85% accuracy of correct note classification was reported using a musical note database which has 100 samples of each note.

Singing Voice Detection and Identification


For singing voice detection, Berenzweig and Ellis (2001) used probabilistic features, which are generated from Cepstral coefficients using an MPL neural network acoustic model with 2000 hidden units. Two HMMs (vocal-HMM and nonvocal-HMM) were trained with these specific features, which are originally extracted from 61 fragments (one fragment = 15 seconds) of training data, to classify vocal and instrumental sections of a given song. However, the reported accuracy was only 81.2% with a 40-fragment training dataset. Kim and Brian (2002) first filtered the music signal using IIR band-pass filter (200~2000 Hz) to highlight the vocal energies and then vocal regions were identified by detecting high amounts of harmonicity of the filtered signal using inverse comb filterbank. They achieved 54.9% accuracy with the test set of 20 songs. Zhang (2003) and Zhang and Kuo (2001) used a simple threshold, which is calculated using energy, average zero crossing, harmonic coefficients and spectral flux features, to find the starting point of the vocal part of the music. The similar technique was applied to detect the semantic boundaries of the online audio data; that is, speech, music and environmental sound for the classification. However, the vocal detection accuracy was not reported. Tsai et al. (2003) trained 64-mixture vocal GMM and an 80-mixture nonvocal GMM with MFCCs extracted from 32-ms time length with 10-ms overlapped training data (216 song tracks). An accuracy of 79.8% was reported with their 200 testing tracks. In our previous work (Maddage et al., 2003; Gao et al., 2003; Xu et al., 2004), we trained different statistical learning techniques such as SVM, HMM, MLP neural

networks to detect the vocal and nonvocal boundaries in hierarchical fashion. Similar to other methods, our experiments were based on speech-related features, that is, LPC, LPC derived Cepstral coefficients, MFCCs, ZCR and Spectral Power (SP). After parameter tuning we could reach an accuracy over 80%. In Maddage et al. (2004a), we measured both the harmonic spacing and harmonic strength of both instrumental and vocal spectrums in order to detect the instrumental/vocal boundaries in the music. For singer identification, a vocal and instrumental model combination method has been proposed in Maddage et al. (2004b). In that method, vocal and instrumental sections of the songs were characterised using octave scale Cepstral coefficients (OSCC) and LPC derived Cepstral coefficients (LPCC), respectively, and two GMMs (one for vocal and the other for instrumental) were trained to highlight the singer characteristics. The experiments performed on a database of 100 songs indicated that the singer identification could be improved (by 6%) when the instrumental models were combined with the vocal model. The previous methods have borrowed the mature speech processing ideas such as fixed-frame-size acoustic signal segmentation (usually 20~100-ms frame size and 50% overlap), speech processing/coding feature extraction, and statistical learning procedures or linear threshold for segment classification, to detect the vocal/nonvocal boundaries of the music. Although these methods have achieved up to 80% of frame level accuracy, their performance is limited due to the fact that musical knowledge has not been effectively exploited in these (mostly bottom-up) methods. We believe that a combination of bottom-up and top-down approaches, which combines the strength of low-level features and high-level musical knowledge, can provide a powerful tool to improve system performance. In the following subsections, we investigate how well the speech processing techniques can cope with the semantic boundary detection task, and we propose a novel approach, which considers both signal processing and musical knowledge, to detect semantic regions in acoustical music signals.
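To illustrate the statistical-learning step that most of the cited methods share, the Python sketch below trains one GMM on vocal frames and one on nonvocal frames and labels test frames by the higher likelihood. The two-dimensional toy features stand in for MFCC/LPCC/OSCC vectors, and the mixture size is arbitrary; this is not any specific cited system.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_vocal_models(vocal_feats, nonvocal_feats, n_components=4):
    """Fit one GMM per class (vocal / nonvocal)."""
    gmm_v = GaussianMixture(n_components, covariance_type="full", random_state=0)
    gmm_n = GaussianMixture(n_components, covariance_type="full", random_state=0)
    return gmm_v.fit(vocal_feats), gmm_n.fit(nonvocal_feats)

def classify_frames(feats, gmm_vocal, gmm_nonvocal):
    """Label each frame by whichever class model gives the higher log-likelihood."""
    return np.where(gmm_vocal.score_samples(feats) >
                    gmm_nonvocal.score_samples(feats), "vocal", "nonvocal")

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    # toy 2-D "feature" clouds standing in for MFCC-like vectors
    vocal_train = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(500, 2))
    nonvocal_train = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(500, 2))
    gmm_v, gmm_n = train_vocal_models(vocal_train, nonvocal_train)
    test = np.vstack([rng.normal([2.0, 0.0], 0.5, (5, 2)),
                      rng.normal([-2.0, 0.0], 0.5, (5, 2))])
    print(classify_frames(test, gmm_v, gmm_n))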

Song Structure
The popular music structure often contains Intro, Verse, Chorus, Bridge and Outro (Ten Minute Master, 2003). The intro may be two, four or eight bars long (or longer), or there may be no intro in a song at all. The intro of a pop song is a flashback of the chorus. Both verse and chorus are eight to sixteen bars long. Typically the verse is not as strongly melodic as the chorus. However, the verse and chorus of some songs, such as Beatles songs, are equally strong, and most people can hum or sing their way through them. Usually the gap between verse and chorus is linked by a bridge, which may be only two or four bars. There are instrumental sections in a song, and they can be instrumental versions of the chorus or verse, or an entirely different tune with an altogether different set of chords. Silence may act as a bridge between the verse and chorus of a song, but such cases are rare. Since the sung vocal passages follow the changes in the chord pattern, we can apply the following knowledge of chords (Goto, 2001) to the timing information of vocal passages:

1. Chords are more likely to change on beat times than on other positions.
2. Chords are more likely to change on half-note times than on other positions of beat times.
3. Chords are more likely to change at the beginning of the measure than at other positions of half-note times.

Figure 7. Time alignment of the structure in songs with bars and quarter notes (a 4/4 song whose Intro, Verse, Chorus, Verse, Bridge, Chorus and Outro sections align with bar and quarter-note boundaries)

A typical song structure and its possible time alignment with bars and quarter notes are shown in Figure 7. The duration of the song can be measured by the number of quarter notes in the song, where the quarter-note time length is proportional to the interbeat timing. The beginning and end of the Intro, Verse, Chorus, Bridge and Outro usually start and end on quarter notes, as illustrated in Figure 7. Thus we detect both the onsets of the musical notes and the chord pattern changes to compute the quarter-note time length with high confidence.

Overview of the Proposed Method


The block diagram of the proposed approach is shown in Figure 8. In our approach, we detect the timing information of the song in terms of the quarter-note length, which is proportional to the interbeat time interval. Then the audio is segmented into frames whose length is proportional to the quarter-note length. To differentiate this from the frequently used fixed-length segmentation, we call it beat space segmentation (BSS). The technical details of rhythm extraction and beat space segmentation are given in the next section. There are two reasons why we use BSS:

1. Careful analysis of the song structure reveals that the time lengths of the semantic regions (PV, IMV, PI and S) are proportional to the interbeat time interval of the music, which corresponds to the quarter-note length (see the preceding section and the next section).
2. The dynamic behaviour of a beat-spaced signal section is quasi-stationary (Sundberg, 1987; Rossing et al., 2002). In other words, musically driven signal properties such as octave spectral spacing and musical harmonic structure change in beat-space time steps.

Figure 8. Block diagram of the proposed approach: musical audio → rhythm extraction → quarter-note-length proportional audio segmentation (beat space segmentation) → silent frame detection → musically modified feature extraction → statistical learning and classification → semantic regions (pure vocal, instrumental mixed vocals, pure instrumental)

Rhythm Extraction and Beat Space Segmentation


The rhythm extraction is important for obtaining metadata from the music. Rhythm can be perceived as a combination of strong and weak beats (Goto & Muraoka, 1994). A strong beat usually corresponds to the first and third quarter notes in a measure, and the weak beat corresponds to the second and fourth quarter notes in a measure. If the strong

Figure 8. Block diagram of the proposed approach

(Processing pipeline: Musical Audio → Rhythm extraction → Quarter-note-length-proportional audio segmentation (Beat Space Segmentation) → Silent frame detection → Musically modified feature extraction → Statistical learning & classification → Semantic regions: Pure Vocal (PV), Instrumental Mixed Vocals (IMV), Pure Instrumental (PI).)

If the strong beat constantly alternates with the weak beat, the interbeat interval, which is the temporal difference between two successive beats, would correspond to the temporal length of a quarter note. In our method, the beat corresponds to the sequence of equally spaced phenomenal impulses which define the tempo for the music (Scheirer, 1998). We assume the meter to be 4/4, this being the most frequent meter of popular songs, and the tempo of the input song to be constrained between 30 and 240 M.M. (Mälzel's Metronome: the number of quarter notes per minute) and almost constant (Scheirer, 1998). Our proposed rhythm tracking and extraction approach is shown in Figure 9. We employ a discrete wavelet transform technique to decompose the music signal according to octave scales. The frequency ranges of the octave scales are detailed in Table 1. The system detects both the onsets of musical notes (positions of the musical notes) and the chord changes in the music signal. Then, based on musical knowledge, the quarter-note time length is computed. The onsets are detected by computing both frequency transients and energy transients in the octave-scale decomposed signals, as described in Duxburg et al. (2002). In order to detect hard and soft onsets we take the weighted summation of the onsets detected in each sub-band, as shown in Equation (13), where Sb_i(t) is the onset computed in the i-th sub-band at time t and On(t) is the weighted sum of each sub-band onset at time t.

Figure 9. Rhythm tracking and extraction


(Audio music is decomposed into eight octave-scale sub-bands using wavelets. Onset detection per sub-band uses frequency transients, transient energy and moving thresholds; autocorrelation and peak tracking then yield the interbeat time information. In parallel, a musical chord detector compares the sub-band frequency spectrum against a frequency code book of musical chords, using a distance measure, to give the chord-change time information; rhythm knowledge links the two.)

On(t) = \sum_{i=1}^{8} w(i)\, Sb_i(t)                                                                  (13)
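The weighted summation in Equation (13) can be sketched as follows. The per-sub-band onset detection itself (frequency and energy transients) is not shown, and the choice of weights w(i) is left open; the chapter does not specify their values.

# A sketch of Equation (13): combining per-sub-band onset strengths into a
# single onset curve On(t). The weights w are illustrative placeholders.
import numpy as np

def combined_onsets(subband_onsets, weights):
    """subband_onsets: array of shape (8, T) holding Sb_i(t) for the 8 octave
    sub-bands; weights: length-8 array w(i). Returns On(t) with shape (T,)."""
    subband_onsets = np.asarray(subband_onsets, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return weights @ subband_onsets   # On(t) = sum_i w(i) * Sb_i(t)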

Based on a statistical analysis of the autocorrelation of the detected strong and weak onset times On(t), we obtain an interbeat interval corresponding to the temporal length of a quarter note. By increasing the sensitivity of peak tracking in the regions between the detected quarter notes, we obtain sixteenth-note level accuracy, as shown in Figure 10. We created a code book which contains the spectra of all possible major and minor chord patterns. The musical chords are generated synthetically by mixing prerecorded musical notes of different musical instruments (piano, bass guitar, acoustic guitar, Roland synthesizer and a MIDI tone database). The size of the code book is reduced by vector quantization (VQ). To detect the chord, we compute the distance between the spectrum of the given signal segment and the spectral patterns in the code book, and the closest spectral pattern in the code book is assigned. The detected time information of chord pattern changes is compared with the interbeat time information (length of a quarter note) given by the onset detector. The chord change timings are multiples of the quarter-note time length, as described in the section on Song Structure. We divide the detected sixteenth-note level pulses into four groups that correspond to the length of a quarter note. Then the music is framed into quarter-note spaced segments for vocal onset analysis based on the musical knowledge of chords.
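A rough sketch of the codebook-based chord labelling described above is given below. The Euclidean distance measure, the spectrum representation and the toy codebook are simplified stand-ins for the vector-quantized spectral patterns of synthesized major and minor chords used in the chapter.

# A simplified sketch of chord identification against a spectral code book.
import numpy as np

def nearest_chord(segment_spectrum, codebook):
    """codebook: dict mapping chord label -> reference spectrum (1-D array).
    Returns the label whose reference spectrum is closest to the segment's."""
    seg = np.asarray(segment_spectrum, dtype=float)
    return min(codebook,
               key=lambda label: np.linalg.norm(seg - codebook[label]))

def chord_change_times(segment_spectra, segment_times, codebook):
    """Label each beat-spaced segment and return the times where the label changes."""
    labels = [nearest_chord(s, codebook) for s in segment_spectra]
    return [t for prev, curr, t in zip(labels, labels[1:], segment_times[1:])
            if curr != prev]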

Silence Detection
Silence is defined as a segment of imperceptible music, including unnoticeable noise and very short clicks. We use short-time energy to detect silence. The short-time energy function of a music signal is defined as


Figure 10. Three-second clip of a musical audio signal


(A three-second excerpt from "Lady in Red" by Chris De Burgh, sampled at 44,100 Hz: the panels show the signal strength, the results of autocorrelation, and the sixteenth-note level plot.)

Table 1. Musical note frequencies and their placement in the octave scales sub-bands
Octave scale ~B1 (frequency range 0 ~ 64 Hz): notes below C2.
Sub-band 01 (C2 to B2, 64 ~ 128 Hz): C 65.406, C# 69.296, D 73.416, D# 77.782, E 82.407, F 87.307, F# 92.499, G 97.999, G# 103.826, A 110.000, A# 116.541, B 123.471
Sub-band 02 (C3 to B3, 128 ~ 256 Hz): C 130.813, C# 138.591, D 146.832, D# 155.563, E 164.814, F 174.614, F# 184.997, G 195.998, G# 207.652, A 220.000, A# 233.082, B 246.942
Sub-band 03 (C4 to B4, 256 ~ 512 Hz): C 261.626, C# 277.183, D 293.665, D# 311.127, E 329.628, F 349.228, F# 369.994, G 391.995, G# 415.305, A 440.000, A# 466.164, B 493.883
Sub-band 04 (C5 to B5, 512 ~ 1024 Hz): C 523.251, C# 554.365, D 587.330, D# 622.254, E 659.255, F 698.456, F# 739.989, G 783.991, G# 830.609, A 880.000, A# 932.328, B 987.767
Sub-band 05 (C6 to B6, 1024 ~ 2048 Hz): C 1046.502, C# 1108.730, D 1174.659, D# 1244.508, E 1318.510, F 1396.913, F# 1479.978, G 1567.982, G# 1661.219, A 1760.000, A# 1864.655, B 1975.533
Sub-band 06 (C7 to B7, 2048 ~ 4096 Hz): C 2093.004, C# 2217.46, D 2349.318, D# 2489.016, E 2637.02, F 2793.826, F# 2959.956, G 3135.964, G# 3322.438, A 3520, A# 3729.31, B 3951.066
Sub-band 07 (C8 to B8, 4096 ~ 8192 Hz): C 4186.008, C# 4434.92, D 4698.636, D# 4978.032, E 5274.04, F 5587.652, F# 5919.912, G 6271.928, G# 6644.876, A 7040, A# 7458.62, B 7902.132
Sub-band 08: all higher octave scales in the frequency range 8192 ~ 22050 Hz.

E_n = \frac{1}{N} \sum_{m} \left[ x(m)\, w(n-m) \right]^2                                              (14)

where x(m) is the discrete-time music signal, n is the time index of the short-time energy, and w(n) is a rectangular window whose length is equal to the quarter-note time length, that is


w(n) = \begin{cases} 1, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}                          (15)

If the short-time energy function is continuously lower than a preset threshold (there may be durations in which the energy is higher than the threshold, but such durations should be short enough and far enough apart from each other), then the segment is indexed as silence. Silence segments are removed from the music sequence.
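A minimal sketch of the short-time energy silence test in Equations (14) and (15) is shown below; the threshold value and the frame handling are illustrative and would need tuning in practice.

# A sketch of silence detection using the short-time energy of Equation (14)
# with the rectangular window of Equation (15). The threshold is illustrative.
import numpy as np

def short_time_energy(frame):
    """E_n = (1/N) * sum_m [x(m) w(n-m)]^2 with a rectangular window over the frame."""
    frame = np.asarray(frame, dtype=float)
    return np.sum(frame ** 2) / len(frame)

def silent_frames(frames, threshold=1e-4):
    """Return indices of quarter-note-length frames whose energy stays below the threshold."""
    return [i for i, f in enumerate(frames) if short_time_energy(f) < threshold]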

Feature Extraction

We propose a new frequency scaling, called the Octave Scale, instead of the Mel scale to calculate Cepstral coefficients (see the section on Timbral Textual Features). The fundamental frequencies F0 and the harmonic structures of musical notes follow an octave scale, as shown in Table 1. Sung vocal lines always follow the instrumental line, so that both pitch and harmonic structure variations are also in octave scale. In our approach we divide the whole frequency band into eight sub-bands corresponding to the octaves in the music. The frequency ranges of the sub-bands are shown in Figure 11. The useful range of fundamental frequencies of tones produced by musical instruments is considerably less than the audible frequency range. The highest tone of the piano has a frequency of 4186 Hz, and this seems to have evolved as a practical upper limit for fundamental frequencies. We have considered the entire audible spectrum to accommodate the harmonics (overtones) of the high tones. The range of fundamental frequencies of the voice demanded in classical opera is 80~1200 Hz, which corresponds to the low end of the bass voice and the high end of the soprano voice, respectively. Linearly placing the maximum number of filters in the bands where the majority of the singing voice is present gives better resolution of the signal in that range. Thus, we use 6, 8, 12, 12, 8, 8, 6 and 4 filters in sub-bands 1 to 8, respectively. The octave scale Cepstral coefficients are then extracted according to Equation (5). In order to make a comparison, we also extract Cepstral coefficients from the Mel scale. Figure 12 illustrates the deviation of the third Cepstral coefficient derived from both scales for the PV, PI and IMV classes. The frame size is 30 ms without overlap. It can be seen that the standard deviation is lower for the coefficients derived from the Octave scale, which makes it more robust in our application.
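To make the octave-scale filter allocation concrete, the sketch below places triangular filters linearly within each sub-band using the 6, 8, 12, 12, 8, 8, 6 and 4 filter counts given above, and then applies a standard log-energy plus DCT step to obtain cepstral coefficients. The band edges, the triangular filter shape and the DCT step are our assumptions for illustration; they are not necessarily identical to the chapter's Equation (5).

# A sketch of octave-scale cepstral coefficients: triangular filters placed
# linearly within each octave sub-band, filter log-energies, then a DCT.
import numpy as np
from scipy.fftpack import dct

SUB_BANDS = [(0, 128), (128, 256), (256, 512), (512, 1024),
             (1024, 2048), (2048, 4096), (4096, 8192), (8192, 22050)]
FILTERS_PER_BAND = [6, 8, 12, 12, 8, 8, 6, 4]      # as stated in the text

def octave_filterbank(n_fft, sample_rate):
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    filters = []
    for (lo, hi), n_filt in zip(SUB_BANDS, FILTERS_PER_BAND):
        edges = np.linspace(lo, hi, n_filt + 2)    # linear placement within the band
        for k in range(n_filt):
            left, centre, right = edges[k], edges[k + 1], edges[k + 2]
            rise = np.clip((freqs - left) / (centre - left + 1e-9), 0, 1)
            fall = np.clip((right - freqs) / (right - centre + 1e-9), 0, 1)
            filters.append(np.minimum(rise, fall))  # triangular response
    return np.array(filters)                        # shape: (64, n_fft // 2 + 1)

def octave_cepstral_coefficients(frame, sample_rate, n_coeff=12):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    bank = octave_filterbank(len(frame), sample_rate)
    log_energies = np.log(bank @ spectrum + 1e-10)
    return dct(log_energies, norm='ortho')[:n_coeff]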


Figure 11. Filter-band distribution in octave scale for calculating Cepstral coefficients

(Critical-band filter positions within the octave sub-bands: 01: ~B1 and C2 to B2, 0~128 Hz; 02: C3 to B3, 128~256 Hz; 03: C4 to B4, 256~512 Hz; 04: C5 to B5, 512~1024 Hz; 05: C6 to B6, 1024~2048 Hz; 06: C7 to B7, 2048~4096 Hz; 07: C8 to B8, 4096~8192 Hz; 08: C9 and above, 8192~22050 Hz.)


Figure 12. Third Cepstral coefficient derived from Mel-scale (1~1000 frame in black colored lines) and Octave scale (1001~2000 frames in ash colored lines)

(Panels: (a) Pure Vocal (PV), (b) Pure Instrumental (PI), (c) Instrumental Mixed Vocals (IMV); x-axis: frame number, 1~2000.)

Statistical Learning
To find the semantic region boundaries, we use a two-layer hierarchical classification method, as shown in Figure 13. This has been shown to be more efficient than a single-layer multiclass classification method (Xu et al., 2003; Maddage et al., 2003). In initial experiments (Gao et al., 2003; Maddage et al., 2003), it was found that PV can be effectively separated from PI and IMV. Thus we separate PV from the other classes in the first layer. In the second layer, we distinguish PI from IMV. However, PV regions in popular music are rare compared with both PI and IMV regions. In our experiments, SVM and GMM are used as classifiers in layers 1 & 2. When the classifier is SVM, layers 1 & 2 are modeled with a parameter-optimized radial basis kernel function (Vapnik, 1998). The Expectation Maximization (EM) algorithm (Bilmes, 1998) is used to estimate the parameters for layers 1 & 2 in GMM. The orders of Cepstral coefficients used for both the Mel scale and the Octave scale in layers 1 & 2, for both classifiers (SVM & GMM), and the number of Gaussian mixtures in each layer, are shown in Table 2.
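A minimal sketch of the two-layer hierarchy with SVMs is given below. The use of scikit-learn, the default RBF hyperparameters and the feature layout are assumptions made for illustration; the chapter's own classifiers are parameter-optimized per layer, and GMMs can be substituted in the same structure.

# A sketch of the two-layer hierarchical classification: layer 1 separates PV
# from (PI + IMV); layer 2 separates PI from IMV.
import numpy as np
from sklearn.svm import SVC

class HierarchicalFrameClassifier:
    def __init__(self):
        self.layer1 = SVC(kernel='rbf')   # layer 1: PV vs. (PI + IMV)
        self.layer2 = SVC(kernel='rbf')   # layer 2: PI vs. IMV

    def fit(self, features, labels):
        X = np.asarray(features)          # (n_frames, n_coeffs) cepstral features
        y = np.asarray(labels)            # labels in {'PV', 'PI', 'IMV'}
        self.layer1.fit(X, y == 'PV')     # binary: PV vs. the rest
        rest = y != 'PV'
        self.layer2.fit(X[rest], y[rest]) # binary: PI vs. IMV
        return self

    def predict(self, features):
        X = np.asarray(features)
        is_pv = self.layer1.predict(X)    # layer 1 decision per frame
        other = self.layer2.predict(X)    # layer 2 decision per frame
        return ['PV' if pv else lab for pv, lab in zip(is_pv, other)]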

Experimental Results
Our experiments are performed using 15 popular English songs (Westlife: Moments, If I Let You Go, Flying Without Wings, My Love and Fragile Heart; Backstreet Boys: Show Me the Meaning of Being Lonely, Quit Playing Games (with My Heart), All I Have to Give, Shape of My Heart and Drowning; Michael Learns to Rock (MLTR): Paint My Love, 25 Minutes, Breaking My Heart, How Many Hours and Someday) and 5 Sri Lankan songs (Maa Baa la Kale, Mee Aba Wanaye, Erata Akeekaru, Wasanthaye Aga and Sehena Lowak). All music data are sampled from commercial CDs at a 44.1 kHz sample rate and 16 bits per sample in stereo.

Figure 13. Hierarchical classification


(Layer 1 separates Pure Vocals (PV) from Pure Instrumental + Instrumental Mixed Vocals (PI + IMV); Layer 2 separates Pure Instrumental (PI) from Instrumental Mixed Vocals (IMV).)

Table 2. Orders of Cepstral coefficients and no. of GMs employed


Layer      Mel scale    Octave scale    No. of Gaussian Mixtures (GMs)
Layer 1    18           10              24 (PV), 28 (PI+IMV)
Layer 2    24           12              48 (PI), 64 (IMV)

We first conduct experiments on our proposed rhythm tracking algorithm to find the quarter-note time intervals. We test on 30 s, 60 s and full-length intervals of each song and note the average quarter-note time length calculated by the rhythm tracking algorithm. Our system obtains an average accuracy of 95% in the number of beats detected, with a 20 ms error margin on the quarter-note time intervals. The music is then framed into quarter-note spaced segments, and experiments are conducted for the detection of the class boundaries (PV, PI, & IMV) of the music. The twenty songs are used in cross-validation, where three and two songs of each artist are used for training and testing, respectively, in each turn. We perform three types of experiments:

EXP1: training and testing songs are divided into 30 ms frames with 50% overlap.
EXP2: training songs are divided into 30 ms frames with 50% overlap, and testing songs are framed according to quarter-note time intervals.
EXP3: training and testing songs are framed according to quarter-note time intervals.

Experimental results (in % accuracy) using SVM learning are illustrated in Table 3. The Mel scale is optimized by having more filter positions at lower frequencies (Deller et al., 2000), because the dominant pitches of vocals and musical instruments are in the lower frequencies (< 4000 Hz) (Sundberg, 1987). Though the training and testing sets for PV are small (not many songs have sections of only singing voice), it is seen that in all experiments (EXP 1~3) the classification accuracy of PV is higher than that of the other two classes. However, when vocals are mixed with instruments, finding the region boundaries is more difficult than for the other two classes.

Table 3. Results of hierarchical classification using SVM (in % accuracy)


Classifier: SVM
                 Mel-scale                      Octave scale
           PV       PI       IMV           PV       PI       IMV
EXP1     72.35    68.98    64.57         73.76    73.18    68.05
EXP2     67.34    65.18    64.87         75.22    74.15    73.16
EXP3     74.96    72.38    70.17         85.73    82.96    80.36

The results of EXP1 demonstrate the higher performance of the Octave scale compared with the Mel scale for a 30 ms frame size. In EXP2, a slightly better performance can be seen for the Octave scale, but not for the Mel scale, compared with EXP1. This demonstrates that Cepstral coefficients are sensitive to the frame length as well as to the position of the filters in the Mel or Octave scales. EXP3 is seen to achieve the best performance among EXP1 ~ EXP3, demonstrating the importance of the inclusion of musical knowledge in this application. Furthermore, the better results obtained with the Octave scale demonstrate its ability to model music signals better than the Mel scale for this application. The results obtained for EXP3 with SVM and GMM are compared in Figure 14. In Table 2, we have shown the numbers of Gaussian mixtures which were empirically found to be good for the layer 1 and layer 2 classifications. Since the number of Gaussian mixtures in layer 2 is higher than in layer 1, this reflects that classifying PI from IMV is more difficult than classifying PV from the rest. It can be seen that SVM performs better than GMM in identifying the region boundaries. We can thus infer that this implementation of SVM, which uses a radial basis kernel function, is more efficient than the GMM method, which uses probabilistic modeling with the EM algorithm.

Figure 14. Comparison between SVM and GMM in EXP3
(Bar chart of EXP3 classification accuracy (%) for PV, PI and IMV; the SVM scores are higher than the GMM scores for each class.)


Future Work
In addition to semantic boundary detection in music, the beat space segmentation platform is useful for music structural analysis, content-based music source separation and automatic lyrics generation.

CONCLUSION
This chapter has reviewed past and current technical achievements in content-based music summarization and classification. We summarized the state of the art in music summarization, music genre classification and semantic region detection in music signals. We also introduced our latest work in compressed-domain music summarization, musical video summarization and semantic boundary detection in acoustic music signals. Although many advances have been achieved in content-based music summarization and classification, there are still many research issues that need to be explored to make these approaches more applicable in practice. We also identified future research directions in music summarization, music genre classification and semantic region detection in music signals.

REFERENCES
Assfalg, J., Bertini, M., DelBimbo, A., Nunziati, W., & Pala, P. (2002). Soccer highlights detection and recognition using HMMs. In Proceedings IEEE International Conference on Multimedia and Expo, 1 (pp. 825-828), Lausanne, Switzerland.
Aucouturier, J. J., & Pachet, F. (2003). Representing musical genre: A state of the art. Journal of New Music Research, 32(1), 1-12.
Bartsch, M.A., & Wakefield, G.H. (2001). To catch a chorus: Using chroma-based representations for audio thumbnailing. In Proceedings Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York (pp. 15-18).
Berenzweig, A. L., & Ellis, D. P. W. (2001). Locating singing voice segments within music signals. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York (pp. 119-122).
Bilmes, J. (1998). A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report ICSI-TR-97-021, University of California, Berkeley.
Brown, J. C. (1999, March). Computer identification of musical instruments using pattern recognition with Cepstral coefficients as features. Journal of Acoustic Society America, 105(3), 1064-1072.
Chai, W., & Vercoe, B. (2003). Music thumbnailing via structural analysis. In Proceedings ACM International Conference on Multimedia, Berkeley, California (pp. 223-226).
Cooper, M., & Foote, J. (2002). Automatic music summarization via similarity analysis. In Proceedings International Conference on Music Information Retrieval, Paris (pp. 81-85).

Cosi, P., De Poli, G., & Prandoni, P. (1994). Timbre characterization with mel-cepstrum and neural nets. In Proceedings International Computer Music Conference (pp. 42-45).
Dale, N. B. (2003). C++ plus data structures (3rd ed.). Boston: Jones and Bartlett.
Deller, J.R., Hansen, J.H.L., & Proakis, J.G. (2000). Discrete-time processing of speech signals. IEEE Press.
DeMenthon, D., Kobla, V., & Maybury, M.T. (1998). Video summarization by curve simplification. In Proceedings of the ACM International Conference on Multimedia, Bristol, UK (pp. 211-218).
Duxburg, C., Sandler, M., & Davies, M. (2002). A hybrid approach to musical note onset detection. In Proceedings of the International Conference on Digital Audio Effects, Hamburg, Germany (pp. 33-38).
Eiilis, G. M. (1994). Electronic filter analysis and synthesis. Boston: Artech House.
Eronen, A., & Klapuri, A. (2000). Musical instrument recognition using cepstral coefficients and temporal features. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey (Vol. 2, pp. II753-II756).
Foote, J., Cooper, M., & Girgensohn, A. (2002). Creating music video using automatic media analysis. In Proceedings ACM International Conference on Multimedia, Juan-les-Pins, France (pp. 553-560).
Fujinaga, I. (1998). Machine recognition of timbre using steady-state tone of acoustic musical instruments. In Proceedings International Computer Music Conference (pp. 207-210).
Gao, S., Maddage, N.C., & Lee, C.H. (2003). A hidden Markov model based approach to musical segmentation and identification. In Proceedings IEEE Pacific-Rim Conference on Multimedia (PCM), Singapore (pp. 1576-1580).
Gong, Y., Liu, X., & Hua, W. (2001). Summarizing video by minimizing visual content redundancies. In Proceedings IEEE International Conference on Multimedia and Expo, Tokyo (pp. 788-791).
Goto, M. (2001). An audio-based real-time beat tracking system for music with or without drum-sounds. Journal of New Music Research, 30(2), 159-171.
Goto, M., & Muraoka, Y. (1994). A beat tracking system for acoustic signals of music. In Proceedings of the Second ACM International Conference on Multimedia (pp. 365-372).
Grimalidi, M., Kokaram, A., & Cunningham, P. (2003). Classifying music by genre using a discrete wavelet transform and a round-robin ensemble. Work report. Trinity College, University of Dublin, Ireland.
Gunsel, B., & Tekalp, A. M. (1998). Content-based video abstraction. In Proceedings IEEE International Conference on Image Processing, Chicago, Illinois (Vol. 3, pp. 128-132).
Hori, C., & Furui, S. (2000). Improvements in automatic speech summarization and evaluation methods. In Proceedings International Conference on Spoken Language Processing, Beijing, China (Vol. 4, pp. 326-329).
Jiang, D., Lu, L., Zhang, H., Tao, J., & Cai, L. (2002). Music type classification by spectral contrast feature. In Proceedings IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland (Vol. 1, pp. 113-116).
Kashino, K., & Murase, H. (1997). Sound source identification for ensemble music based on the music stream extraction. In Proceedings International Joint Conference on Artificial Intelligence, Nagoya, Aichi, Japan (pp. 127-134).

Kim, Y. K. (1999). Structured encoding of the singing voice using prior knowledge of the musical score. In Proceedings IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, New York (pp. 47-50).
Kim, Y. K., & Brian, W. (2002). Singer identification in popular music recordings using voice coding feature. In Proceedings International Symposium of Music Information Retrieval (ISMIR), Paris.
Kraft, R., Lu, Q., & Teng, S. (2001). Method and apparatus for music summarization and creation of audio summaries. U.S. Patent No. 6,225,546. Washington, DC: U.S. Patent and Trademark Office.
Logan, B., & Chu, S. (2000). Music summarization using key phrases. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey (Vol. 2, pp. II749-II752).
Lu, L., & Zhang, H. (2003). Automated extraction of music snippets. In Proceedings ACM International Conference on Multimedia, Berkeley, California (pp. 140-147).
Maddage, N.C., Wan, K., Xu, C., & Wang, Y. (2004a). Singing voice detection using twice-iterated composite Fourier transform. In Proceedings International Conference on Multimedia and Expo, Taipei, Taiwan.
Maddage, N.C., Xu, C., & Wang, Y. (2003). An SVM-based classification approach to musical audio. In Proceedings International Symposium of Music Information Retrieval, Baltimore, Maryland (pp. 243-244).
Maddage, N.C., Xu, C., & Wang, Y. (2004b). Singer identification based on vocal and instrumental models. In Proceedings International Conference on Pattern Recognition, Cambridge, UK.
Maddage, N.C., Xu, C. S., Lee, C. H., Kankanhalli, M.S., & Tian, Q. (2002). Statistical analysis of musical instruments. In Proceedings IEEE Pacific-Rim Conference on Multimedia, Taipei, Taiwan (pp. 581-588).
Mani, I., & Maybury, M.T. (Eds.). (1999). Advances in automatic text summarization. Boston: MIT Press.
Martin, K. D. (1999). Sound-source recognition: A theory and computational model. PhD thesis, MIT Media Lab.
Nakamura, Y., & Kanade, T. (1997). Semantic analysis for video contents extraction: Spotting by association in news video. In Proceedings of ACM International Multimedia Conference, Seattle, Washington (pp. 393-401).
Pachet, F., & Cazaly, D. (2000). A taxonomy of musical genre. In Proceedings Content-Based Multimedia Information Access Conference, Paris.
Pachet, F., Weatermann, G., & Laigre, D. (2001). Musical data mining for EMD. In Proceedings WedelMusic Conference, Italy.
Patel, N. V., & Sethi, I. K. (1996). Audio characterization for video indexing. In Proceedings SPIE Storage and Retrieval for Still Image and Video Databases IV, San Jose, California, 2670 (pp. 373-384).
Pfeiffer, S., Lienhart, R., Fischer, S., & Effelsberg, W. (1996). Abstracting digital movies automatically. Journal of Visual Communication and Image Representation, 7(4), 345-353.
Pye, D. (2000). Content-based methods for the management of digital music. In Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey (Vol. 4, pp. 2437-2440).


Rabiner, L.R., & Juang, B.H. (1993). Fundamentals of speech recognition. New York: Prentice Hall.
Rabiner, L.R., & Schafer, R.W. (1978). Digital processing of speech signals. New York: Prentice Hall.
Rossing, T.D., Moore, F.R., & Wheeler, P.A. (2002). Science of sound (3rd ed.). Boston: Addison-Wesley.
Saitou, T., Unoki, M., & Akagi, M. (2002). Extraction of f0 dynamic characteristics and developments of control model in singing voice. In Proceedings of the 8th International Conference on Auditory Display, Kyoto, Japan (pp. 275-278).
Scheirer, E.D. (1998). Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America, 103(1), 588-601.
Shao, X., Xu, C., & Kankanhalli, M.S. (2003). Automatically generating summaries for musical video. In Proceedings IEEE International Conference on Image Processing, Barcelona, Spain (Vol. 3, pp. II547-II550).
Shao, X., Xu, C., Wang, Y., & Kankanhalli, M.S. (2004). Automatic music summarization in compressed domain. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada.
Sundaram, H., Xie, L., & Chang, S.F. (2002). A utility framework for the automatic generation of audio-visual skims. In Proceedings ACM International Conference on Multimedia, Juan-les-Pins, France (pp. 189-198).
Sundberg, J. (1987). The science of the singing voice. Northern Illinois University Press.
Ten Minute Master No 18: Song Structure. (2003). MUSIC TECH Magazine, 62-63.
Tolonen, T., & Karjalainen, M. (2000). A computationally efficient multi pitch analysis model. IEEE Transactions on Speech and Audio Processing, 8(6), 708-716.
Tsai, W.H., Wang, H.M., Rodgers, D., Cheng, S.S., & Yu, H.M. (2003). Blind clustering of popular music recordings based on singer voice characteristics. In Proceedings International Symposium of Music Information Retrieval, Baltimore, Maryland (pp. 167-173).
Tzanetakis, G., & Cook, P. (2000). Sound analysis using MPEG compressed audio. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey (Vol. 2, pp. II761-II764).
Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), 293-302.
Tzanetakis, G., Essl, G., & Cook, P. (2001). Automatic musical genre classification of audio signals. In Proceedings International Symposium on Music Information Retrieval, Bloomington, Indiana (pp. 205-210).
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons.
Wang, Y., & Vilermo, M. (2001). A compressed domain beat detector using MP3 audio bit streams. In Proceedings ACM International Conference on Multimedia, Ottawa, Ontario, Canada (pp. 194-202).
Wold, E., Blum, T., Keislar, D., & Wheaton, J. (1996). Content-based classification, search and retrieval of audio. IEEE Multimedia, 3(3), 27-36.
Xu, C., Maddage, N.C., Shao, X., Cao, F., & Tian, Q. (2003). Musical genre classification using support vector machines. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Hong Kong, China (pp. V429-V432).


Xu, C., Zhu, Y., & Tian, Q. (2002). Automatic music summarization based on temporal, spectral and Cepstral features. In Proceedings IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland (pp. 117-120).
Xu, C.S., Maddage, N.C., & Shao, X. (in press). Automatic music classification and summarization. IEEE Transactions on Speech and Audio Processing.
Yow, D., Yeo, B.L., Yeung, M., & Liu, G. (1995). Analysis and presentation of soccer highlights from digital video. In Proceedings of Asian Conference on Computer Vision, Singapore.
Zhang, T. (2003). Automatic singer identification. In Proceedings International Conference on Multimedia and Expo, Baltimore, Maryland (Vol. 1, pp. 33-36).
Zhang, T., & Kuo, C.C.J. (2001). Audio content analysis for online audiovisual data segmentation and classification. IEEE Transactions on Speech and Audio Processing, 9(4), 441-457.


Chapter 6

A Multidimensional Approach for Describing Video Semantics


Uma Srinivasan, CSIRO ICT Centre, Australia Surya Nepal, CSIRO ICT Centre, Australia

ABSTRACT

In order to manage large collections of video content, we need appropriate video content models that can facilitate interaction with the content. The important issue for video applications is to accommodate the different ways in which a video sequence can function semantically. This requires that the content be described at several levels of abstraction. In this chapter we propose a video metamodel called VIMET and describe an approach to modeling video content such that video content descriptions can be developed incrementally, depending on the application and video genre. We further define a data model to represent video objects and their relationships at several levels of abstraction. With the help of an example, we then illustrate the process of developing a specific application model that develops incremental descriptions of video semantics using our proposed video metamodel (VIMET).


INTRODUCTION
With the convergence of Internet and multimedia technologies, video content holders have new opportunities to provide novel media products and services by repurposing their content and delivering it over the Internet.

In order to support such applications, we need video content models that allow video sequences to be represented and managed at several levels of semantic abstraction. Modeling video content to support semantic retrieval is a hard task, because video semantics means different things to different people. The MPEG-7 community (ISO/ISE 2001) has spent considerable effort and time in coming to grips with ways to describe video semantics at several levels, in order to support a variety of video applications. The task of developing content models that show the relationships across several levels of video content descriptions has been left to application developers. Our aim in this chapter is to provide a framework that can be used to develop video semantics for specific applications, without limiting the modeling to any one domain, genre or application.

The Webster Dictionary defines the meaning of semantics as the study of relationships between signs and symbols and what they represent. In a way, from the perspective of feature analysis work (MPEG, 2000; Rui, 1999; Gu, 1998; Flickner et al., 1995; Chang et al., 1997; Smith & Chang, 1997), low-level audiovisual features can be considered as a subset, or a part, of the visual signs and symbols that convey a meaning. In this context, audio and video analysis techniques have provided a way to model video content using some form of constrained semantics, so that video content can be retrieved at some basic level such as shots. In the larger context of video information systems, it is now clear that feature analyses alone are not adequate to support video applications. Consequently, research focus has shifted to analysing videos to identify higher-level semantic content such as objects and events. More recently, video semantic modeling has been influenced by film theory or semiotics (Hampapur, 1999; Colombo et al., 2001; Bryan-Kinns, 2000), where a meaning is conveyed through a relationship of signs and symbols that are manipulated using editing, lighting, camera movements and other cinematic techniques. Whichever theory or technology one chooses to follow, it is clear that we need a video model that allows us to specify relationships between signs and symbols across video sequences at several levels of interpretation (Srinivasan et al., 2001).

The focus of this chapter is to present an approach to modeling video content such that video semantics can be described incrementally, based on the application and the video genre. For example, while describing a basketball game, we may wish to describe the game at several levels: the colour and texture of players' uniforms, the segments that had the crowd cheering loudly, the goals scored by a player or a team, a specific movement of a player and so on. In order to facilitate such descriptions, we have developed a framework that is generic rather than definitive, but still supports the development of application-specific semantics.

The next section provides a background survey of some of the approaches used to model and represent the semantics associated with video content. In the third section, we present our Video Metamodel Framework (VIMET), which helps to model video semantics at different levels of abstraction. It allows users to develop and specify their own semantics, while simultaneously exploiting the results of video analysis techniques. In the fourth section, we present a data model that implements the VIMET metamodel. In the fifth section we present an example.
Finally, the last section provides some conclusions and future directions.


BACKGROUND
In order to highlight the challenges and issues involved in modeling video semantics, we have organized video semantic modeling approaches into four broad categories and cite a few related works under each category.

Feature Extraction Combined with Clustering or Temporal Grouping


Most approaches that fall under this category have focused on extracting visual features such as colour and motion (Rui et al., 1999; Gu, 1998; Flickner et al., 1995), and abstracting visual signs to identify semantic information such as objects, roles, events and actions. Some approaches have also included audio analysis to extract semantics from videos (MPEG MAAATE, 2000). We recognize two types of features that are extracted from videos: static features and motion-based features. Static features are mostly perceptual features extracted from stationary images or single frames. Examples are colour histograms, texture maps, and shape polygons (Rui et al., 1999). These features have been exploited as attributes that represent colour, texture and shape of objects and images in a single frame. Motion-based features are those that exploit the basic spatiotemporal nature of videos (Chang et al., 1997). Examples of motion-based low-level features are motion-vectors for visual features and frequency bands for audio signals. These features have been used to segment videos into shots or temporal segments. Once the features are extracted, they are clustered into groups or segmented into temporal units to classify video content at a higher semantic level. Hammoud et al. (2001) use such an approach for modeling video semantics. They extract shots, use clustering to identify similar shots, and use a time-space graph to represent temporal relationships between clusters for extracting scenes as semantic units. Hacid et al. (2000) present a database approach for modeling and querying video data. They use two layers for representing video content: feature and content layer, and semantic layer. The lowest layer is characterized by a set of techniques and algorithms for visual features and relationships. The top layer contains objects of interest, their description and their relationships. A significant component of this approach is the use of temporal cohesion to attach semantics to the video.

Relationship-Based Approach for Modeling Video Semantics


This approach is based on understanding the different types of relationships among video objects and their features. In videos, spatial and temporal relationships of features that occur over space and time have contributed significantly to identifying content at a higher level of abstraction. For example, in the MPEG domain, DCT coefficients and motion vectors are compared over a temporal interval to identify shots and camera operations (Meng et al., 1995). Similarly, spatial relationships of objects over multiple frames are used to identify moving objects (Gu, 1998). Smith and Benitez (2000) present a Multimedia-Extended Entity Relationship (MM-EER) model that captures various types of structural relationships such as spatiotemporal, intersection, and composition, and provides a unified multimedia description framework


for content-based description of multimedia in large databases. The modeling framework supports different types of relationships such as generalization, aggregation, association, structural relationships, and intensional relationships. Baral et al. (1998) also extended ER diagrams with a special attribute called core. This attribute stores the real object, in contrast to the abstraction reflected in the rest of the attributes. Pradhan et al. (2001) propose a set of interval operations that can be used to build an answer interval for a given query by synthesizing video intervals containing keywords similar to those of the query.

Theme-Based Annotation Synchronized to Temporal Structure


This approach is based on using textual annotations synchronized to a temporal structure (Yap et al. 1996). This approach is often used in television archives where segments are described thematically and linked to temporal structures such as shots or scenes. Hjelsvold and Midtstraum (1994) present a general data model for sharing and reuse of video materials. Their data model is based on two basic ideas: temporal structural representation of video and thematic annotations and relationships between them. Temporal structure of video includes frame, shot, scene, sequence, and so forth. The data model allows users to describe frame sequences using thematic annotations. It allows detailed descriptions of the content of the video material, which are not necessarily linked to structural components. More often they are linked to arbitrary frame sequences by establishing a relationship between the frame sequences and relevant annotations, which are independent of any structural components. Other works that fall under this category include (Bryan-Kinns, 2000; Srinivasan, 1999).

Approach Based on Knowledge Models and Film Semiotics


As we move up the semantic hierarchy to represent higher-level semantics in videos, in addition to extracting features and establishing their relationships, we need knowledge models that help us understand higher-level concepts that are specific to a video genre. Models that aim to represent content at higher semantic levels use some form of knowledge models or semiotic and film theory to understand the semantics of videos. Film theory or semiotics describes how a meaning is conveyed through a relationship of signs and symbols that are meaningful to a particular culture. This is achieved by manipulation of editing, lighting, camera movements and other cinematic techniques. Colombo et al. (2001) present a model that allows retrieval of commercial content based on their salient semantics. The semantics are defined from the semiotic perspective, that is, collections of signs and semantic features like colour, motion, and so forth, are used to build semantics. It classifies commercials into four different categories based on semiotics: practical, critical, utopic and playful. A set of rules is then defined to map the set of perceptual features to each of the four semiotic categories for commercials. Bryan-Kinns (2000) presents a framework (VCMF) for video content modeling which represents the structural and semantic regularities of classes of video recordings. VCMF can incorporate many domain-specific knowledge models to index video at the semantic level.

Hampapur (1999) presents an approach based on the use of knowledge models to build domain-specific video information systems. His aim is to provide higher-level semantic access to video data. The knowledge model proposed in his paper includes the semantic model, the cinematic model and the physical world model. Each model has several sub-models; for example, the cinematic model includes a camera motion model and a camera geometry model.

We realize that the above categories are not mutually exclusive and that many models and frameworks use a combination of two or more of these modeling strategies. We now describe our metamodel framework. It uses feature extraction, multimodal relationships, and semiotic theory to model video semantics at several levels of abstraction. In a way, we use a combination of the four approaches described above.

VIDEO METAMODEL (VIMET)


Figure 1 shows the video metamodel, which we call VIMET. The horizontal axis shows the temporal dimension of the video and the vertical axis shows the semantic dimension. A third dimension represents the domain knowledge associated with specialized interpretations of a film or a video (Metz, 1974; Lindley & Srinivasan, 1999). The metamodel represents: (1) static and motion-based features in both the audio and the visual domain, (2) different types of spatial and temporal relationships, (3) multimodal, multifeature relationships to identify events, and (4) concepts influenced by principles of film theory. The terms and concepts used in the metamodel are described below. The bottom layer shows the standard way in which a video is represented to show temporal granularity.

Figure 1. A multidimensional data model for building semantics (the shaded part represents the temporal (dynamic) component)

(Temporal dimension: static object, frame, shot, scene, clip. Semantic dimension: audio/video content representation; audio/video features (static and motion-based); single-feature relationships (spatial and temporal); multi-feature, multimodal relationships (structure of features, order of events); semantic concepts. Third dimension: domain knowledge.)

Object: Represents a single object within a frame. An object may be identified manually, or automatically if an appropriate object detection algorithm exists.

Frame: Represents a single image of a video sequence. There are 25 frames (or 30 in certain video formats) in a video of one-second duration.

Shot: A shot is a temporal video unit between two distinct camera operations. Shot boundaries occur due to different types of camera operations such as cuts, dissolves, fades and wipes. There are various techniques used to determine shots. For example, DCT coefficients are used in MPEG videos to identify distinct shot boundaries.

Scene: A scene is a series of consecutive shots constituting a unit from the narrative point of view and sharing some thematic visual content. For example, a video sequence that shows Martina Hingis playing the game point in the 1997 Australian Open final is a thematic scene.

Clip: A clip is an arbitrary length of video used for a specific purpose. For example, a video sequence that shows tourist places around Sydney is a video clip made for tourism purposes.

The semantic dimension helps model semantics in an incremental way using features and their relationships. It uses information from the temporal dimension to develop the semantics.

Static features: Represent features extracted from objects and frames. Shape, colour and texture are examples of static features extracted from objects. Similarly, global colour histograms and average colours are examples of features extracted from a frame.

Motion-based features: Represent features extracted from video using motion-based information. An example of such a feature is the motion vector.

The next level of conceptualisation occurs when we group individual features to identify objects and events using some criteria based on spatial and/or temporal relationships.

Spatial relationships: Two types of spatial relationships are possible for videos: topological relationships and directional relationships. Topological relationships include relations such as contains, covered by, and disjoint (Egenhofer & Franzosa, 1991). Directional relationships include relations such as right-of, left-of, above and below (Frank, 1996).

Temporal relationships: These include the most commonly used of Allen's 13 temporal relationships (Allen, 1983): before, meets, overlaps, finishes, starts, contains, equals, during, started by, finished by, overlapped by, met by and after. (A small illustration follows.)
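For illustration, the sketch below shows how a few of Allen's interval relations could be evaluated over the temporal attributes of two video objects; the (start, end) interval representation and the subset of relations chosen are ours, not prescribed by the metamodel.

# A sketch of checking a few of Allen's temporal relations between two video
# objects, each represented simply by (start, end) times in seconds.
def before(a, b):
    return a[1] < b[0]                 # a ends before b starts

def meets(a, b):
    return a[1] == b[0]                # a ends exactly where b starts

def overlaps(a, b):
    return a[0] < b[0] < a[1] < b[1]   # a starts first and the two overlap

def during(a, b):
    return b[0] < a[0] and a[1] < b[1] # a lies strictly inside b

shot = (12.0, 18.5)
crowd_cheer = (15.0, 25.0)
print(overlaps(shot, crowd_cheer))     # True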

Moving further up in the semantic dimension, we use multiple features both in the audio and the visual domain, and establish (spatial and temporal) relationships across these multimodal features to model higher-level concepts. In the temporal dimension, this helps us identify semantic constructs based on structure of features and order of events occurring in a video.

Structure of features: Represents patterns of features that define a semantic construct. For example, a group of regions connected through some spatial relationship over multiple frames, combined with some loudness value spread over a temporal segment, could give an indication of a particular event occurring in the video.

Order of events: Represents recurring patterns of events identified manually or detected automatically using multimodal feature relationships. For example, a camera pan is a cinematic event that is derived by using a temporal relationship of motion vectors (the motion vectors between consecutive frames due to the pan should point in a single direction that exhibits a strong modal value corresponding to the camera movement). Similarly, a sudden burst of sound is a perceptual auditory event. These cinematic and perceptual events may be arranged in a certain sequence from a narrative point of view to convey a meaning. For example, a close-up followed by a loud sound could be used to produce a dramatic effect. Similarly, a crowd cheer followed by a scorecard could be used to determine field goals in basketball videos. (We will describe this in greater detail later in our example.)

Modeling the meaning of a video, shot, or sequence requires the description of the video object at several levels of interpretation. Film semiotics, pioneered by the film theorist Christian Metz (1974), has identified five levels of cinematic codification that cover visual features, objects, actions and events depicted in images, together with other aspects of the meaning of the images. These levels are represented in the third dimension of the metamodel. The different levels interact, influencing the domain knowledge associated with a video.

Perceptual level: This is the level at which visual phenomena become perceptually meaningful, the level at which distinctions are perceived by the viewer. This level is concerned with features such as colour, loudness and texture.

Cinematic level: This level is concerned with formal film and video editing techniques that are incorporated to produce expressive artifacts, for example, arranging a certain rhythmic pattern of shots to produce a climax, or introducing a voice-over to shift the gaze.

Diegetic level: This refers to the four-dimensional spatiotemporal world posited by a video image or a sequence of video images, including spatiotemporal descriptions of objects, actions, or events that occur within that world.

Connotative level: This is the level of metaphorical, analogical and associative meanings that the objects and events in a video may have. An example of connotative significance is the use of facial expression to denote some emotion.

Subtextual level: This is the level of more specialized, hidden and suppressed meanings of symbols and signifiers that are related to special cultural and social groups.

The main idea of this metamodel framework is to allow users to develop their own application models, based on their semantic notion and interpretation, by specifying objects and relationships of interest at any level of granularity.


Next we describe the data model that implements the ideas presented in the VIMET metamodel. The elements of the data model allow application developers to incrementally develop the video semantics that need to be modeled and represented in the context of the application domain.

VIDEO DATA MODEL


The main elements of the data model are: (i) the Video Object (VO), to model a video sequence of any duration; (ii) VO attributes that are either explicitly specified or based on automatically detected audio, visual, spatial and temporal features; (iii) VO attribute-level relationships for computed attribute values; (iv) the Video Concept Object (VCO), to accommodate fuzzy descriptions; and (v) object-level relationships, to support multimodal and multifeature relationships across video sequences.

Video Object (VO)


We define a video object (VO) as an abstract object that models a video sequence at any level of abstraction in the semantic-temporal dimension shown in Figure 1. This is the most primitive object that can be used in a query.

Definition: A typical Video Object (VO) is a five-tuple VO = <Xf, Sf, Tf, Vf, Af>, where Xf is a set of textual attributes, Sf is a set of spatial attributes, Tf is a set of temporal attributes, Vf is a set of visual attributes, and Af is a set of audio attributes.

Xf represents a set of textual attributes that describe the semantics associated with the object at different levels of abstraction. This could be metadata or any textual description: a manual annotation of the content or about the content.

Sf represents spatial attributes that specify the spatial bounds of the object. This pertains to the space occupied by the object in a two-dimensional plane, for example the X and Y positions of the object or a bounding box around the object.

Tf represents the temporal attributes that describe the temporal bounds of the object. Temporal attributes include start time, end time, and so forth. A time interval is the basic primitive used here.

Vf represents a set of attributes that characterize the object in terms of visual features. The values of these attributes are typically the result of visual feature extraction algorithms; examples are colour histograms and motion vectors.

Af represents a set of attributes that characterize the object in terms of aural features. The values of these attributes are typically the result of audio analysis and feature extraction algorithms; examples are loudness curves and pitch values.
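The five-tuple definition can be rendered as a simple data structure, sketched below; the concrete Python types chosen for each attribute set (dictionaries keyed by attribute name) are illustrative assumptions, not part of the model itself.

# A sketch of the Video Object (VO) five-tuple <Xf, Sf, Tf, Vf, Af>.
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class VideoObject:
    textual:  Dict[str, Any] = field(default_factory=dict)  # Xf: annotations, metadata
    spatial:  Dict[str, Any] = field(default_factory=dict)  # Sf: bounding box, X/Y position
    temporal: Dict[str, Any] = field(default_factory=dict)  # Tf: start time, end time, duration
    visual:   Dict[str, Any] = field(default_factory=dict)  # Vf: colour histogram, motion vectors
    audio:    Dict[str, Any] = field(default_factory=dict)  # Af: loudness curve, pitch values

# Example: the shot of Figure 2, with a duration and a loudness histogram
shot = VideoObject(temporal={'duration': 20.0},
                   audio={'loudness_histogram': [1, 2, 3, 4, 5]})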

Figure 2. A typical video object with two sets of attributes: audio and temporal
(A VO with attributes shown as (name, value) pairs: an audio attribute, loudness histogram [1, .., 5], and a temporal attribute, duration 20 sec.)

A typical video object, a video shot, is shown in Figure 2. The diagram shows a shot with the temporal attribute duration and the audio attribute loudness histogram. (To keep the diagram simple we have shown only a subset of the possible attributes of a video object.) Each attribute has a value domain and is shown as a name-value pair. For example, the audio attribute set has an audio attribute loudness whose value is given by a loudness histogram. Similarly, the temporal attribute set has an attribute duration whose value is given in seconds. The application determines the content of the video objects modeled. For example, a video object could be a frame in the temporal dimension, with an attribute specifying its location in the video. In the semantic dimension, the same frame could have visual features such as colour and texture histograms.

VO Attributes
The attributes of a video object are either intensional or extensional. Extensional attribute values are text data, drawn from the character domain. The possible sources of extensional attributes are annotations, transcripts, keywords, textual descriptions, terms from a thesaurus, and so forth. Extensional attributes fall in the feature category in Figure 1. Intensional attributes have specific value domains where the values are computed using appropriate feature extraction functions. Where the extensional attributes of the video objects are semantic in nature, relationships across such objects can be expressed as association, aggregation, and generalisation as per object-oriented or EER modeling methodologies. For intensional attributes, however, we need to establish specific relationships for each attribute type. For example, when we consider the temporal attribute whose value domain is the time interval, we need to specify temporal relationships and corresponding operations that are valid over time intervals. We define two sets of value domains for intensional attributes. The first set is the numerical domain and the second set is the linguistic domain.


The purpose of providing a linguistic domain is to allow a fuzzy description of some attributes. For example, a temporal attribute duration could have a value equal to 20 (seconds) in the numerical domain. For the same temporal attribute, we define a second domain, which is a constrained set of linguistic terms such as short, long and very long. For this, we use a fuzzy linguistic framework, where the numerical values of duration are mapped to fuzzy terms such as short, long, and so forth (Driankov et al., 1993). The fuzzy linguistic framework is defined as a four-tuple <X, LX, QX, MX>, where:

X denotes the symbolic name of a linguistic variable (an attribute in our case), such as duration.
LX is the set of linguistic terms that X can take, such as short, long and very long.
QX is the numeric domain where X takes its values, such as time in seconds.
MX is a semantic function which gives meaning to the linguistic terms by mapping X's values from LX to QX. The function MX depends on the type of attribute and the application domain. We defined one such function in Nepal et al. (2001).

Similarly, when we consider an audio attribute value based on the loudness histogram/curve, we map the numerical values to semantic terms such as loud, soft, very loud and so on. By using a fuzzy linguistic framework, we exploit the human ability to distinguish variations in audio and visual features. Figure 3 shows the mapping of a numerical attribute-value domain to a linguistic attribute-value domain.
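The following is a minimal sketch of how such a semantic function MX could be realised for the duration attribute. The trapezoidal membership functions, the particular thresholds, and the helper names are illustrative assumptions, not the definitions used in Nepal et al. (2001).

# Illustrative fuzzy linguistic mapping from the numeric domain QX to terms in LX.
# The membership functions and thresholds below are assumptions for illustration.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function over the numeric domain."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# MX for the temporal attribute 'duration' (seconds): LX = {short, medium, long}
DURATION_TERMS = {
    "short":  lambda s: trapezoid(s, -1, 0, 10, 20),
    "medium": lambda s: trapezoid(s, 10, 20, 40, 60),
    "long":   lambda s: trapezoid(s, 40, 60, 3600, 3601),
}

def fuzzify(value, term_functions):
    """Membership grade of each linguistic term for a numeric attribute value."""
    return {term: f(value) for term, f in term_functions.items()}

def best_term(value, term_functions):
    """The linguistic term with the highest membership grade."""
    grades = fuzzify(value, term_functions)
    return max(grades, key=grades.get)

print(fuzzify(20, DURATION_TERMS))    # {'short': 0.0, 'medium': 1.0, 'long': 0.0}
print(best_term(20, DURATION_TERMS))  # 'medium'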

VO Attribute-Level Relationships
Relationships across video objects can be defined at two levels: the object and attribute levels. We will discuss the object level relationships in a later section. Here we discuss attribute-level relationships to illustrate the capabilities of the data model. Each intensional attribute value allows us to define relationships that are valid for that particular attribute-value domain.

Figure 3. Mapping of numerical attribute values to fuzzy attribute values

[Figure 3 shows the VO of Figure 2, with its numerical values (Loudness: [1,..,5]; Duration: 20 sec) mapped to linguistic values: loudness to {soft, loud, very soft, very loud} and duration to {short, long, medium, very short, very long}.]


Table 1. Attribute-level relationships

Attribute     Relationship Type                     Relationship Value Set
Brightness    Visual, based on light intensity      {Brighter, Dimmer, Similar}
Contrast      Visual, based on contrast measures    {Higher, Lower, Similar}
Color         Visual, based on color                {Same, Different, Resembles}
Loudness      Audio, based on sound level           {Louder, Softer, Similar}
Pitch         Audio, based on pitch                 {Higher, Lower, Similar}
Size          Visual, based on size                 {Bigger, Smaller, Similar}
Duration      Temporal, based on duration           {Shorter, Longer, Equal}

For example, when we consider visual relationships based on the brightness attribute, we should be able to establish relationships such as brighter-than, dimmer-than, and similar by grouping light intensity values. By exploiting such attribute-level relationships across each intensional attribute, it is possible to establish a multidimensional relationship between video objects. Here each dimension reflects a perceivable variation in the computed values of the relevant intensional attribute. Table 1 shows an illustrative set of attribute-level relationships. The relationship values are drawn from a predefined, constrained set of relationships valid for a particular attribute type. A typical relationship value set for brightness is {brighter-than, dimmer-than, similar}. Each relationship in the set is given a threshold based on the eye's ability to discriminate between light intensity levels. The fuzzy scale helps in accommodating subjectivity at the user level. Next we give some examples of directional relationships used for spatial reasoning in large-scale spaces, that is, spaces that cannot be seen or understood from a single point of view. Such relationships are useful in application areas such as Geographical Information Systems. In the case of video databases, it is more meaningful to define equivalent positional relationships, as we are dealing with spaces within a frame that can be viewed from a single point. The nine positional relationships are given by AP = {Top, TopLeft, Left, BottomLeft, Bottom, BottomRight, Right, TopRight, Centre}, as shown in Figure 4. These nine relationships can be used as linguistic values for a spatial attribute position given by a bounding box. Similarly, appropriate linguistic terms can be developed for other attributes, as shown in Table 1.
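As a concrete illustration, the sketch below derives the brightness relationship between two video objects from a discrimination threshold. The threshold value and the attribute representation are assumptions introduced here for illustration, not values given in the chapter.

# Illustrative attribute-level relationship for the brightness attribute.
# The threshold standing in for the eye's ability to discriminate light-intensity
# levels is an assumed value on a 0-255 scale.

BRIGHTNESS_THRESHOLD = 10

def brightness_relationship(brightness_a, brightness_b, threshold=BRIGHTNESS_THRESHOLD):
    """Relationship of VO A to VO B: 'brighter-than', 'dimmer-than' or 'similar'."""
    difference = brightness_a - brightness_b
    if difference > threshold:
        return "brighter-than"
    if difference < -threshold:
        return "dimmer-than"
    return "similar"

# Shot A with mean intensity 180 versus shot B with mean intensity 120
print(brightness_relationship(180, 120))  # 'brighter-than'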

Video Concept Object


In order to support the specification of an attribute relationship (using linguistic terms), we define a video concept object. A Video Concept Object (VCO) is a video object with a semantic attribute value attached to each intensional attribute.

Figure 4. Positional relations in a two-dimensional space

The main difference between a VO and a VCO lies in the domains of the intensional attributes. The attributes of a VO have numerical domains, whereas the attributes of a VCO have domains that are sets of linguistic terms, which form the domains for the semantic attribute values. The members of each set are controlled vocabulary terms that reflect the (fuzzy) semantic value for each attribute type. The domain set will be different for audio, visual, temporal and spatial attributes.

Definition: A typical Video Concept Object (VCO) is a five-tuple VCO = <Xc, Sc, Tc, Ac, Vc> where

Xc is a set of textual attributes that define a concept; the attribute values are drawn from the character domain;
Sc is a set of spatial attributes whose value domain is a set of (fuzzy) terms that describe relative positional and directional relationships;
Tc is a set of temporal attributes whose values are drawn from a set of fuzzy terms used to describe a time interval or duration;
Ac is a set of audio attributes (an example is a loudness attribute whose values are drawn from a set of fuzzy terms that describe loudness);
Vc is a visual attribute whose values are drawn from a set of fuzzy terms that describe that particular visual attribute.

The relationship between a VO and a VCO is established using fuzzy linguistic mapping, as shown in Figure 6. In a fuzzy linguistic model, an attribute name-relationship pair is equivalent to a fuzzy variable-value pair. In general, a user can query and retrieve video from a database using any primitive VO/VCO. An example is "retrieve all videos where an object A is at the bottom-right of the frame". In this example, the expressive power of the query is based on a single attribute value and is limited. A more interesting query would be "retrieve all video sequences where the loudness value is very high and the duration of the sequence is short". Such queries involve both VCOs and VOs.
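To make the VO-to-VCO derivation concrete, the following sketch maps the numerical attributes of a VO to linguistic values using simple ordered thresholds. The class names, attribute choices, and threshold values are assumptions introduced for illustration; a full implementation would use the semantic function MX described earlier.

# Illustrative derivation of a VCO from a VO. Attribute names, term sets and
# thresholds are assumptions made for this example only.
from dataclasses import dataclass, field

@dataclass
class VideoObject:                      # VO: intensional attributes, numerical domains
    mean_loudness: float                # e.g. derived from the loudness histogram [1,..,5]
    duration_sec: float                 # e.g. 20

@dataclass
class VideoConceptObject:               # VCO: the same attributes, linguistic domains
    audio: dict = field(default_factory=dict)
    temporal: dict = field(default_factory=dict)

def to_linguistic(value, ordered_terms):
    """Map a numeric value to the first term whose upper bound it does not exceed."""
    for term, upper_bound in ordered_terms:
        if value <= upper_bound:
            return term
    return ordered_terms[-1][0]

LOUDNESS_TERMS = [("very soft", 1.5), ("soft", 2.5), ("loud", 3.5), ("very loud", float("inf"))]
DURATION_TERMS = [("very short", 3), ("short", 15), ("medium", 60), ("long", float("inf"))]

def derive_vco(vo):
    return VideoConceptObject(
        audio={"loudness": to_linguistic(vo.mean_loudness, LOUDNESS_TERMS)},
        temporal={"duration": to_linguistic(vo.duration_sec, DURATION_TERMS)},
    )

print(derive_vco(VideoObject(mean_loudness=3.8, duration_sec=20)))
# VideoConceptObject(audio={'loudness': 'very loud'}, temporal={'duration': 'medium'})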


Figure 6. An example VCO derived using a fuzzy linguistic model

[Figure 6 shows a VO with numerical attributes (Loudness: [1,..,5]; Duration: 20 sec) mapped through the fuzzy linguistic model to a VCO whose audio attribute loudness takes values from {very soft, soft, loud, very loud} and whose temporal attribute duration takes values from {very short, short, long, ...}.]

VO and VCO Relationships


In order to retrieve sequences with multiple relationships over different VOs and VCOs, we need appropriate operators. In addition, we need some kind of knowledge or heuristic rule(s) that help us interpret these relationships. This motivates us to define a Video Semantics System. Definition: A typical Video Semantics System (VSS) is a five-tuple VSS = < VO/VCO, Sro, Tro, Fro, I> where VO/VCO is a set of Video Objects or Video Concept Objects, Sro is a set of spatial relationship operators, Tro is a set of temporal relationship operators, and Fro is a set of multimodal operators. I is an interpretation model of the video object. Next we describe some operators and interpretation models. We define spatial operators for manipulating relationships of objects over spatial attributes. This helps in describing the structure of features (Figure 1). We define temporal operators for manipulating relationships over temporal attributes. This helps to describe order of events (Figure 1). For multimodal multifeature composition, however, we define fuzzy Boolean operators.


Spatial Relationship Operators (SROs)


A number of spatial relationships, with corresponding operators, have been proposed and used in spatial databases, geographical information systems and multimedia database systems. We group them into the following categories. It is important to note that SROs are valid only if the VO has attribute values that indicate spatial bounds.

Topological Relations: Topological relations are spatial relations that are invariant under bijective and continuous transformations that also have continuous inverses. Topological equivalence does not necessarily preserve distances and directions. Instead, topological notions include continuity, interior, and boundary, which are defined in terms of neighbourhood relations. The eight topological relationship operators are given as TR = {Disjoint, Meet, Equal, Inside, Contains, Covered_By, Covers, Overlap} (Egenhofer & Franzosa, 1991), as shown in Figure 7. These relationship operators are binary.

Relative Positional Relations: The relative positional relationship operators are given by RP = {Right of, Left of, Above, Below, In front of, Behind}. These operators are used to express the relationships among VCOs. These relative positional relationship operators are also binary.

Sro = TR ∪ RP
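The sketch below classifies the topological relation between two axis-aligned bounding boxes. Restricting the Egenhofer and Franzosa relations to rectangles is a simplifying assumption made only for illustration; the relations themselves are defined for general point sets.

# Illustrative check of the eight topological relations between two axis-aligned
# bounding boxes given as (x1, y1, x2, y2). Rectangles are an assumed simplification.

def topological_relation(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    if a == b:
        return "Equal"
    # No shared points at all
    if ax2 < bx1 or bx2 < ax1 or ay2 < by1 or by2 < ay1:
        return "Disjoint"
    # Boundaries coincide on opposite sides, so interiors cannot intersect
    if ax2 == bx1 or bx2 == ax1 or ay2 == by1 or by2 == ay1:
        return "Meet"
    a_in_b = ax1 >= bx1 and ay1 >= by1 and ax2 <= bx2 and ay2 <= by2
    b_in_a = bx1 >= ax1 and by1 >= ay1 and bx2 <= ax2 and by2 <= ay2
    touches = ax1 == bx1 or ay1 == by1 or ax2 == bx2 or ay2 == by2
    if a_in_b:
        return "Covered_By" if touches else "Inside"
    if b_in_a:
        return "Covers" if touches else "Contains"
    return "Overlap"

print(topological_relation((0, 0, 2, 2), (1, 1, 3, 3)))  # 'Overlap'
print(topological_relation((1, 1, 2, 2), (0, 0, 3, 3)))  # 'Inside'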

Temporal Relationship Operators (TROs)


As video is a continuous medium, temporal relations provide important cues for video data retrieval. Allen (1983) defined 13 temporal interval relations. Many variations of Allen's temporal interval relations have been proposed and used in temporal and multimedia databases. These relations are used to define the temporal relationships of events within a video. For example, the weather news appears after the sports news in a daily TV news broadcast. It is important to note that TROs are valid only for VOs that have attributes with temporal bounds. The 13 temporal relationships defined by Allen are given by TR = {Before, Meets, Overlaps, Finishes, Starts, Contains, Equals, During, Started by, Finished by, Overlapped by, Met by, After}. (Note: This set TR corresponds to the set of temporal relationship operators defined as Tro in VSS.)
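For illustration, a minimal classifier for Allen's 13 interval relations can be written as follows; representing intervals as (start, end) pairs with start < end is an assumption of this sketch.

# Illustrative classifier for Allen's 13 interval relations between two time
# intervals a = (a_start, a_end) and b = (b_start, b_end), with start < end.

def allen_relation(a, b):
    (as_, ae), (bs, be) = a, b
    if ae < bs:   return "Before"
    if be < as_:  return "After"
    if ae == bs:  return "Meets"
    if be == as_: return "Met by"
    if as_ == bs and ae == be: return "Equals"
    if as_ == bs: return "Starts" if ae < be else "Started by"
    if ae == be:  return "Finishes" if as_ > bs else "Finished by"
    if bs < as_ and ae < be: return "During"
    if as_ < bs and be < ae: return "Contains"
    return "Overlaps" if as_ < bs else "Overlapped by"

# Example: a crowd-cheer interval relative to a scoreboard-display interval
print(allen_relation((10, 13), (15, 18)))  # 'Before'
print(allen_relation((10, 16), (15, 18)))  # 'Overlaps'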

Figure 7. Examples of eight basic topological relationships


Multimodal Operators
Traditionally, the approach used for audiovisual content retrieval is based on similarity measures over the extracted features. In such systems, users need to be familiar with the underlying features and express their queries in terms of these (low-level) features. In order to allow users to formulate semantically expressive queries, we define a set of fuzzy terms that can be used to describe a feature using fuzzy attribute relationships. For example, when we are dealing with the loudness feature, we use a set of fuzzy terms such as very loud, loud, soft, and very soft to describe the relative loudness at the attribute level. This fuzzy measure is described at the VCO level. A query on a VSS, however, can be multimodal in nature and involve many VCOs. In order to have a generic way to combine multimodal relationships, we use simple fuzzy Boolean operators. The fuzzy Boolean operator set is given by FR = {AND, OR, NOT}. (Note: This set FR corresponds to the set of multimodal relationship operators defined as Fro in VSS.) An example of a multimodal, multifeature query is shown in Figure 8.

Figure 8. An example of multimodal query using fuzzy and temporal operators on VCOs
(<loudness, loud> AND <duration, short>) BEFORE (<loudness, soft> AND <duration, long>)

[Figure 8 depicts this query as a composition of temporal operators (BEFORE, AFTER, ...), spatial operators (RIGHT-OF, ...) and fuzzy operators (AND, OR, NOT) over VCO attribute-value pairs such as {<loudness, soft>, <loudness, very soft>, <loudness, loud>, <loudness, very loud>} and {<duration, long>, <duration, short>, <duration, very short>}, which are in turn derived from the numerical attributes of the underlying VO (Loudness: [1,..,5]; Duration: 20 sec).]
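A common way to give the fuzzy Boolean operators a concrete semantics is the standard min/max/complement convention of fuzzy logic. The sketch below uses that convention as an assumption, since the chapter does not fix a particular definition, and the membership grades are made up for illustration.

# Illustrative fuzzy Boolean operators over membership grades in [0, 1].
# The min/max/complement semantics is the usual fuzzy-logic convention and is
# assumed here; it is not prescribed by the chapter.

def fuzzy_and(*grades):
    return min(grades)

def fuzzy_or(*grades):
    return max(grades)

def fuzzy_not(grade):
    return 1.0 - grade

# Membership grades for one candidate clip (values are made up for illustration)
loudness_is_loud = 0.8
duration_is_short = 0.6

# "<loudness, loud> AND <duration, short>"
print(fuzzy_and(loudness_is_loud, duration_is_short))  # 0.6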


Interpretation Models
In the third section we identified five different levels of cinematic codification and description that help us interpret the meaning of a video. Lindley and Srinivasan (1998) have demonstrated empirically that these descriptions are meaningful in capturing distinctions in the way images are viewed and interpreted by a nonspecialist audience, and between this audience and the analytical terms used by filmmakers and film critics. Modeling the meaning of a video, shot, or sequence requires the description of the video object at any or all of the interpretation levels described in the third section. Interpretation models can be based on one or several of these five levels. A model based on perceptual visual characteristics is the subject of a large amount of current research on video content-based retrieval (Ahanger & Little, 1996). Models based on cinematic constructs incorporate expressive artifacts such as camera operations (Adam et al., 2000), lighting schemes and optical effects (Truong et al., 2001). Automated detection of cinematic features is another area of vigorous current research activity (see Ahanger & Little). When modeling at the diegetic level, the basic perceptual features of an image are organised into the four-dimensional spatiotemporal world posited by a video image or sequence of video images. This includes the spatiotemporal descriptions of agents, objects, actions and events that take place within that world. The interpretation model that we illustrate in the fifth section is a diegetic model based on interpretations at the perceptual level. Examples of connotative meanings are the emotions connoted by actions or by the expressions on the faces of characters. The subtextual level of interpretation involves representing specialised meanings of symbols and signifiers. For both the connotative and the subtextual levels, definitive representation of the meaning of a video is in principle impossible; the most that can be expected is an evolving body of interpretations. In the next section we show how a user/author can develop application semantics by explicitly specifying objects, relationships and interpretation models. The ability to create different models allows users/authors to contextualise the content to be modeled.

EXAMPLE
Here we illustrate the process of developing a specific application model to describe video content at different semantic levels. The process is incremental, where the different elements of the VIMET framework are used to retrieve goal segments from basketball videos. Here we have used a top-down approach to modeling, where we first developed the interpretation models from an empirical investigation of basketball videos (Nepal et al., 2001). We then developed an object model for the application. Figure 9 shows an instantiation of the metamodel. The observation of television broadcasts of basketball games gave some insights into commonly occurring patterns of events perceived during the course of a basketball game. We considered a subset of these as key events that occur repeatedly throughout the video of the game to develop a temporal interpretation model. They were: crowd cheer, scorecard display and change in the players' direction. We then identified the audio-video features that correspond to the manually observed key events.


Figure 9. An instance of the metamodel for the video semantic Goal in basketball videos using temporal interpretation model I

[Figure 9 sketches the layered metamodel instance: at the base, the perceptual and cinematic audio and video content of basketball video clips; above it, the audio-video features (loudness values, motion vectors, embedded text); then diegetic single-feature events (crowd cheer, scorecard); then connotative multimodal, multifeature relationships capturing the order of events (crowd cheer before scoreboard); and at the top, the interpretation Goal, informed by domain knowledge.]

The features identified include energy levels of audio signals, embedded text regions, and changes in the direction of motion. Since the focus of this chapter is on building the video semantic goal, we will not elaborate on the feature extraction algorithms in detail. We used the MPEG Maaate audio analysis toolkit (MPEG MAAATE, 2000) to identify high-energy segments from loudness values. Appropriate thresholds were applied to these high-energy segments to capture the event crowd cheer. The mapping to crowd cheer expresses a simple domain heuristic (diegetic and connotative interpretation) for sports video content. Visual feature extraction consists of identifying embedded text (Gu, 1998), which is mapped to scorecard displays in the sports domain. Similarly, camera pan motion (Srinivasan et al., 1997) is extracted and used to capture changes in the players' direction. The next level of semantics is developed by exploring the temporal order of events such as crowd cheer and scorecard displays. Here we use temporal interpretation models that express relationships over multiple features from multiple modalities.

Interpretation Models
We observed basketball videos and developed five different temporal interpretation models, as follows.


Model I
This model is based on the first key event, crowd cheer. Our observation shows that there is a loud cheer within three seconds of scoring a legal goal. Hence, in this model, the basic diegetic interpretation is that a loud cheer follows every legal goal, and a loud cheer only occurs after a legal goal. The model T1 is represented by

T1: Goal → [3 sec] crowd cheer

Intuitively, one can see that the converse may not always be true, as there may be other cases where a loud cheer occurs, for example, when a streaker runs across the field. Such limitations are addressed to some extent in the subsequent models.

Model II
This model is based on the second key event, scoreboard display. Our observation shows that the scoreboard display is updated after each goal. Our cinematic interpretation in this model is that a scoreboard display appears (usually as embedded text) within 10 seconds of scoring a legal goal. This is represented by the model T2:

T2: Goal → [10 sec] scoreboard

The limitation of this model is that the converse may not always be true, that is, a scoreboard display may not always be preceded by a legal goal.

Model III
This model uses a combination of two key events with a view to addressing the limitations of T1 and T2. As pointed out earlier, not all crowd cheers and scoreboard displays indicate a legal goal. Ideally, when we classify segments that show a shooter scoring goals, we need to avoid including events that do not show a goal, even though there may be a loud cheer. In this model, this is achieved by temporally combining the scoreboard display with the crowd cheer. Here, our diegetic interpretation is that every goal is followed by a crowd cheer within three seconds, and by a scoreboard display within seven seconds after the crowd cheer. This discards events that have a crowd cheer but no scoreboard, and events that have a scoreboard but no crowd cheer.

T3: Goal → [3 sec] crowd cheer → [7 sec] scoreboard

Model IV
This model addresses the strict constraints imposed in Model III. Our observations show that while the pattern in Model III is valid most of the time, there are cases where field goals are accompanied by loud cheers but no scoreboard display. Similarly, there are cases where goals are followed by scoreboard displays but not by a crowd cheer, as in the case of free throws. In order to capture such scenarios, we have combined models I and II to propose model IV.


T4: T1 ∪ T2, where ∪ denotes the union of the results from models T1 and T2.

Model V
While model IV covers most cases of legal goals, due to the inherent limitations pointed out for models I and II, model IV could potentially classify segments where there are no goals. Our observations show that model IV captures the maximum number of goals, but it also identifies many nongoal segments. In order to retain the goal segments and still remove the nongoal segments, we introduce the third key event, change in direction of players. In this model, if a crowd cheer appears within 10 seconds of a change in direction, or a scoreboard appears within 10 seconds of a change in direction, there is likely to be a goal within 10 seconds of the change in direction. This is represented as follows:

T5: Goal → [10 sec] change in direction → [10 sec] crowd cheer
OR
T5: Goal → [10 sec] change in direction → [10 sec] scoreboard

Although in all models the time interval between two key events is hardwired, the main idea is to provide a temporal link between key events to build up high-level semantics.
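To illustrate how such models can be operationalised, the sketch below combines lists of detected event times according to models T1 to T4. The event-detection step, the exact window semantics and the construction of goal intervals are assumptions, since the chapter only fixes the window lengths.

# Illustrative sketch of temporal interpretation models applied to lists of
# detected event times (in seconds). Window handling is an assumption.

def goals_from_single_event(event_times, window):
    """T1/T2 style: assume a goal occurred within `window` seconds before each event."""
    return [(t - window, t) for t in event_times]

def goals_from_event_pair(cheer_times, scoreboard_times, cheer_window=3, board_window=7):
    """T3 style: a cheer followed by a scoreboard display within board_window seconds
    indicates a goal within cheer_window seconds before the cheer."""
    goals = []
    for c in cheer_times:
        if any(0 <= s - c <= board_window for s in scoreboard_times):
            goals.append((c - cheer_window, c))
    return goals

def union(*goal_lists):
    """T4 style: union of the goal segments produced by other models."""
    return sorted(set(seg for lst in goal_lists for seg in lst))

cheers = [35, 120, 310]
boards = [40, 200, 315]
t1 = goals_from_single_event(cheers, window=3)
t2 = goals_from_single_event(boards, window=10)
t3 = goals_from_event_pair(cheers, boards)
t4 = union(t1, t2)
print(t3)  # [(32, 35), (307, 310)]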

Object Model
For the interpretation models described above, we combine video objects characterized by different types of intensional attributes to develop new video objects or video concept objects. An instance of the video data model is shown in Figure 10. For example, we use the loudness feature and a temporal feature to develop a new video concept object called crowd cheer. We first define the video semantics crowd cheer as:

DEFINE CrowdCheer AS
SELECT V1.Start-time, V1.End-time
FROM VCO V1
WHERE <V1.Loudness, Very-high> AND <V1.Duration, Short>

Here we use diegetic knowledge about the domain to interpret a segment in which a very high loudness value appears for a short duration as a crowd cheer. Similarly, we can define scorecard as follows:

DEFINE ScoreCard AS
SELECT V2.Start-time, V2.End-time
FROM VCO V2
WHERE <V2.Text-Region, Bottom-Right>

Figure 10. An instance of the data model for the example explained in this section

[Figure 10 shows the semantics Goal defined as CrowdCheer BEFORE ScoreCard. The interpretation models define CrowdCheer as <Loudness, Very high> AND <Duration, Short> over VCO1, and ScoreCard as <Text-Region, Bottom-Right> over VCO2, combined using the fuzzy operator AND and the temporal operator BEFORE. VCO1 (Loudness: {soft, loud, ...}; Duration: {short, long, ...}; Start-Time; End-Time) and VCO2 (Text-Region: {top-right, bottom-right, ...}; Start-Time; End-Time) are obtained through the fuzzy linguistic model from VO1 (Loudness: [1,..,5]; Start-Time: 20 sec; End-Time: 25 sec; Duration: 5 sec) and VO2 (Text-Region: [240,10,255,25]; Start-Time: 30 sec; End-Time: 40 sec).]

We then use the temporal interpretation model to build a query to develop a video concept object that represents a goal segment. This is defined as follows:

DEFINE Goal AS
SELECT [10] MIN(V1.Start-time, V2.Start-time), MAX(V1.End-time, V2.End-time)
FROM CrowdCheer V1, ScoreCard V2
WHERE V1 BEFORE V2

Here the query returns the 10 best-fit video sequences that satisfy the query criterion, goal.
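As an illustration of how such a definition might be evaluated, the sketch below pairs CrowdCheer and ScoreCard segments related by BEFORE and returns merged intervals. Ranking candidates by the temporal gap between the two events is an assumption; the chapter does not specify how the best-fit sequences are scored.

# Illustrative evaluation of the Goal definition over lists of (start, end) segments.
# The gap-based ranking is assumed, not prescribed by the chapter.

def goal_segments(crowd_cheers, scorecards, limit=10):
    candidates = []
    for c_start, c_end in crowd_cheers:
        for s_start, s_end in scorecards:
            if c_end < s_start:                      # V1 BEFORE V2
                gap = s_start - c_end
                candidates.append((gap, (min(c_start, s_start), max(c_end, s_end))))
    candidates.sort(key=lambda item: item[0])        # smaller gap = better fit (assumed)
    return [segment for _, segment in candidates[:limit]]

print(goal_segments([(32, 35), (118, 121)], [(38, 42), (200, 204)]))
# [(32, 42), (118, 204), (32, 204)]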


Table 2. A list of data sets used for our experiments and the results of our observations
Games Description Length Number of cheers 31 Number of scoreboard displays 49 Number of goals 52 Number of changes in direction of players movements 74

Australia Vs Cuba 1996 (Women) Australia Vs USA 1994 (Women) Australia Vs Cuba 1996 (Women) Australia Vs USA 1997 (Men)

00:42:00

00:30:00

27

46

30

46

00:14:51

16

16

17

51

00:09:37

13

16

18

48

We have implemented this application model for basketball videos. Implementing such a model necessarily involves developing appropriate interpretation models, which is a labour-intensive task. The VIMET framework helps this process by facilitating incremental descriptions of video content. We now present an evaluation of the interpretation models outlined in the previous section.

Evaluation of Interpretation Models


The data set used to evaluate the different temporal models for building the video semantic goal in basketball videos is shown in Table 2. The first two clips, A and B, were used for evaluating observations, and the last two clips, C and D, were used for evaluating the automatic analysis. The standard precision-recall method was used to compare our automatically generated goal segments with the ones manually judged by humans. Precision is the ratio of the number of relevant clips retrieved to the total number of clips retrieved. The ideal situation corresponds to 100% precision, when all retrieved clips are relevant:

precision = |relevant ∩ retrieved| / |retrieved|

Recall is the ratio of the number of relevant clips retrieved to the total number of relevant clips:

recall = |relevant ∩ retrieved| / |relevant|

We can achieve ideal recall (100%) by retrieving all clips from the data set, but the corresponding precision will be poor.
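A minimal sketch of how these measures can be computed for detected goal segments follows. Matching a retrieved segment to the ground truth by simple temporal overlap is an assumption, as the chapter does not state its matching rule.

# Illustrative computation of precision and recall over (start, end) segments.
# Overlap-based matching is an assumed simplification.

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def precision_recall(retrieved, relevant):
    correct = sum(1 for r in retrieved if any(overlaps(r, g) for g in relevant))
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

detected = [(32, 42), (118, 125)]
ground_truth = [(33, 36), (200, 204)]
print(precision_recall(detected, ground_truth))  # (0.5, 0.5)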


Table 3. A summary of results of automatic evaluation of various algorithms

Clip  Total baskets (relevant)  Algorithm  Total (retrieved)  Correct (relevant ∩ retrieved)  Precision (%)  Recall (%)
C     17                        T1         15                 12                              80             70.50
C     17                        T2         20                 15                              75             88.23
C     17                        T3         11                 11                              100            64.70
C     17                        T4         24                 16                              66.66          94.1
C     17                        T5         17                 15                              88.23          88.23
D     18                        T1         16                 11                              68.75          61.11
D     18                        T2         19                 17                              89.47          94.44
D     18                        T3         10                 10                              100            55.55
D     18                        T4         25                 18                              72.0           100
D     18                        T5         22                 16                              72.72          88.88


We evaluated the temporal interpretation models on our example data set. The manually identified legal goals (which include field goals and free throws) of the videos in our data set are shown in column 2 of Table 3. The results of the automatic analysis show that the combination of all three key events performs much better, with high recall and precision values (~88%). In model T1, the model correctly identifies 12 goal segments out of 17 for video C and 11 out of 18 for video D. However, the total number of crowd cheer segments detected by our crowd cheer detection algorithm is 15 and 16 for videos C and D, respectively. That is, there are a few legal goal segments that do not have a crowd cheer, and vice versa. Further analysis shows that crowd cheers resulting from other interesting events, such as fast breaks, clever steals and great checks or screens, give false positive results. We observed that most of the field goals are accompanied by crowd cheer. However, many goals scored from free throws are not accompanied by crowd cheer. We also observed that in certain cases a lack of supporters for a team among the spectators yields false negative results. In model T2, the model correctly identifies 15 goals out of 17 in video C and 17 out of 18 in video D. Our further analysis confirmed that legal goals due to free throws are often not accompanied by scoreboard displays, particularly when the scoreboard is updated at the end of the free throws rather than after each free throw. Similarly, the feature extraction algorithm used for scoreboard detection detects not only scoreboards but also other textual displays, such as coaches' names and the number of fouls committed by a player. Such textual features increase the number of false positive results. We plan to use the heuristics developed here to improve our scoreboard detection algorithm in the future. The above discussion is valid for algorithms T3, T4 and T5 as well.


CONCLUDING REMARKS
Emerging new technologies in video delivery, such as streaming over the Internet, have made video content a significant component of the information space. Video content holders now have an opportunity to provide new video products and services by reusing their video collections. This requires more than content-based analysis systems. We need effective ways to model, represent and describe video content at several semantic levels that are meaningful to users of video content. At the lowest level, content can be described using low-level features such as colour and texture, and at the highest level the same content can be described using high-level concepts. In this chapter, we have provided a systematic way of developing content descriptions at several semantic levels. In the example in the fifth section, we have shown how (audio and visual) feature extraction techniques can be used effectively, together with interpretation models, to develop higher-level semantic descriptions of content. Although the example is specific to the sports genre, the VIMET framework and the associated data model provide a platform to develop several descriptions of the same video content, depending on the interpretation level of the annotator. The model thus supports one of the mandates of MPEG-7 by allowing content descriptions that accommodate multiple interpretations.

REFERENCES
Adam, B., Dorai, C., & Venkatesh, S. (2000). Study of shot length and motion as contributing factors to movie tempo. ACM Multimedia, 353-355.
Ahanger, G., & Little, T.D.C. (1996). A survey of technologies for parsing and indexing digital video. Journal of Visual Communication and Image Representation, 7(1), 28-43.
Allen, J.F. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11), 832-843.
Baral, C., Gonzalez, G., & Son, T. (1998). Conceptual modeling and querying in multimedia databases. Multimedia Tools and Applications, 7, 37-66.
Bryan-Kinns, N. (2000). VCMF: A framework for video content modeling. Multimedia Tools and Applications, 10(1), 23-45.
Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., & Zhong, D. (1997). VideoQ: An automated content based video search system using visual cues. ACM Multimedia, 313-324, Seattle, Washington, November.
Colombo, C., Bimbo, A.D., & Pala, P. (2001). Retrieval of commercials by semantic content: The semiotic perspective. Multimedia Tools and Applications, 13, 93-118.
Driankov, D., Hellendoorn, H., & Reinfrank, M. (1993). An introduction to fuzzy control. Springer-Verlag.
Egenhofer, M., & Franzosa, R. (1991). Point-set topological spatial relations. International Journal of Geographic Information Systems, 5(2), 161-174.
Fagin, R. (1999). Combining fuzzy information from multiple systems. Proceedings of the 15th ACM Symposium on Principles of Database Systems (pp. 83-99).
Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., & Steele, D. (1995). Query by image and video content: The QBIC system. Computer, 28(9), 23-32.

Frank, A.U. (1996). Qualitative spatial reasoning: Cardinal directions as an example. International Journal of Geographic Information Systems, 10(3), 269-290.
Gu, L. (1998). Scene analysis of video sequences in the MPEG domain. Proceedings of the IASTED International Conference on Signal and Image Processing, October 28-31, Las Vegas.
Hacid, M.S., Decleir, C., & Kouloumdjian, J. (2000). A database approach for modeling and querying video data. IEEE Transactions on Knowledge and Data Engineering, 12(5), 729-750.
Hammoud, R., Chen, L., & Fontaine, D. (2001). An extensible spatial-temporal model for semantic video segmentation. TRANSDOC project. Online at http://transdoc.ibp.fr/
Hampapur, A. (1999). Semantic video indexing: Approaches and issues. SIGMOD Record, 28(1), 32-39.
Hjelsvold, R., & Midtstraum, R. (1994). Modelling and querying video data. Proceedings of the 20th VLDB Conference, Santiago, Chile (pp. 686-694).
ISO/IEC JTC1/SC29/WG11 (2001). Overview of the MPEG-7 standard (version 5). Singapore, March.
Jain, R., & Hampapur, A. (1994). Metadata in video databases. SIGMOD Record, 23(4), 27-33.
Lindley, C., & Srinivasan, U. (1998). Query semantics for content-based retrieval of video data: An empirical investigation. Storage and Retrieval Issues in Image and Multimedia Databases, in conjunction with the Ninth International Conference DEXA'98, August 24-28, Vienna, Austria.
Meng, J., Juan, Y., & Chang, S.-F. (1995). Scene change detection in an MPEG compressed video sequence. SPIE Symposium on Electronic Imaging: Science & Technology - Digital Video Compression: Algorithms and Technologies, 2419, San Jose, California, February.
Metz, C. (1974). Film language: A semiotics of the cinema (trans. M. Taylor). The University of Chicago Press.
MPEG MAAATE (2000). The Australian MPEG Audio Analysis Tool Kit. Online at http://www.cmis.csiro.au/dmis/maaate/
Nepal, S., Srinivasan, U., & Reynolds, G. (2001). Automatic detection of goal segments in basketball videos. ACM Multimedia 2001, 261-269, Sept.-Oct.
Nepal, S., Srinivasan, U., & Reynolds, G. (2001). Semantic-based retrieval model for digital audio and video. IEEE International Conference on Multimedia and Expo (ICME 2001), August (pp. 301-304).
Pradhan, S., Tajima, K., & Tanaka, K. (2001). A query model to synthesize answer intervals from indexed video unit. IEEE Transactions on Knowledge and Data Engineering, 13(5), 824-838.
Rui, Y., Huang, T.S., & Chang, S.F. (1999). Image retrieval: Current techniques, promising directions and open issues. Journal of Visual Communication and Image Representation, 10, 39-62.
Schloss, G.A., & Wynblatt, M.J. (1994). Building temporal structures in a layered multimedia data model. ACM Multimedia, 271-278.
Smith, J.R., & Benitez, A.B. (2000). Conceptual modeling of audio-visual content. IEEE International Conference on Multimedia and Expo (ICME 2000), July-Aug. (p. 915).


Smith, J.R., & Chang, S.-F. (1997). Querying by color regions using the VisualSEEk content-based visual query system. In M.T. Maybury (Ed.), Intelligent multimedia information retrieval. IJCAI.
Srinivasan, U., Gu, L., Tsui, & Simpson-Young, B. (1997). A data model to support content-based search on digital video libraries. The Australian Computer Journal, 29(4), 141-147.
Srinivasan, U., Lindley, C., & Simpson-Young, B. (1999). A multi-model framework for video information systems. Database Semantics: Semantic Issues in Multimedia Systems, January (pp. 85-108). Kluwer Academic Publishers.
Srinivasan, U., Nepal, S., & Reynolds, G. (2001). Modelling high level semantics for video data management. Proceedings of ISIMP 2001, Hong Kong, May (pp. 291-295).
Tansley, R., Dobie, M., Lewis, P., & Hall, W. (1999). MAVIS 2: An architecture for content and concept based multimedia information exploration. ACM Multimedia, 203.
Truong, B.T., Dorai, C., & Venkatesh, S. (2001). Determining dramatic intensification via flashing lights in movies. International Conference on Multimedia and Expo, August 22-25, Tokyo (pp. 61-64).
Yap, K., Simpson-Young, B., & Srinivasan, U. (1996). Enhancing video navigation with existing alternate representations. First International Conference on Image Databases and Multimedia Search, Amsterdam, August.


Chapter 7

Continuous Media Web: Hyperlinking, Search and Retrieval of Time-Continuous Data on the Web


Silvia Pfeiffer, CSIRO ICT Centre, Australia Conrad Parker, CSIRO ICT Centre, Australia Andr Pang, CSIRO ICT Centre, Australia

Continuous Media Web:

The Continuous Media Web project has developed a technology to extend the Web to time-continuously sampled data enabling seamless searching and surfing with existing Web tools. This chapter discusses requirements for such an extension of the Web, contrasts existing technologies and presents the Annodex technology, which enables the creation of Webs of audio and video documents. To encourage uptake, the specifications of the Annodex technology have been submitted to the IETF for standardisation and open source software is made available freely. The Annodex technology permits an integrated means of searching, surfing, and managing a World Wide Web of textual and media resources.


INTRODUCTION
Nowadays, the main source of information is the World Wide Web. Its HTTP (Fielding et al., 1999), HTML (World Wide Web Consortium, 1999B), and URI (Berners-Lee et al., 1998) standards have enabled a scalable, networked repository of any sort of
information that people care to publish in textual form. Web search engines have enabled humanity to search for any information on any public Web server around the world. URI hyperlinks in HTML documents have enabled surfing to related information, giving the Web its full power. Repositories of information within organisations also build on these standards for much of their internal and external information dissemination. While Web searching and surfing has become a natural way of interacting with textual information to access its semantic content, no such thing is possible with media. Media on the Web is cumbersome to use: it is handled as dark matter that cannot be searched through Web search engines, and once a media document is accessed, only linear viewing is possible, with no browsing or surfing to other semantically related documents. Multimedia research in recent years has recognised this issue. One means to enable search on media documents is to automate the extraction of content, store the content as index information, and provide search facilities through that index information. This has led to extensive research on the automated extraction of metadata from binary media data, aiming at bridging the semantic gap between automatically extracted low-level image, video, and audio features and the high level of semantics that humans perceive when viewing such material (see, e.g., Dimitrova et al., 2002). It is now possible to create and store a large amount of metadata and semantic content from media documents, be that automatically or manually. But how do we exploit such a massive amount of information in a standard way? What framework can we build to satisfy the human need to search for content in media, to quickly find and access it for reviewing, and to manage and reuse it in an efficient way? As the Web is the most commonly used means of information access, we decided to develop a technology for time-continuous documents that enables their seamless integration into the Web's searching and surfing. Our research thus extends the World Wide Web, with its familiar information access infrastructure, to time-continuous media such as audio and video, creating a Continuous Media Web. The particular aims of our research are:

- to enable the retrieval of relevant clips of time-continuous documents through familiar textual queries in Web search engines,
- to enable the direct addressing of relevant clips of time-continuous documents through familiar URI hyperlinks,
- to enable hyperlinking to other relevant and related Web resources while reviewing a time-continuous document, and
- to enable the automated reuse of clips of time-continuous documents.

This chapter presents our Annodex (annotation and indexing) technology, the specifications of which have been published at the IETF (Internet Engineering Task Force) as Internet-Drafts for the purposes of international standardisation. Implementations of the technology are available at http://www.annodex.net/. In the next section we present related work and its shortcomings with respect to our aims. We then explain the main principles that our research and development work adheres to. The subsequent section provides a technical description of the Continuous Media Web (CMWeb) project and thus forms the heart of this book chapter. We round it off with a view of the research opportunities created by the CMWeb, and conclude the chapter with a summary.

BACKGROUND
The World Wide Web was created by three core technologies (Berners-Lee et al., 1999): HTML, HTTP, and URIs. They respectively enable:

- the markup of textual data, integrated with the data itself, giving it a structure, metadata, and outgoing hyperlinks,
- the distribution of Web documents over the Internet, and
- the hyperlinking to and into Web documents.

In an analogous way, what is required to create a Web of time-continuous documents is:

- a markup language to create an addressable structure, searchable metadata, and outgoing hyperlinks for a continuous media document,
- an integrated document format that can be distributed via HTTP, making use of the existing caching HTTP proxy infrastructure, and
- a means to hyperlink into a continuous media document.

One would expect that the many existing standardisation efforts in multimedia cover these requirements. However, while the required pieces may exist, they are not packaged and optimised for addressing these issues and solving them in such a way as to make use of the existing Web infrastructure with the least necessary adaptation effort. Here we look at the three most promising standards: SMIL, MPEG-7, and MPEG-21.

SMIL
The W3C's SMIL (World Wide Web Consortium, 2001), short for Synchronized Multimedia Integration Language, is an XML markup language used for authoring interactive multimedia presentations. A SMIL document describes the sequence of media documents to play back, including conditional playback, loops, and automatically activated hyperlinks. SMIL has outgoing hyperlinks, and elements inside it can be addressed using XPath (World Wide Web Consortium, 1999A) and XPointer (World Wide Web Consortium, 2002). The features of SMIL cover the following modules:

1. Animation: provides for incorporating animations onto a time line.
2. Content Control: provides for runtime content choices and prefetch delivery.
3. Layout: allows positioning of media elements on the visual rendering surface and control of audio volume.
4. Linking: allows navigation through the SMIL presentation, which can be triggered by user interaction or other triggering events. SMIL 2.0 provides only for in-line link elements.

5. Media Objects: describes media objects that come in the form of hyperlinks to animations, audio, video, images, streaming text, or text. Restrictions of continuous media objects to temporal subparts (clippings) are possible, and short and long descriptions may be attached to a media object.
6. Metainformation: allows description of SMIL documents and attachment of RDF metadata to any part of the SMIL document.
7. Structure: structures a SMIL document into a head and a body part, where the head part contains information that is not related to the temporal behaviour of the presentation and the body tag acts as a root for the timing tree.
8. Timing and Synchronization: provides for different choreographing of multimedia content through timing and synchronization commands.
9. Time Manipulation: allows manipulation of the time behaviour of a presentation, such as control of the speed or rate of time for an element.
10. Transitions: provides for transitions such as fades and wipes.
11. Scalability: provides for the definition of profiles of SMIL modules (1-10) that meet the needs of a specific class of client devices.

SMIL is designed for creating interactive multimedia presentations, not for setting up Webs of media documents. A SMIL document may result in a different experience for every user and therefore is not a single, temporally addressable, time-continuous document. Thus, addressing temporal offsets does not generally make sense for a SMIL document. SMIL documents cannot generally be searched for clips of interest, as they don't typically contain the information required by a Web search engine: SMIL does not focus on including metadata, annotations and hyperlinks, and thus does not provide the information necessary to be crawled and indexed by a search engine. In addition, SMIL does not integrate the media documents required for its presentation in one single file, but instead references them from within the XML file. All media data is only referenced, and there is no transport format for a presentation that includes all the relevant metadata, annotations, and hyperlinks interleaved with the media data to provide a streamable format. This would not make sense anyway, as some media data that is referenced in a SMIL file may never be viewed by users, since they may never activate the appropriate action. A SMIL interaction's media streams will be transported on connections separate from the initial SMIL file, requiring the client to perform all the media synchronisation tasks, and proxy caching can happen only on each file separately, not on the complete interaction. Note, however, that a single SMIL interaction, if recorded during playback, can become a single time-continuous media document, which can be treated with our Annodex technology to enable it to be searched and surfed. This may be interesting for archiving and digital record-keeping.

MPEG-21
The ISO/MPEG MPEG-21 (Burnett et al., 2003) standard is building an open framework for multimedia delivery and consumption. It thus focuses on addressing how to generically describe a set of content documents that belong together from a semantic point of view, including all the information necessary to provide services on these digital
items. This set of documents is called a Digital Item, which is a structured representation in XML of a work, including identification and metadata information. The representation of a Digital Item may be composed of the following descriptors:

1. Container: a structure that groups items and/or containers.
2. Item: a group of subitems and/or components bound to relevant descriptors.
3. Component: binds a resource to a set of descriptors including control or structural information of the resource.
4. Anchor: binds a descriptor to a fragment of a resource.
5. Descriptor: associates information (i.e., text or a component) with the enclosing element.
6. Condition: makes the enclosing element optional and links it to the selection(s) that affect its inclusion.
7. Choice: a set of related selections that can affect an item's configuration.
8. Selection: a specific decision that affects one or more conditions somewhere within an item.
9. Annotation: a set of information about an identified element of the model.
10. Assertion: a fully or partially configured state of a choice, asserting true/false/undecided for the predicates associated with the selections for that choice.
11. Resource: an individually identifiable asset such as a video clip, audio clip, image or textual asset, or even a physical object, locatable via an address.
12. Fragment: identifies a specific point or range within a resource.
13. Statement: a literal text item that contains information, but is not an asset.
14. Predicate: an identifiable declaration that can be true/false/undecided.

MPEG-21 further provides for the handling of rights associated with Digital Items, and for the adaptation of Digital Items to usage environments. As an example of a Digital Item, consider a music CD album. When it is turned into a Digital Item, the album is described in an XML document that contains references to the cover image, the text on the CD cover, the text of an accompanying brochure, references to a set of audio files that contain the songs on the CD, ratings of the album, rights associated with the album, information on the different encoding formats in which the music can be retrieved, the different bitrates that can be supported when downloading, and so on. This description supports the handling of a digital CD album as an object: it allows you to manage it as an entity, describe it with metadata, exchange it with others, and collect it as an entity. An MPEG-21 document does not typically describe just one time-continuous document, but rather several. These descriptions are temporally addressable, and hyperlinks can go into and out of them. Metadata can be attached to the descriptions of the documents, making them searchable and indexable for search engines. As can be seen, MPEG-21 addresses the problem of how to handle groups of files rather than focusing on the markup of a single media file, and therefore does not address how to directly link into time-continuous Web resources themselves. There is an important difference between linking into and out of descriptions of a time-continuous document and linking into and out of a time-continuous document itself: integrated handling provides for cacheability and for direct URI access.


The aims of MPEG-21 are orthogonal to the aims that we pursue. While MPEG-21 enables a better handling of collections of Web resources that belong together in a semantic way, Annodex enables a more detailed handling of time-continuous Web resources only. Annodex provides a granularity of access into time-continuous resources that an MPEG-21 Digital Item can exploit in its descriptions of collections of Annodex and other resources.

MPEG-7
ISO/MPEG's MPEG-7 (Martinez et al., 2002) standard is an open framework for describing multimedia entities, such as image, video, audio, audiovisual, and multimedia content. It provides a large set of description schemes to create markup in XML format. MPEG-7 description schemes can provide the following features:

1. Specification of links and locators (such as time, media locators, and referencing description tools).
2. Specification of basic information such as people, places, textual annotations, controlled vocabularies, etc.
3. Specification of the spatio-temporal structure of multimedia content.
4. Specification of audio and visual features of multimedia content.
5. Specification of the semantic structure of multimedia content.
6. Specification of the multimedia content type and format for management.
7. Specification of media production information.
8. Specification of media usage (rights, audience, financial) information.
9. Specification of classifications for multimedia content.
10. Specification of user information (user description, user preferences, usage history).
11. Specification of content entities (still regions, video/audio/audiovisual segments, multimedia segments, ink content, structured collections).
12. Specification of content abstractions (semantic descriptions, media models, media summaries, media views, media variations).

The main intended use of MPEG-7 is for describing multimedia assets such that they can be queried or filtered. Just like SMIL and MPEG-21, the MPEG-7 descriptions are regarded as completely independent of the content itself. An MPEG-7 document is an XML file that contains any sort of meta-information related to a media document. While the temporal structure of a media document can be represented, this is not the main aim of MPEG-7 and not typically the basis for attaching annotations and hyperlinks. This is the exact opposite of our approach, where the basis is the media document and its temporal structure. Much MPEG-7 markup is in fact not time-related and thus does not describe media content at the granularity we focus on. Also, MPEG-7 does not attempt to create a temporally interleaved document format that integrates the markup with the media data. Again, the aims of MPEG-7 and Annodex are orthogonal. As MPEG-7 is a format that focuses on describing collections of media assets, it is a primarily database-driven approach towards the handling of information, while Annodex comes from a background of Web-based, and therefore network-based, handling of media streams. A specialisation (or, in MPEG-7 terms, a profile) of MPEG-7 description schemes may allow the creation of annotations similar to the ones developed by us, but the transport-based, interleaved document format that integrates the markup with the media data in a streamable fashion is not generally possible with MPEG-7 annotations. Annotations created in MPEG-7 may, however, be referenced from inside an Annodex format bitstream, and some may even be included directly in the markup of an Annodex format bitstream through the meta and desc tags.

THE CHALLENGE
The technical challenge for the development of Annodex (Annodex.net, 2004) was the creation of a solution to the three issues presented earlier: create an HTML-like markup language for time-continuous data that can be interleaved with the media stream to create a searchable media document, and create a means to hyperlink by temporal offset into the time-continuous document. We have developed three specifications:

1. CMML (Pfeiffer, Parker, & Pang, 2003a), the Continuous Media Markup Language, which is based on XML and provides tags to mark up time-continuous data into sets of annotated temporal clips. CMML draws upon many features of HTML.
2. Annodex (Pfeiffer, Parker, & Pang, 2003b), the binary stream format used to store and transmit interleaved CMML and media data.
3. Temporal URIs (Pfeiffer, Parker, & Pang, 2003c), which enable hyperlinking to temporally specified sections of an Annodex resource.


Aside from the above technical requirements, the development of these technologies has been led by several principles and non-technical requirements. It is important to understand these constraints, as they have strongly influenced the final format of the solution.

- Hook into existing Web infrastructure: The Annodex technologies have been designed to hook straight into the existing Web infrastructure with as few adaptations as necessary. Also, the scalability property of the Web must not be compromised by the solution. Thus, CMML is very similar to HTML, temporal URI queries are CGI (Common Gateway Interface) style parameters (NCSA HTTPd Development Team, 1995), temporal URI fragments are like HTML fragments, and Annodex streams are designed to be cacheable by Web proxies.
- Open Standards: The aim of the CMWeb project is to extend the existing World Wide Web to time-continuous data such as audio and video and create a more powerful networked worldwide infrastructure. Such a goal can only be achieved if the different components that make up the infrastructure interoperate, even when created by different providers. Therefore, all core specifications are being published as open international standards with no restraining patent issues.
- The Annodex Trademark: For an open standard, interoperability of different implementations is crucial to its success. Any implementation that claims to implement the specification but is not conformant, and thus not interoperable, will be counterproductive to the creation of a common infrastructure. Therefore, registering a Trademark on the word Annodex enables us to stop non-conformant implementations from claiming to be conformant by using the same name.
- Free media codecs: For the purpose of standardisation it is important that Internet-wide usage is encouraged to use media codecs for which no usage restrictions exist. The codecs must be legal to use in all Internet-connected devices and compatible with existing Web infrastructure. This, however, does not mean that the technology is restricted to specific codecs; on the contrary, Annodex works for any time-continuously sampled digital data file. Please also note that we do not develop codecs ourselves, but rather provide recommendations for which codecs to support.
- Open Source: Open standards require reference implementations that people can learn from and make use of for building up the infrastructure. Therefore, the reference software should be published as open source software. According to Tim Berners-Lee, this was essential to the development and uptake of the Web (Berners-Lee et al., 1999).
- Device Independence: As the convergence of platforms continues, it is important to design new formats such that they can easily be displayed and interacted with on any networked device, be that a huge screen or a small handheld device screen. Therefore, Annodex is being designed to work independently of any specific features an end device may have.
- Generic Metadata: Metadata for time-continuous data can come in many different structured or unstructured schemes. It can be automatically or manually extracted, follow the standard Dublin Core metadata scheme (Dublin Core Metadata Initiative, 2003), or follow a company-specific metadata scheme. Therefore, it is important to specify the metadata types in a generic manner to allow free text and any set of name-value pairs as metadata. It must be possible to develop more industry-specific sets of metadata schemes later and make full use of them in Annodex.
- Simplicity: Above all, the goal of Annodex is to create a very simple set of tools and formats for enabling time-continuous Web resources with the same powerful means of exploration as text content on the Web.

These principles stem from a desire to make simple standards that can be picked up and integrated quickly into existing Web infrastructure.


Surfing and Searching


THE SOLUTION

The double aim of Annodex is to enable Web users to (1) view, access and hyperlink between clips of time-continuous documents in the same simple but powerful way as HTML pages, and (2) search for clips of time-continuous documents through the common Web search engines and retrieve clips relevant to their query.

Figures 1 and 2 show screen shots of an Annodex Web browser and an Annodex search engine. The Annodex browser's main window displays the media data, typical Web browser buttons and fields (at the top), typical media transport buttons (at the bottom), and a representative image (also called a keyframe) and hyperlink for the currently displayed media clip. The story board next to the main window displays the list of clips that the current resource consists of, enabling direct access to any clip in this table of contents. The separate window on the top right displays the free-text annotation stored in the description for the current clip, while the one on the lower right displays the structured metadata stored for the resource or for the current clip. When crawling this particular resource, a Web search engine can index all this textual information. The Annodex search engine displayed in Figure 2 is a standard Web search engine extended with the ability to crawl and index the markup of Annodex resources. It retrieves clips that are relevant to the user's query and presents ranked search results based on the relevance of the markup of the clips. The keyframe of the clip and its description are displayed.

Figure 1. Browsing a video about CSIRO astronomy research


Figure 2. Searching for radio galaxies in CSIRO's science CMWeb

Architecture Overview
As the Continuous Media Web technology has to be interoperable with existing Web technology, its architecture must be the same as that of the World Wide Web (see Figure 3): a Web client issues a URI request over HTTP to a Web server, which resolves it and serves the requested resource back to the client. In the case where the client is a Continuous Media Web browser, the request will be for an Annodex file, which contains all the relevant markup and media data to display the content to the user.


Figure 3. Continuous Media Web Architecture

In the case where the client is a Web crawler (e.g., part of a Web search engine), the client may add an HTTP Accept request header expressing a preference for receiving only the CMML markup and not the media data. This is possible because the CMML markup represents all the textual content of an Annodex file and is thus a thin representation of the full media data. In addition, it is a bandwidth-friendly means of crawling and indexing media content, which is very important for the scalability of the solution.

Annodex File Format

Annodex is the format in which media with interspersed CMML markup is transferred over the wire. Analogous to a normal Web server offering a collection of HTML pages to clients, an Annodex server offers a collection of Annodex files. After a Web client has issued a URI request for an Annodex resource, the Web server delivers the Annodex resource, or an appropriate subpart of it according to the URI query parameters. Annodex files conceptually consist of multiple media streams and one CMML annotation stream, interleaved in a temporally synchronised way. The annotation stream may contain several sets of clips that provide alternative markup tracks for the Annodex file. The media streams may be complementary, such as an audio track with a video track, or alternative, such as two speech tracks in different languages. Figure 4 shows an example Annodex file with three media tracks (light coloured bars) and an annotation track with a header describing the complete file (dark bar at the start) and several interspersed clips. One way to author Annodex files is by creating a CMML markup file and encoding the media data together with the markup based on the authoring instructions found in the CMML file. Figure 5 displays the principle of the Annodex file creation process: the header information of the CMML file and the media streams are encoded at the start of the Annodex file, while the clips and the actual encoded media data are appended thereafter in a temporally interleaved fashion.
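The temporal interleaving sketched in Figure 5 can be pictured as a merge of two time-ordered streams. The following Python sketch illustrates only that principle; it is not the actual Annodex encoder, and the clip and packet data are hypothetical.

import heapq

def interleave(clips, media_packets):
    # clips: (start_time, clip_markup) tuples taken from a CMML file
    # media_packets: (timestamp, packet_bytes) tuples from the encoded media tracks
    # Header information (the CMML head and the track headers) would be written
    # first; here we only show the time-ordered interleaving of the remaining data.
    return list(heapq.merge(clips, media_packets, key=lambda item: item[0]))

clips = [(0.0, "<clip id='intro'>"), (15.0, "<clip id='findingGalaxies'>")]
packets = [(0.0, b"frame-0"), (0.04, b"frame-1"), (15.02, b"frame-375")]
for timestamp, item in interleave(clips, packets):
    print(timestamp, item)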


Figure 4. An example Annodex file, time increasing from left to right

The choice of a binary encapsulation format for Annodex files was one of the challenges of the CMWeb project. We examined several different encapsulation formats and came up with a list of requirements:
1. the format had to provide framing for binary media data and XML markup,
2. temporal synchronisation between media data and XML markup was necessary,
3. the format had to provide a temporal track paradigm for interleaving,
4. the format had to have streaming capabilities,
5. for fault tolerance, resynchronisation after a parsing error should be simple,
6. seeking landmarks were necessary to allow random access,
7. the framing information should only yield a small overhead, and
8. the format needed to be simple to allow handling on devices with limited capabilities.

Hierarchical formats like MPEG-4 and QuickTime did not qualify due to requirement 2, making it hard to also provide for requirements 3 and 4.

Figure 5. Annodex file creation process


An XML-based format also did not qualify because binary data cannot be included in XML tags without encoding it in base64, which inflates the data size by about 30% and creates unnecessary additional encoding and decoding steps, thus violating requirements 1 and 7. After some discussion, we adopted the Ogg encapsulation format (Pfeiffer, 2003) developed by Xiphophorus (Xiphophorus, 2004). That gave us the additional advantage of having open source libraries available on all major platforms, greatly simplifying the task of rolling out format support.
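The base64 size penalty mentioned above is easy to verify; the following snippet (illustrative only) encodes some random bytes and reports an inflation factor of roughly 4/3.

import base64, os

raw = os.urandom(30000)            # stand-in for binary media data
encoded = base64.b64encode(raw)
print(len(encoded) / len(raw))     # approximately 1.33, i.e. about a third larger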

The Continuous Media Markup Language CMML


CMML is simple to understand, as it is HTML-like, though oriented towards a segmentation of continuous data along its time axis into clips. A sample CMML file is given below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE cmml SYSTEM "cmml.dtd">
<cmml>
  <stream timebase="0" utc="20040114T153500.00Z">
    <import src="galaxies.mpg" contenttype="video/mpeg" start="npt:0"/>
  </stream>
  <head>
    <title>Hidden Galaxies</title>
    <meta name="author" content="CSIRO"/>
  </head>
  <clip id="findingGalaxies" start="15">
    <a href="http://www.aao.gov.au/galaxies.anx#radio">
      Related video on Detection of Galaxies</a>
    <img src="galaxy.jpg"/>
    <desc>What's out there? ...</desc>
    <meta name="KEYWORDS" content="Radio Telescope, Galaxies"/>
  </clip>
</cmml>

As the sample file shows, CMML has XML syntax, consisting of three main types of tags: at most one stream tag, exactly one head tag, and an arbitrary number of clip tags. The stream tag is optional. It describes, in its import tags, the input bitstreams necessary for the creation of an Annodex file, and gives some timing information necessary for the output Annodex file. The imported bitstreams will be interleaved into multiple tracks of media, even if they start at different time offsets and need to be temporally realigned through the start attribute. The markup of a head tag in the CMML document contains information about the complete media document. Its essential information comprises:

structured textual annotations in meta tags, and unstructured textual annotations in the title tag.

Structured annotations are name-value pairs which can follow a new or existing metadata annotation scheme such as the Dublin Core (Dublin Core Metadata Initiative, 2003). The markup of a clip tag contains information on the various clips or fragments of the media:

Anchor points provide entry points into the media document that a URI can refer to. Anchor points identify the start time and the name (id) of a clip. This enables URIs to refer to Annodex clips by name.

URI hyperlinks can be attached to a clip, linking out to any other place a URI can point to, such as clips in other annodexed media or HTML pages. These are given by the a (anchor) tag with its href attribute. Furthermore, the a tag contains a textual annotation of the link, the so-called anchor text (in the example above: "Related video on Detection of Galaxies"), specifying why the clip is linked to a given URI. Note that this is similar to the a tag in HTML.

An optional keyframe in the img tag provides a representative image for the clip and enables display of a story board for Annodex files.

Unstructured textual annotations in the desc tags provide for searchability of Annodex files. Unstructured annotation is free text that describes the clip itself.

Each clip belongs to a specific set of temporally non-overlapping clips that make up one track of annotations for a time-continuous data file. The track attribute of a clip provides this attribution; if it is not specified, the clip belongs to the default track. Using the above sample CMML file for authoring Annodex, the result will be a galaxies.anx file of the form given in Figure 6.

Specifying Time Segments and Clips in URIs


Linking to Time Segments and Clips in URIs
A URI points to a Web resource and is the primary mechanism on the Web to reach information. Time-continuous Web resources are typically large data files. Thus, when a Web user wants to link to the exact segment of interest within a time-continuous resource, it is desirable that only that segment is transferred. This reduces network load and user waiting time. No standardised scheme is currently available to directly link to segments of interest in a time-continuous Web resource. However, addressing of subparts of Web resources is generally achieved through URI query specifications. Therefore, we defined a query scheme to allow direct addressing of segments of interest in Annodex files. Two fundamentally different ways of addressing information in Annodex resources are necessary: addressing of clips and addressing of time offsets or time segments.


Figure 6. Annodex file created from the sample CMML file

Linking to Clips
Clips in Annodex files are identified by their id attribute. Thus, accessing a named clip in an Annodex (and, for that matter, a CMML) file is achieved with the following CGI-conformant query parameter specification:

id=clip_id

Examples for accessing a clip in the above given sample CMML and Annodex files are:

http://www.annodex.net/galaxies.cmml?id=findingGalaxies
http://www.annodex.net/galaxies.anx?id=findingGalaxies

On the Annodex server, the CMML and Annodex resources will be pre-processed as a result of this query before being served out: the file header parts will be retained, the time basis will be adjusted, and the queried clip data will be concatenated at the end to regain conformant file formats.
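A server or client can pick the clip name out of such a URI with standard query parsing; the sketch below uses Python's urllib purely as an illustration.

from urllib.parse import urlparse, parse_qs

uri = "http://www.annodex.net/galaxies.anx?id=findingGalaxies"
params = parse_qs(urlparse(uri).query)
clip_id = params.get("id", [None])[0]
print(clip_id)    # findingGalaxies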

Linking to Time Segments


It is also desirable to be able to address any arbitrary time segment of an Annodex or CMML file. This is again achieved with a CGI-conformant query parameter specification:

t=[time-scheme:]time_interval


Available time schemes are npt for normal play time, different smpte specifications of the Society of Motion Picture and Television Engineers (SMPTE), and clock for a Coordinated Universal Time (UTC) specification. For more details see the specification document (Pfeiffer, Parker, & Pang, 2003c). Examples for requesting one or several time intervals from the above given sample CMML and Annodex files are:

http://www.annodex.net/galaxies.cmml?t=85.28
http://www.annodex.net/galaxies.anx?t=npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx?t=smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx?t=clock:20040114T153045.25Z

Where only a single time point is given, it is interpreted as the time interval from that time point onwards until the end of the stream. The same pre-processing as described above will be necessary on the Annodex server.
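A sketch of how such a time query value might be split into its scheme and intervals is given below; it is a simplified illustration and does not implement the full grammar of the temporal URI specification.

def parse_time_query(value):
    # Split an optional scheme prefix ("npt", "smpte-*", "clock") from the intervals.
    scheme, spec = "npt", value
    if ":" in value:
        head, _, tail = value.partition(":")
        if head in ("npt", "clock") or head.startswith("smpte"):
            scheme, spec = head, tail
    intervals = []
    for part in spec.split(","):
        start, dash, end = part.partition("-")
        intervals.append((start, end if dash else None))  # None: until end of stream
    return scheme, intervals

print(parse_time_query("npt:15.6-85.28,100.2"))
# ('npt', [('15.6', '85.28'), ('100.2', None)])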

Restricting Views to Time Segments and Clips in URIs


Aside from the query mechanism, URIs also provide a mechanism to address subparts of Web resources locally on a Web client: URI fragment specifications. We have found that fragments are a great mechanism to restrict views on Annodex files to a specific subpart of the resource, e.g. when viewing or editing a temporal subpart of an Annodex document. Again, two fundamentally different ways of restricting a time-continuous resource are required: views on a clip and views on time segments.

Views on Clips
Restricting the view on an Annodex (or CMML) file to a named clip makes use of the value of the clip's id attribute in a fragment specification:

#clip_id

Examples for local clip views for the above given sample CMML and Annodex files are:

http://www.annodex.net/galaxies.cmml#findingGalaxies
http://www.annodex.net/galaxies.anx#findingGalaxies

The Web client that is asked for such a resource will ask the Web server for the complete resource and perform its application-specific operation on the clip only. This may, for example, result in a sound editor downloading a complete sound file and then selecting the named clip for further editing. An Annodex browser would naturally behave analogously to an existing Web browser that receives an HTML page with a fragment offset: it will fast-forward to the named clip as soon as that clip has been received.

Views on Time Segments


Analogously to clip views, views can be restricted to time intervals with the following specification:

#[time-scheme:]time_interval

Examples for restrictions to one or several time intervals from the above given sample CMML and Annodex files are:

http://www.annodex.net/galaxies.cmml#85.28
http://www.annodex.net/galaxies.anx#npt:15.6-85.28,100.2
http://www.annodex.net/galaxies.anx#smpte-25:00:01:25:07
http://www.annodex.net/galaxies.anx#clock:20040114T153045.25Z

Where only a single time point is given, it is interpreted as the time interval from that time point onwards until the end of the stream. The same usage examples as described above apply in this case, too. Specifying several time segments may make sense only in specific applications, such as an editor, where an unconnected selection for editing may result.

FEATURES OF ANNODEX
While developing the Annodex technology, we discovered that the Annodex file format addresses many challenges of media research that were not part of the original goals of its development but came to it with serendipity. Some of these are discussed briefly in this section.

Multitrack Media File Format


The Annodex file format is based on the Xiph.org Ogg file format (Pfeiffer, 2003), which allows multiple time-continuous data tracks to be encapsulated in one interleaved file format. We have extended the file format such that it can be parsed and handled without having to decode any of the data tracks themselves, making Annodex a generic multitrack media file format. To that end, we defined a generic data track header page which includes a Content-type field that identifies the codec in use and provides some general attributes of the track, such as its temporal resolution. The multitrack file format now has three parts:
1. Data track identifying header pages (primary header pages)
2. Codec header pages (secondary header pages)
3. Data pages

For more details, refer to the Annodex format specification document (Pfeiffer, Parker, & Pang, 2003b). A standardised multitrack media format is currently non-existent; many applications, amongst them multitrack audio editors, will be able to take advantage of it, especially since the Annodex format also allows inclusion of arbitrary meta information.


Multitrack Annotations
CMML and Annodex have been designed to provide a means of annotating and indexing time-continuous data files by structuring their time-line into regions of interest called clips. Each clip may have structured and unstructured annotations, a hyperlink and a keyframe. A simple partitioning, however, does not allow for several different, potentially overlapping subdivisions of the time-line into clips. After considering several different solutions for such different subdivisions, we decided to adopt a multitrack paradigm for annotations as well:

every clip of an Annodex or CMML file belongs to one specific annotation track,
clips within one annotation track cannot overlap temporally,
clips on different tracks can overlap temporally as needed, and
the attribution of a clip to a track is specified through its track attribute; if it is not given, the clip is attributed to a default track.

This is a powerful concept and can easily be represented in browsers by providing a choice of the track that is visible.

Internationalisation Support
CMML and Annodex have also been designed to be language-independent and provide full internationalisation support. There are two issues to consider for text in CMML elements: different character sets and different languages. As CMML is an XML markup language, different character sets are supported through the encoding attribute of the XML processing instruction, which contains a file-specific character set (World Wide Web Consortium, 2000). A potentially differing character set for an imported media file will be specified in the contenttype attribute of the import tag as a parameter to the MIME type. Any tag or attribute that could end up containing text in a different language to the other tags may specify its own language. This is only necessary for tags that contain human-readable text. The language is specified in the lang and dir attributes.

Search Engine Support


Web search engines are powerful tools to explore the textual information published on Web servers. The principle they work from is that crawling hyperlinks found within known Web pages will lead them to more Web pages, and eventually to most of the Web's content. For all Web resources they can build a search index of their textual contents and use it for retrieval of a hyperlink in response to a search query. With binary time-continuous data files, indexing was previously not possible. However, Annodex allows the integration of time-continuous data files into the crawling and indexing paradigm of search engines by providing CMML files. A CMML file represents the complete annotation of an Annodex file, with HTML-style anchor tags in its clip tags that enable crawling of Annodex files. Indexing can then happen on the level of the complete file or on the level of individual clips. For the complete file, the tags in the head element (title and meta tags) will be indexed, whereas for clips, the tags in the clip elements (desc and meta tags) are necessary. The search result display should then show the descriptive content of the title and desc tags, and the representative keyframe given in the img tag, to provide a nice visual overview of the retrieved clip (see Figure 2).

For retrieval of the CMML file encapsulated in an Annodex file from an Annodex server, HTTP's content-type negotiation is used. The search engine only needs to include in its HTTP request an Accept header with a higher priority on text/x-cmml than on application/x-annodex, and a conformant Annodex server will provide the extracted CMML content for the given Annodex resource.
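For instance, a crawler written in Python might request the CMML view of the sample resource as follows; this is an illustrative sketch only, using the media types named in this chapter.

import urllib.request

request = urllib.request.Request(
    "http://www.annodex.net/galaxies.anx",
    headers={"Accept": "text/x-cmml, application/x-annodex;q=0.1"},
)
with urllib.request.urlopen(request) as response:
    # A conformant Annodex server answers with the extracted CMML markup.
    print(response.headers.get("Content-Type"))
    cmml = response.read()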

Caching Web Proxies


HTTP defines a mechanism to cache byte ranges of files in Web proxies. With Annodex files, this mechanism can be used to also cache time intervals or clips of time-continuous data files, which are commonly large files. To that end, the Web server must provide a mapping of the clip or the time intervals to byte ranges. Then, the Web proxy can build up a table of the ranges that it caches for a particular Annodex resource. If it receives an Annodex resource request for a time interval or clip that it already stores, it can serve the data straight out of its cache. Just like the Web server, it may, however, need to process the resource before serving it: the file header parts need to be prepended to the data, the timebase needs to be adjusted, and the queried data needs to be concatenated at the end to regain a conformant Annodex file format. As Annodex allows parsing of files without decoding, this is a fairly simple operation, enabling a novel use of time-continuous data on the Web.

Dynamic Annodex Creation


Current Web sites use scripting extensively to automatically create HTML content with up-to-date information extracted from databases. As Annodex and CMML provide clip-structured media data, it is possible to create Annodex content by scripting. The annotation and indexing information of a clip may then be stored in a metadata database with a reference to the clip file. A script can then select clips by querying the metadata database and create an Annodex file on the fly. News bulletins and video blogs are examples of applications that can be built with such functionality.
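A minimal sketch of such a script is given below; the clip records and field names are hypothetical, and the generated CMML skeleton is simplified (a real deployment would pass it, together with the media data, to an Annodex authoring tool).

from xml.sax.saxutils import escape, quoteattr

def clips_to_cmml(title, clips):
    parts = ['<?xml version="1.0" encoding="UTF-8"?>',
             "<cmml>",
             "  <head><title>%s</title></head>" % escape(title)]
    for clip in clips:  # e.g. rows returned by a metadata database query
        parts.append("  <clip id=%s start=%s>" %
                     (quoteattr(clip["id"]), quoteattr(str(clip["start"]))))
        parts.append("    <desc>%s</desc>" % escape(clip["desc"]))
        parts.append("  </clip>")
    parts.append("</cmml>")
    return "\n".join(parts)

print(clips_to_cmml("Evening news", [
    {"id": "headlines", "start": "0", "desc": "Today's headlines"},
    {"id": "weather", "start": "310", "desc": "The weather report"},
]))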

RESEARCH OPPORTUNITIES
There are a multitude of open research opportunities related to Annodex, some of which are mentioned in this section. Further research is necessary to explore transcoding of metadata. A multitude of different markup languages for different kinds of time-continuous data already exist. CMML is a generic means to provide structured and unstructured annotations on clips and media files. Many of the existing ways to mark up media may be transcoded into CMML and thus utilise the power of Annodex. Transcoding is simple to implement for markup that is also based on XML, because XSLT (World Wide Web Consortium, 1999C) provides a good tool to implement such scripts.
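As an illustration, applying such a transcoding stylesheet could look like the following sketch; the file names are hypothetical and the third-party lxml package is assumed to be available.

from lxml import etree

stylesheet = etree.XSLT(etree.parse("to_cmml.xsl"))    # hypothetical XSLT that emits CMML
source_markup = etree.parse("existing_markup.xml")     # hypothetical existing media markup
cmml_document = stylesheet(source_markup)
print(str(cmml_document))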

Transcoding of metadata directly leads to the question of interoperability with other standards. MPEG-7 is such a metadata standard for which it is necessary to explore transcoding; however, MPEG-7 is more than just textual metadata, and there may be more to find. Similarly, interoperability of Annodex with standards like RTP/RTSP (Schulzrinne et al., 1996, 1998), DVD (dvdforum, 2000), MPEG-4 (MPEG Industry Forum, 2002), and MPEG-21 (Burnett et al., 2003) will need to be explored. Another question that frequently emerges for Annodex is that of annotating and indexing regions of interest within a video's imagery. We decided that structuring the spatial domain is out of scope for the Annodex technologies and may be revisited at a future time. Annodex is very specifically designed to solve problems for time-continuous data, and that data may not necessarily have a spatial domain (such as audio data). Also, on different devices the possible interactions have to be very simple, so, for example, selecting a spatial region while viewing a video on a mobile device is impractical. However, it may be possible for specific applications to use image maps with CMML clips to also hyperlink and describe in the spatial domain. This is an issue to explore in the future. Last but not least, there are many opportunities to apply and extend existing multimedia content analysis research to automatically determine CMML markup.

CONCLUSION
This chapter presented the Annodex technology, which brings the familiar searching and surfing capabilities of the World Wide Web to time-continuously sampled data (Pfeiffer, Parker, & Schremmer, 2003). At the core of the technology are the Continuous Media Markup Language CMML, the Annodex stream and file format, and clip- and time-referencing URI hyperlinks. These enable the extension of the Web to a Continuous Media Web with Annodex browsers, Annodex servers, and Annodex search engines. Annodex is, however, more powerful than that, as it also represents a standard multitrack media file format with multitrack annotations, which can be cached on Web proxies and used in Web server scripts for dynamic content creation. Therefore, Annodex and CMML present a Web-integrated means for managing multimedia semantics.

ACKNOWLEDGMENT
The authors gratefully acknowledge the comments, contributions, and proofreading of Claudia Schremmer, who is making use of the Continuous Media Web technology in her research on metadata extraction from meeting recordings.

REFERENCES
Annodex.net (2004). Open standards for annotating and indexing networked media. Retrieved January 2004 from http://www.annodex.net
Berners-Lee, T., Fielding, R., & Masinter, L. (1998, August). Uniform resource identifiers (URI): Generic syntax. Internet Engineering Task Force, RFC 2396. Retrieved January 2003 from http://www.ietf.org/rfc/rfc2396.txt
Berners-Lee, T., Fischetti, M., & Dertouzos, M.L. (1999). Weaving the Web: The original design and ultimate destiny of the World Wide Web by its inventor. San Francisco: Harper.

Burnett, I., Van de Walle, R., Hill, K., Bormans, J., & Pereira, F. (2003). MPEG-21: Goals and achievements. IEEE Multimedia Magazine, Oct-Dec, 60-70.
Dimitrova, N., Zhang, H.-J., Shahraray, B., Sezan, I., Huang, T., & Zakhor, A. (2002). Applications of video-content analysis and retrieval. IEEE Multimedia Magazine, July-Sept, 42-55.
Dublin Core Metadata Initiative (2003, February). The Dublin Core Metadata Element Set, v1.1. Retrieved January 2004 from http://dublincore.org/documents/2003/02/04/dces
dvdforum (2000, September). DVD Primer. Retrieved January 2004 from http://www.dvdforum.org/tech-dvdprimer.htm
Fielding, R., Gettys, J., Mogul, J., Nielsen, H., Masinter, L., Leach, P., & Berners-Lee, T. (1999, June). Hypertext Transfer Protocol -- HTTP/1.1. Internet Engineering Task Force, RFC 2616. Retrieved January 2004 from http://www.ietf.org/rfc/rfc2616.txt
Martínez, J.M., Koenen, R., & Pereira, F. (2002). MPEG-7: The generic multimedia content description standard. IEEE Multimedia Magazine, April-June, 78-87.
MPEG Industry Forum (2002, February). MPEG-4 users frequently asked questions. Retrieved January 2004 from http://www.mpegif.org/resources/mpeg4userfaq.php
NCSA HTTPd Development Team (1995, June). The Common Gateway Interface (CGI). Retrieved January 2004 from http://hoohoo.ncsa.uiuc.edu/cgi/
Pfeiffer, S. (2003, May). The Ogg encapsulation format version 0. Internet Engineering Task Force, RFC 3533. Retrieved January 2004 from http://www.ietf.org/rfc/rfc3533.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003a). The Continuous Media Markup Language (CMML), Version 2.0 (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-cmml-01.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003b). The Annodex annotation and indexing format for time-continuous data files, Version 2.0 (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-annodex-01.txt
Pfeiffer, S., Parker, C., & Pang, A. (2003c). Specifying time intervals in URI queries and fragments of time-based Web resources (BCP) (work in progress). Internet Engineering Task Force, December 2003. Retrieved January 2004 from http://www.annodex.net/TR/draft-pfeiffer-temporal-fragments-02.txt
Pfeiffer, S., Parker, C., & Schremmer, C. (2003). Annodex: A simple architecture to enable hyperlinking, search & retrieval of time-continuous data on the Web. Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval (MIR), Berkeley, California, November (pp. 87-93).
Schulzrinne, H., Casner, S., Frederick, R., & Jacobson, V. (1996, January). RTP: A transport protocol for real-time applications. Internet Engineering Task Force, RFC 1889. Retrieved January 2004 from http://www.ietf.org/rfc/rfc1889.txt
Schulzrinne, H., Rao, A., & Lanphier, R. (1998, April). Real Time Streaming Protocol (RTSP). Internet Engineering Task Force, RFC 2326. Retrieved January 2003 from http://www.ietf.org/rfc/rfc2326.txt
World Wide Web Consortium (1999A). XML Path Language (XPath). W3C XPath, November 1999. Retrieved January 2004 from http://www.w3.org/TR/xpath/


World Wide Web Consortium (1999B). HTML 4.01 Specification. W3C HTML, December 1999. Retrieved January 2004 from http://www.w3.org/TR/html4/
World Wide Web Consortium (1999C). XSL Transformations (XSLT) Version 1.0. W3C XSLT, November 1999. Retrieved January 2004 from http://www.w3.org/TR/xslt/
World Wide Web Consortium (2000, October). Extensible Markup Language (XML) 1.0. W3C XML. Retrieved January 2004 from http://www.w3.org/TR/2000/REC-xml-20001006
World Wide Web Consortium (2001, August). Synchronized Multimedia Integration Language (SMIL 2.0). W3C SMIL. Retrieved January 2004 from http://www.w3.org/TR/smil20/
World Wide Web Consortium (2002, August). XML Pointer Language (XPointer). W3C XPointer. Retrieved January 2004 from http://www.w3.org/TR/xptr/
Xiphophorus (2004). Building a new era of Open multimedia. Retrieved January 2004 from http://www.xiph.org/


Chapter 8

Management of Multimedia Semantics Using MPEG-7


Uma Srinivasan, CSIRO ICT Centre, Australia
Ajay Divakaran, Mitsubishi Electric Research Laboratories, USA

ABSTRACT

This chapter presents the ISO/IEC MPEG-7 Multimedia Content Description Interface standard from the point of view of managing semantics in the context of multimedia applications. We describe the organisation and structure of the MPEG-7 Multimedia Description Schemes, which are metadata structures for describing and annotating multimedia content at several levels of granularity and abstraction. As we look at MPEG-7 semantic descriptions, we realise that they provide a rich framework for static descriptions of content semantics. As content semantics evolves with interaction, the human user will have to compensate for the absence of detailed semantics that cannot be specified in advance. We explore the practical aspects of using these descriptions in the context of different applications and present some pros and cons from the point of view of managing multimedia semantics.

INTRODUCTION
MPEG-7 is an ISO/IEC standard that aims at providing a standard way to describe multimedia content, to enable fast and efficient searching and filtering of audiovisual content. MPEG-7 has a broad scope to facilitate functions such as indexing, management, filtering, authoring, editing, browsing, navigation, and searching of content descriptions.


The purpose of the standard is to describe the content in a machine-readable format for further processing determined by the application requirements. Multimedia content can be described in many different ways depending on the context, the user, the purpose of use and the application domain. In order to address the description requirements of a wide range of applications, MPEG-7 aims to describe content at several levels of granularity and abstraction, to include description of features, structure, semantics, models, collections and metadata about the content.

Initial research, focused on feature extraction techniques, influenced the description of content at the perceptual feature level. Examples of visual features that can be extracted using image-processing techniques are colour, shape and texture. Accordingly, there are several MPEG-7 Descriptors (Ds) to describe visual features. Similarly, there are a number of low-level Descriptors to describe audio content at the level of spectral, parametric and temporal features of an audio signal. While these Descriptors describe objective measures of audio and visual features, they are inadequate for describing content at a higher level of semantics, that is, for describing relationships among audio and visual descriptors within an image or over a video segment. This need is addressed through the construct called Multimedia Description Scheme (MDS), also referred to simply as Description Scheme (DS). Description Schemes are designed to describe higher-level content features such as regions, segments, objects and events, as well as metadata about the content, its usage, and so forth. Accordingly, there are several groups or categories of MDS tools.

An important factor that needs to be considered while describing audiovisual content is the recognition that humans interpret and describe the meaning of content in ways that go far beyond visual features and the cinematic constructs introduced in films. While such meanings and interpretations cannot be extracted automatically, because they are contextual, they can be described using free-text descriptions. MPEG-7 handles this aspect through several description schemes that are based on structured free-text descriptions.

As our focus is on the management of multimedia semantics, we look at MPEG-7 MDS constructs from two perspectives: (a) the level of granularity offered while describing content, and (b) the level of abstraction available to describe multimedia semantics. The second section provides an overview of the MPEG-7 constructs and how they hang together. The third section looks at MDS tools to manage multimedia semantics at multiple levels of granularity and abstraction. The fourth section takes a look at the whole framework from the perspective of different applications. The last section presents some discussion and conclusions.

MPEG-7 CONTENT DESCRIPTION AND ORGANISATION


The main elements of MPEG-7, as described in the MPEG-7 Overview document (Martínez, 2003), are a set of tools to describe the content, a language to define the syntax of the descriptions, and system tools to support efficient storage and transmission, execution and synchronization of binary encoded descriptions.


The Description Tools provide a set of Descriptors (D) that define the syntax and the semantics of each feature, and a library of Description Schemes (DS) that specify the structure and semantics of the relationships between their components, which may be both Descriptors and Description Schemes. A description of a piece of audiovisual content is made up of a number of Ds and DSs determined by the application. The description tools can be used to create such descriptions, which form the basis for search and retrieval. A Description Definition Language (DDL) is used to create and represent the descriptions. The DDL is based on XML and hence allows the processing of descriptions in a machine-readable format. Content descriptions created using these tools could be stored in a variety of ways. The descriptions could be physically located with the content in the same data stream or the same storage system, allowing efficient storage and retrieval. However, there could be instances where content and its descriptions are not co-located. In such cases, we need effective ways to synchronise the content and its descriptions. System tools support multiplexing of descriptions, synchronization issues, transmission mechanisms, file formats, and so forth. Figure 1 (Martínez, 2003) shows the main MPEG-7 elements and their relationships.

MPEG-7 has a broad scope and aims to address the needs of several types of applications (Vetro, 2001). MPEG-7 descriptions of content could include:

Information describing the creation and production processes of the content (director, title, short feature movie).
Information related to the usage of the content (copyright pointers, usage history, and broadcast schedule).
Information on the storage features of the content (storage format, encoding).
Structural information on spatial, temporal or spatiotemporal components of the content (example: scene cuts, segmentation in regions, region motion tracking).
Information about low-level audio and visual features in the content (example: colors, textures, sound timbres, melody description).

Figure 1. MPEG-7 elements

(Figure 1 depicts how Descriptors and Description Schemes are defined using the Description Definition Language, how Descriptors are structured into Description Schemes, and how descriptions are instantiated and then encoded and delivered.)


Figure 2. Overview of MPEG-7 multimedia description schemes (Martínez, 2003)


(Figure 2 groups the Multimedia DSs into: basic elements (schema tools, basic datatypes, link and media localization, basic tools); content management (media, creation and production, usage); content description (structural aspects, conceptual aspects); content organization (collection and classification, models); navigation and access (summaries, views, variations); and user interaction (user preferences, usage history).)

Conceptual information of the reality captured by the content (example: objects and events, interactions among objects).
Information about how to browse the content in an efficient way (example: summaries, variations, spatial and frequency subbands).
Information about collections of objects.
Information about the interaction of the user with the content (user preferences, usage history).

MPEG-7 Multimedia Description Schemes (MDS) are metadata structures for describing and annotating audiovisual content at several levels of granularity and abstraction (to describe what is in the content), as well as metadata (a description about the content). These Multimedia Description Schemes are described using XML to support readability at the human level and processing capability at the machine level. MPEG-7 Multimedia DSs are categorised and organised into the following groups: Basic Elements, Content Description, Content Management, Content Organization, Navigation and Access, and User Interaction. Figure 2 shows the different categories and presents a big-picture view of the Multimedia DSs.

Basic Elements
Basic elements provide the fundamental constructs used in defining MPEG-7 DSs. These include basic data types and a set of extended data types, such as vectors and matrices, to describe the features and structural aspects of the content. The basic elements also include constructs for linking media files, localising specific segments, describing time and temporal information, places, individuals, groups, organizations, and other textual annotations.

Content Description
MPEG-7 DSs for content description are organised into two categories: DSs for describing structural aspects, and DSs for describing conceptual aspects of the content. The structural DSs describe audiovisual content at a structural level organised around a segment. The Segment DS represents the spatial, temporal or spatiotemporal structure of an audiovisual segment. The Segment DS can be organised into a hierarchical structure to produce a table of contents for indexing and searching audiovisual content in a structured way. The segments can be described at different perceptual levels using Descriptors for colour, texture, shape, motion, and so on. The conceptual aspects are described using semantic DS, to describe objects, events and abstract concepts. The structure DSs and semantic DSs are related by a set of links that relate different semantic concepts to content structure. The links relate semantic concepts to instances within the content described by the segments. Many of the content description DSs are linked to Ds which are, in turn, linked to DSs in a content management group.

Content Management
MPEG-7 DSs for content management include tools to describe information pertaining to creation and production, media coding, storage and file formats, and content usage. Creation information provides information related to the creators of the content, creation locations, dates, other related material, and so forth. These could be textual annotations or other multimedia content, such as an image of a logo. This also includes information related to classification of the content from a viewer's point of view. Media information describes information including location, storage and delivery formats, compression and coding schemes, and version history based on media profiles. Usage information describes information related to usage rights, usage records, and related financial information. While rights management is not handled explicitly, the Rights DS provides references, in the form of unique identifiers, to external rights owners and regulatory authorities.

Navigation and Access


The DSs under this category facilitate browsing and retrieval of audiovisual content. There are DSs that facilitate browsing in different ways based on summaries, partitions and decompositions and other variations. The Summary DSs support hierarchical and sequential navigation modes. Hierarchical summaries can be described at different levels of granularity, moving from coarse high-level descriptions to more detailed summaries of audiovisual content. Sequential summaries provide a sequence of images and frames synchronised with audio, and facilitate a slide show style of browsing and navigation.

Content Organisation
The DSs under this category facilitate organising and modeling collections of audiovisual content descriptions. The Collection DS helps to describe collections at the level of objects, segments, and events, based on common properties of the elements in the collection.

User Interaction
This set of DSs describes user preferences and usage history, to facilitate personalization of content access, presentation and consumption. For more details on the full list of DSs, the reader is referred to the MPEG-7 URL at http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm and Manjunath et al. (2002).

REPRESENTATION OF MULTIMEDIA SEMANTICS


In the previous section we described the MPEG-7 constructs and the way the MDS is organised from a functional perspective, as presented in various official MPEG-7 documents. In this section we look at the Ds and DSs from the perspective of addressing multimedia semantics and its management. We look at the levels of granularity and abstraction that MPEG-7 Ds and DSs are able to support. The structural aspects of content description are meant to describe content at different levels of granularity, ranging from visual descriptors to temporal segments. The semantic DSs are developed for the purpose of describing content at several abstract levels in free text, but in a structured form. MPEG-7 deals with content semantics by considering narrative worlds. Since MPEG-7 targets the description of multimedia content, which is mostly narrative in nature, it is reasonable for it to view the participants, background, context, and all the other constituents of a narrative as the narrative world. Each narrative world can exist as a distinct semantic description. The components of the semantic descriptions broadly consist of entities that inhabit the narrative worlds, their attributes, and their relationships with each other.

Levels of Granularity
Let us consider a video of a play that consists of four acts. We can segment the video temporally into four parts corresponding to the acts. Each act can be further segmented into scenes. Each scene can be segmented into shots, where a shot is defined as a temporally continuous segment of video captured by a single camera. The shots can in turn be segmented into frames. Finally, each frame can be segmented into spatial regions. Note that each level of the hierarchy lends itself to meaningful semantic description. Each level of granularity also lends itself to distinctive Ds. For instance, we could use the texture descriptor to describe the texture of spatial regions. Such a description is clearly confined to the lowest level of the hierarchy we just described. The 2-D shape descriptors are similarly confined by definition. Each frame can also be described using the scalable color descriptor, which is essentially a color histogram. A shot consisting of several frames, however, has to be described using the group of frames color descriptor, which aggregates the histograms of all the constituent frames using, for instance, the median. Note that while it is possible to extend the color description to a video segment of any length of time, it is most meaningful at the shot level and below.


The MotionActivity descriptor can be used to meaningfully describe any length of video, since it merely captures the pace or action in the video. Thus, a talking-head segment would be described as low action, while a car chase scene would be described as high action. A one-hour movie that mostly consists of car chases could reasonably be described as high action. The motion trajectory descriptor, on the other hand, is meaningful only at the shot level and meaningless at any lower or higher level. In other words, each level of granularity has its own set of appropriate descriptors, which may or may not be appropriate at other levels of the hierarchy. The aim of such description is to enable content retrieval at any desired level of granularity.
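The idea that each level of the segment hierarchy carries its own appropriate descriptors can be pictured with a simple data structure. The sketch below is purely illustrative: it does not use MPEG-7 syntax or schema types, and the descriptor values are hypothetical placeholders.

from dataclasses import dataclass, field

@dataclass
class Segment:
    level: str                 # e.g. "act", "scene", "shot", "frame", "region"
    start: float               # start time in seconds
    end: float
    descriptors: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

play = Segment("video", 0, 7200, {"MotionActivity": "low"}, [
    Segment("act", 0, 1800, children=[
        Segment("scene", 0, 300, children=[
            Segment("shot", 0, 12, {"MotionActivity": "high",
                                    "MotionTrajectory": "hypothetical trajectory data"}),
        ]),
    ]),
])
print(play.children[0].children[0].children[0].descriptors)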

Levels of Abstraction
Note that in the previous section, the hierarchy stemmed from the temporal and spatial segmentation, not from any conceptual point of view. Therefore, such a description does not let us browse the content at the varying levels of semantic abstraction that may exist at a given constant level of temporal granularity. For instance, we may be interested only in dramatic dialogues between character A and character B in one case, and in any interactions between character A and character B in another. Note that the former is an instance of the latter and therefore is at a lower level of abstraction. In the absence of multilayered abstraction, our content browsing would have to be either excessively general, through restriction to the highest level of abstraction, or excessively particular, through restriction to the lowest level of abstraction. Note that to a human being, the definition of too general and too specific depends completely on the need of the moment, and therefore is subject to wide variation. Any useful representation of the content semantics therefore has to be at as many levels of abstraction as possible. Returning to the example of interactions between the characters A and B, we can see that the semantics consists of the entities A and B, with their names being their attributes and with their relationship to each other consisting of the various interactions they have with each other. MPEG-7 considers two types of abstraction. The first is media abstraction, that is, a description that can describe more than one instance of similar content. We can see that the description "all interactions between characters A and B" is an example of media abstraction, since it describes all instances of media in which A and B interact. The second type of abstraction is formal abstraction, in which the pattern common to a set of multimedia examples contains placeholders. The description "interaction between any two of the characters in the play" is an example of such formal abstraction. Since the definition of similarity depends on the level of detail of the description and the application, we can see that these two forms of abstraction allow us to accommodate a wide range of abstraction, from the highly abstract to the highly concrete and detailed. Furthermore, MPEG-7 also provides ways to describe abstract quantities such as properties, through the Property element, and concepts, through the Concept DS. Such quantities do not result from an abstraction of an entity, and so are treated separately. For instance, the beauty of a painting is a property and is not the result of somehow generalizing its constituents. Concepts are defined as collections of properties that define a category of entities but do not completely characterize it. Semantic entities in MPEG-7 mostly consist of narrative worlds, objects, events, concepts, states, places and times. The objects and events are represented by the Object and Event DSs, respectively.

The Object DS and Event DS provide abstraction through a recursive definition that allows, for example, subcategorization of objects into subobjects. In that way, an object can be represented at multiple levels of abstraction. For instance, a continent could be broken down into continent-country-state-district, and so forth, so that it can be described at varying levels of semantic granularity. Note that the Object DS accommodates attributes so as to allow for the abstraction we mentioned earlier, that is, abstraction that is related to properties rather than generalization of constituents such as districts. The hospitable nature of a continent's inhabitants, for instance, cannot result from abstraction of districts to states to countries, and so forth. Semantic entities can be described by labels, by a textual definition, or in terms of properties or of features of the media or segments in which they occur. The SemanticBase DS contains such descriptive elements. The AbstractionLevel data type in the SemanticBase DS describes the kind of abstraction that has been performed in the description of the entity. If it is not present, then the description is considered concrete. If the abstraction is a media abstraction, then the dimension of the AbstractionLevel element is set to zero. If a formal abstraction is present, the dimension of the element is set to 1 or higher; the higher the value, the higher the abstraction. Thus, a value of 2 would indicate an abstraction of an abstraction. The Relation DS rounds off the collection of representation tools for content semantics. Relations capture how semantic entities are connected with each other; examples of relations are doctor-patient, student-teacher, and so forth. Note that since each of the entities in a relation lends itself to multiple levels of abstraction, and the relations in turn have properties, there is further abstraction that results from relations.

APPLICATIONS
As we cover MPEG-7 semantic descriptions, we realize that they provide a rich framework for static description of content semantics. Such a framework has the inherent problem of providing an embarrassment of riches, which makes the management of the browsing very difficult. Since MPEG-7 content semantics is very graph oriented, it is clear that it does not scale well as the number of concepts, events and objects goes up. Creation of a deep hierarchy through very fine semantic subdivision of the objects would result in the same problem of computational intractability. As the content semantic representation is pushed more and more towards a natural language representation, evidence from natural language processing research indicates that the computational intractability will be exacerbated. In our view, therefore, the practical utility of such representation is restricted to cases in which either the concept hierarchies are not unmanageably broad, or the concept hierarchies are not unmanageably deep, or both. Our view is that in interactive systems, the human users will compensate for the shallowness or narrowness of the concept hierarchies through their domain knowledge. Since humans are known to be quick at sophisticated processing of data sets of small size, the semantic descriptions should be at a broad scale to help narrow down the search space. Thereafter, the human can compensate for the absence of detailed semantics through the use of low-level feature-based video browsing techniques such as video summarization. Therefore, MPEG-7 semantic representations would be best used in applications in which a modest hierarchy can help narrow down the search space considerably.


Let us consider some candidate applications.

Educational Applications
At first glance, since education is, after all, intended to be systematic acquisition of knowledge, a semantics-based description of all the content seems reasonable. Our experience indicates that restriction of the description to a narrow topic allows for a rich description within the topic of research and makes for a successful learning experience for the student. In any application in which the intention is to learn abstract concepts, an overly shallow concept hierarchy will be a hindrance. Hence our preference for narrowing the topic itself to limit the breadth of the representation, so as to buy some space for a deeper representation. The so-called edutainment systems fall in the same general category, with varying degrees of compromise between the richness of the descriptions and the size of the database. Such applications include tourist information, cultural services, shopping, social, film and radio archives, and so forth.

Information Retrieval Applications


Applications that require retrieval from an archive based on a specific query, rather than a top-down immersion in the content, typically involve very large databases in which even a small increase in the breadth and depth of the representation would lead to an unacceptable increase in computation. Such applications include journalism, investigation services, professional film and radio archives, surveillance, remote sensing, and so forth. Furthermore, in such applications, the accuracy requirements are much more stringent. Our view is that only a modest MPEG-7 content semantics representation would be feasible for such applications. However, even a modest semantic representation would be a vast improvement over current retrieval systems.

Generation of MPEG-7 Semantic Meta-Data


It is also important to consider how the descriptions would be generated in the first place. Given the state of the art, the semantic metadata would have to be generated manually, which is yet another challenge posed by large-scale systems. Once again, the same strategy of tackling modest databases, creating modest representations, or a combination of both would be reasonable. Moreover, if the generation of the metadata is integrated with its consumption in an interactive application, the user could enhance the metadata over time. This is perhaps a challenge for future researchers.

DISCUSSION AND CONCLUSION


Managing multimedia content has evolved around textual descriptions and/or processing audiovisual information and indexing content using features that can be extracted automatically. The question is how we retrieve the content in a meaningful way: how can we correlate users' semantics with archivists' semantics? Even though MPEG-7 DSs provide a framework to support such descriptions, MPEG-7 is still a standard for describing features of multimedia content. Although there are DSs to describe the metadata related to the content, there is still a gap in describing semantics
that evolves with interaction and the user's context. There is a static aspect to the descriptions, which limits the adaptive flexibility needed for different types of applications. Nevertheless, a standard way to describe the relatively unambiguous aspects of content does provide a starting point for many applications where the focus is content management. The generic nature of MPEG-7 descriptions can be both a strength and a weakness. The comprehensive library of DSs is aimed at supporting a large number of applications, and there are several tools to support the development of the descriptions required for a particular application. However, this requires a deep knowledge of MPEG-7, and the large scope becomes a weakness, as it is impossible to pick and choose from a huge library without understanding the implications of the choices made. As discussed in section 4, a modest set of content descriptions, DSs and elements may often suffice for a given application. This requires an application developer to first develop the descriptions in the context of the application domain, determine the DSs to support the descriptions, and then identify the required elements in those DSs. This is an involved process and cannot be undertaken in isolation from the domain and application context. As MPEG-7 compliant applications start to be developed, there could be context-dependent elements and DSs that are essential to an application but not described in the standard, because the application context cannot be predetermined during the definition stage. In conclusion, these are still early days for MPEG-7 and its deployment in managing the semantic aspects of multimedia applications. As the saying goes, the proof of the pudding is in the eating, and the success of the applications will determine the success of the standard.



Section 3 User-Centric Approach to Manage Semantics



Chapter 9

Visualization, Estimation and User Modeling for Interactive Browsing of Personal Photo Libraries
Qi Tian, University of Texas at San Antonio, USA Baback Moghaddam, Mitsubishi Electric Research Laboratories, USA Neal Lesh, Mitsubishi Electric Research Laboratories, USA Chia Shen, Mitsubishi Electric Research Laboratories, USA Thomas S. Huang, University of Illinois, USA

ABSTRACT

Recent advances in technology have made it possible to easily amass large collections of digital media. These media offer new opportunities for, and place great demands on, new digital content user-interface and management systems that can help people construct, organize, navigate, and share digital collections in an interactive, face-to-face social setting. In this chapter, we have developed a user-centric algorithm for visualization and layout for content-based image retrieval (CBIR) in large photo libraries. Optimized layouts reflect mutual similarities as displayed on a two-dimensional (2D) screen, hence providing a perceptually intuitive visualization as compared to traditional sequential one-dimensional (1D) content-based image retrieval systems. A framework
for user modeling also allows our system to learn and adapt to a user's preferences. The resulting retrieval, browsing and visualization can adapt to the user's (time-varying) notions of content, context and preferences in style and interactive navigation.

INTRODUCTION
Personal Digital Historian (PDH) Project
Recent advances in digital media technology offer opportunities for new story-sharing experiences beyond the conventional digital photo album (Balabanovic et al., 2000; Dietz & Leigh, 2001). The Personal Digital Historian (PDH) project is an ongoing effort to help people construct, organize, navigate and share digital collections in an interactive multiperson conversational setting (Shen et al., 2001; Shen et al., 2003). The research in PDH is guided by the following principles:

1. The display device should enable natural face-to-face conversation: not forcing everyone to face in the same direction (desktop) or at their own separate displays (hand-held devices).
2. The physical sharing device must be convenient and customary to use: helping to make the computer disappear.
3. Easy and fun to use across generations of users: minimizing time spent typing or formulating queries.
4. Enabling interactive and exploratory storytelling: blending authoring and presentation.

Current software and hardware do not meet our requirements. Most existing software in this area provides users with either powerful query methods or authoring tools. In the former case, users can repeatedly query their collections of digital content to retrieve information to show someone (Kang & Shneiderman, 2000). In the latter case, a user experienced in the use of the authoring tool can carefully craft a story out of his or her digital content to show or send to someone at a later time. Furthermore, current hardware is also lacking. Desktop computers are not suitably designed for group, face-to-face conversation in a social setting, and handheld storytelling devices have limited screen sizes and can be used only by a small number of people at once. The objective of the PDH project is to take a step beyond: to provide a new digital content user-interface and management system enabling casual, face-to-face exploration and visualization of digital content. Unlike a conventional desktop user interface, PDH is intended for multiuser collaborative applications on single-display groupware. PDH enables casual and exploratory retrieval, interaction with, and visualization of digital content. We designed our system to work on a touch-sensitive, circular tabletop display (Vernier et al., 2002), as shown in Figure 1. The physical PDH table that we use is a standard tabletop with a top projection (either ceiling mounted or tripod mounted) that displays on a standard whiteboard, as shown in the right image of Figure 1. We use two Mimio (www.mimio.com/meet/mimiomouse) styluses as the input devices for the first set of user experiments.

Figure 1. PDH table: (a) an artistic rendering of the PDH table (designed by Ryan Bardsley, Tixel HCI, www.tixel.net) and (b) the physical PDH table

The layout of the entire tabletop display consists of (1) a large story-space area encompassing most of the tabletop out to the perimeter, and (2) one or more narrow arched control panels (Shen et al., 2001). The PDH table is currently implemented using our DiamondSpin (www.merl.com/projects/diamondspin) circular-table Java toolkit. DiamondSpin is intended for multiuser collaborative applications (Shen et al., 2001; Shen et al., 2003; Vernier et al., 2002). The conceptual model of PDH focuses on developing content organization and retrieval metaphors that can be easily comprehended by users without distracting from the conversation. We adopt a model of organizing the materials using the four questions essential to storytelling: who, when, where, and what (the four Ws).


Figure 2. An example of navigation by the four-Ws model (Who, When, Where, What)

We do not currently support why, which is also useful for storytelling. Control panels located on the perimeter of the table contain buttons labeled people, calendar, location, and events, corresponding to these four questions. When a user presses the location button, for example, the display on the table changes to show a map of the world. Every picture in the database that is annotated with a location appears as a tiny thumbnail at its location. The user can pan and zoom in on the map to a region of interest, which increases the size of the thumbnails. Similarly, by pressing one of the other three buttons, the user can cause the pictures to be organized along a linear timeline by the time they were taken, by the people they contain, or by the event keywords with which the pictures were annotated. We assume the pictures are partially annotated. Figure 2 shows an example of navigation of a personal photo album by the four-Ws model. Adopting this model allows users to think of their documents in terms of how they would like to record them as part of their history collection, not necessarily in a specific hierarchical structure. The user can make selections among the four Ws, and PDH will automatically combine them to form rich Boolean queries implicitly for the user (Shen et al., 2001; Shen, Lesh, Vernier, Forlines, & Frost, 2002; Shen et al., 2003; Vernier et al., 2002).
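The exact query semantics are not spelled out here; purely as an illustration, the sketch below assumes one plausible combination rule (OR within a W category, AND across categories) and a hypothetical dictionary-based annotation format. The function names and data layout are ours, not the PDH implementation.

```python
# Hypothetical sketch only: the annotation format and the OR-within /
# AND-across combination rule are assumptions, not the PDH implementation.

def matches(photo, selections):
    """photo: dict mapping each W ("who", "when", "where", "what") to a set
    of annotation values; selections: dict mapping each W to the set of values
    the user has selected. An empty selection places no constraint on that W."""
    for w in ("who", "when", "where", "what"):
        chosen = selections.get(w, set())
        if chosen and not (chosen & photo.get(w, set())):
            return False          # AND across the four Ws
    return True                   # OR within a W via set intersection

def four_w_query(library, selections):
    return [p for p in library if matches(p, selections)]

# Example: ("Alice" OR "Bob") AND "birthday"
library = [
    {"who": {"Alice"}, "where": {"Sydney"}, "what": {"birthday"}},
    {"who": {"Bob"}, "where": {"Paris"}, "what": {"vacation"}},
]
print(four_w_query(library, {"who": {"Alice", "Bob"}, "what": {"birthday"}}))
```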

The PDH project combines and extends research largely in two areas: (i) human-computer interaction (HCI) and interface design (the design of the shared-display devices, user interfaces for storytelling and online authoring, and story-listening) (Shen et al., 2001, 2002, 2003; Vernier et al., 2002); and (ii) content-based information visualization, presentation and retrieval (user-guided image layout, data mining and summarization) (Moghaddam et al., 2001, 2002, 2004; Tian et al., 2001, 2002). Our work has proceeded along these two lines. The work by Shen et al. (2001, 2002, 2003) and Vernier et al. (2002) focused on the HCI and interface design issues of the first research area. The work in this chapter is set in the context of PDH but focuses on the visualization, smart layout, user modeling and retrieval aspects. In this chapter, we propose a novel visualization and layout algorithm that can enhance informal storytelling using personal digital data such as photos, audio and video in a face-to-face social setting. A framework for user modeling also allows our system to learn and adapt to a user's preferences. The resulting retrieval, browsing and visualization can adapt to the user's (time-varying) notions of content, context and preferences in style and interactive navigation.

Related Work
In content-based image retrieval (CBIR), most current techniques are restricted to matching image appearance using primitive features such as color, texture, and shape. Most users wish to retrieve images by semantic content (the objects and events depicted) rather than by appearance. The resulting semantic gap between user expectations and current technology is the prime cause of the poor uptake of CBIR technology. Because of the semantic gap (Smeulders et al., 2000), visualization becomes very important for the user to navigate the complex query space. New visualization tools are required to allow user-dependent and goal-dependent choices about what to display and how to provide feedback. The query result has an inherent display dimension that is often ignored. Most methods display images in a 1D list in order of decreasing similarity to the query images. Enhancing the visualization of the query results is, however, a valuable tool in helping the user navigate query space. Recently, Horoike and Musha (2000), Nakazato and Huang (2001), Santini and Jain (2000), Santini et al. (2001), and Rubner (1999) have also explored content-based visualization. A common observation in these works is that the images are displayed in a 2D or 3D space obtained by projection from the high-dimensional feature space. Images are placed in such a way that distances between images in 2D or 3D reflect their distances in the high-dimensional feature space. In the works of Horoike and Musha (2000) and Nakazato and Huang (2001), users can view large sets of images in 2D or 3D space and user navigation is allowed. In the works of Nakazato and Huang (2001) and Santini et al. (2000, 2001), the system allows user interaction on image locations and the forming of new groups. In the work of Santini et al. (2000, 2001), users can manipulate the projected distances between images and learn from such a display. Our work (e.g., Tian et al., 2001, 2002; Moghaddam et al., 2001, 2002, 2004) under the context of PDH shares many common features with this related work (Horoike & Musha, 2000; Nakazato & Huang, 2001; Santini et al., 2000, 2001; Rubner, 1999). However, a learning mechanism from the display is not implemented in Horoike and Musha (2000), and 3D MARS (Nakazato & Huang, 2001) is an extension of our work (Tian et al., 2001; Moghaddam et al., 2001) from 2D to 3D space. Our system differs from the work of Rubner (1999) in that we adopted different mapping methods. Our work shares some features with the work by Santini and Jain (2000) and Santini et al. (2001), except that our PDH system is currently being incorporated into a much broader system for computer- and human-guided navigation, browsing, archiving, and interactive storytelling with large photo libraries. The part of this system described in the remainder of this chapter is, however, specifically geared towards adaptive user modeling and relevance estimation, and is based primarily on
visual features as opposed to semantic annotation as in Santini and Jain (2000) and Santini et al. (2001). The rest of the chapter is organized as follows. In Content-Based Visualization, we present designs for uncluttered visualization and layout of images (or iconic data in general) in a 2D display space for content-based image retrieval (Tian et al., 2001; Moghaddam et al., 2001). In Context and User Modeling, we further provide a mathematical framework for user modeling, which adapts to and mimics the user's (possibly changing) preferences and style for interaction, visualization and navigation (Moghaddam et al., 2002, 2004; Tian et al., 2002). Monte Carlo simulations in the Statistical Analysis section, plus the next section on the User Preference Study, demonstrate the ability of our framework to model or mimic users by automatically generating layouts according to their preferences. Finally, Discussion and Future Work are given in the final section.

CONTENT-BASED VISUALIZATION
With the advances in technology to capture, generate, transmit and store large amounts of digital imagery and video, research in content-based image retrieval (CBIR) has gained increasing attention. In CBIR, images are indexed by their visual content, such as color, texture, and so forth. Many research efforts have addressed how to extract these low-level features (Stricker & Orengo, 1995; Smith & Chang, 1994; Zhou et al., 1999), evaluate distance metrics for similarity measures (Santini & Jain, 1999; Popescu & Gader, 1998) and devise efficient searching schemes (Squire et al., 1999; Swets & Weng, 1999). In this section, we present a user-centric algorithm for visualization and layout for content-based image retrieval. Image features (visual and/or semantic) are used to display retrievals as thumbnails in a 2D spatial layout or configuration which conveys pair-wise mutual similarities. A graphical optimization technique is used to provide maximally uncluttered and informative layouts. We should note that one physical instantiation of the PDH table is a roundtable, for which we have in fact experimented with polar-coordinate conformal mappings for converting traditional rectangular display screens. However, in the remainder of this chapter, for ease of illustration and clarity, all layouts and visualizations are shown on rectangular displays only.

Traditional Interfaces
The purpose of automatic content-based visualization is to augment the user's understanding of large information spaces that cannot be perceived by traditional sequential display (e.g., by rank order of visual similarities). The standard and commercially prevalent image management and browsing tools currently available primarily use tiled sequential displays, that is, essentially a simple 1D similarity-based visualization. However, the user can quite often benefit from a global view of a working subset of retrieved images in a way that reflects the relations between all pairs of images, that is, N² measurements as opposed to only N. Moreover, even a narrow view of one's immediate surroundings defines context and can offer an indication of how to explore the dataset. The wider this visible horizon, the more efficient the new query will be
formed. Rubner (1999) proposed a 2D display technique based on multidimensional scaling (MDS) (Torgeson, 1998). A global 2D view of the images is achieved that reflects the mutual similarities among the retrieved images. MDS is a nonlinear transformation that minimizes the stress between the high-dimensional feature space and the low-dimensional display space. However, MDS is rotation invariant, nonrepeatable (nonunique), and often slow. Most critically, MDS (as well as some of the other leading nonlinear dimensionality reduction methods) provides high-to-low-dimensional projection operators that are not analytic or functional in form, but are rather defined on a point-by-point basis for each given dataset. This makes it very difficult to project a new dataset in a functionally consistent way (without having to build a post-hoc projection or interpolation function for the forward mapping each time). We feel that these drawbacks make MDS (and other nonlinear methods) an unattractive option for real-time browsing and visualization of high-dimensional data such as images.

Improved Layout and Visualization


We propose an alternative 2D display scheme based on Principal Component Analysis (PCA) (Jolliffe, 1996). Moreover, a novel window display optimization technique is proposed which provides a more perceptually intuitive, visually uncluttered and informative visualization of the retrieved images. Traditional image retrieval systems display the returned images as a list, sorted by decreasing similarity to the query. The traditional display has one major drawback. The images are ranked by similarity to the query, and relevant images (as for example used in a relevance feedback scenario) can appear at separate and distant locations in the list. We propose an alternative technique to MDS (Torgeson, 1998) that displays mutual similarities on a 2D screen based on visual features extracted from images. The retrieved images are displayed not only in ranked order of similarity from the query but also according to their mutual similarities, so that similar images are grouped together rather than being scattered along the entire returned 1D list.

Visual Features
We will first describe the low-level visual feature extraction used in our system. There are three visual features used in our system: color moments (Stricker & Orengo, 1995), wavelet-based texture (Smith & Chang, 1994), and the water-filling edge-based structure feature (Zhou et al., 1999). The color space we use is HSV because of its decorrelated coordinates and its perceptual uniformity (Stricker & Orengo, 1995). We extract the first three moments (mean, standard deviation and skewness) from the three color channels and therefore have a color feature vector of length 3 × 3 = 9. For the wavelet-based texture, the original image is fed into a wavelet filter bank and is decomposed into 10 decorrelated subbands. Each subband captures the characteristics of a certain scale and orientation of the original image. For each subband, we extract the standard deviation of the wavelet coefficients and therefore have a texture feature vector of length 10. For the water-filling edge-based structure feature vector, we first pass the original images through an edge detector to generate their corresponding edge maps. We extract eighteen (18) elements from the edge maps, including max fill time, max fork count, and
so forth. For a complete description of this edge feature vector, interested readers are referred to Zhou et al. (1999).
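As an illustration of the first of these descriptors, a minimal sketch of the nine color moments is given below. The function name and the NumPy-based implementation are ours; the exact normalization in the system described here may differ.

```python
import numpy as np

def color_moments(hsv_image):
    """Nine color moments (mean, standard deviation and a skewness-like third
    moment for each HSV channel), in the spirit of Stricker & Orengo (1995).
    hsv_image: float array of shape (H, W, 3), already converted to HSV."""
    feats = []
    for c in range(3):
        channel = hsv_image[..., c].ravel()
        mean = channel.mean()
        std = channel.std()
        third = np.mean((channel - mean) ** 3)        # third central moment
        skew = np.sign(third) * np.abs(third) ** (1.0 / 3.0)
        feats.extend([mean, std, skew])
    return np.array(feats)                            # length 3 x 3 = 9
```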

Dimension Reduction and PCA Splats


To create such a 2D layout, Principal Component Analysis (PCA) (Jolliffe, 1996) is first performed on the retrieved images to project them from the high-dimensional feature space onto the 2D screen. Image thumbnails are placed on the screen so that the screen distances reflect as closely as possible the similarities between the images. If the computed similarities from the high-dimensional feature space agree with our perception, and if the resulting feature dimension reduction preserves these similarities reasonably well, then the resulting spatial display should be informative and useful. In our experiments, the 37 visual features (nine color moments, 10 wavelet moments and 18 water-filling features) are pre-extracted from the image database and stored off-line. Any 37-dimensional feature vector for an image, when taken in context with other images, can be projected onto the 2D {x, y} screen based on the first two principal components normalized by the respective eigenvalues. Such a layout is denoted a PCA Splat. We implemented both linear and nonlinear projection methods, using PCA and Kruskal's algorithm (Torgeson, 1998). Projection using a nonlinear method such as Kruskal's algorithm is an iterative procedure that is slow to converge and converges only to a local minimum; the result therefore depends strongly on the initial starting point and cannot be repeated. In contrast, PCA has several advantages over nonlinear methods like MDS. It is a fast, efficient and unique linear transformation that achieves the maximum distance preservation from the original high-dimensional feature space to 2D space among all possible linear transformations (Jolliffe, 1996). The fact that it fails to model nonlinear mappings (which MDS succeeds at) is in our opinion a minor compromise given the advantages of real-time, repeatable and mathematically tractable linear projections. We should add that nonlinear dimensionality reduction (NLDR) is by itself a very large area of research and mostly beyond the scope of this chapter; we comment on MDS only because of its previous use by Rubner (1999) for CBIR. Use of other iterative NLDR techniques such as principal curves or bottleneck auto-associative feedforward networks is usually precluded by the need to perform real-time and repeatable projections. More recent advances such as IsoMap (Tennenbaum et al., 2000) and Local Linear Embedding (LLE) (Roweis & Saul, 2000) are also not amenable to real-time or closed-form computation. The most recent techniques, such as Laplacian Eigenmaps (Belkin & Niyogi, 2003) and charting (Brand, 2003), have only just begun to be used and may promise advances useful in this application domain, although we should hasten to add that the formulation of subspace weights and their estimation (see the section on Context and User Modeling) is not as straightforward as in the case of linear dimension reduction (LDR) methods like PCA. Let us consider a scenario of a typical image-retrieval engine at work, in which an actual user is providing relevance feedback for the purposes of query refinement. Figure 3 shows an example of the images retrieved by the system (which resembles most traditional browsers in its 1D tile-based layout). The database is a collection of 534 images. The first image (building) is the query. The other nine relevant images are ranked in second, third, fourth, fifth, ninth, 10th, 17th, 19th and 20th places, respectively.

Figure 3. Top 20 retrieved images (ranked top to bottom and left to right; query is shown first in the list)

Figure 4. PCA Splat of top 20 retrieved images in Figure 3

Figure 4 shows an example of a PCA Splat for the top 20 retrieved images shown in Figure 3. In addition to visualization by layout, in this particular example the sizes (alternatively, the contrast) of the images are determined by their visual similarity to the query: the higher the rank, the larger the size (or the higher the contrast). There is also a number next to each image in Figure 4 indicating its corresponding rank in Figure 3.
The view of the query image, that is, the top-left one in Figure 3, is blocked by the images ranked 19th, fourth, and 17th in Figure 4. A better view is achieved in Figure 7 after display optimization. Clearly, the relevant images are now better clustered in this new layout, as opposed to being dispersed along the tiled 1D display in Figure 3. Additionally, PCA Splats convey N² mutual distance measures relating all pair-wise similarities between images, whereas the ranked 1D display in Figure 3 provides only N.
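The projection step behind a PCA Splat (the first two principal components, each normalized by its eigenvalue, as described above) can be sketched in a few lines. The function name and NumPy-based implementation are ours, and thumbnail rendering, sizing by rank and overlap handling are deliberately omitted.

```python
import numpy as np

def pca_splat(features):
    """Project an (N x D) matrix of feature vectors (D = 37 in this chapter)
    onto its first two principal components, normalized by their eigenvalues,
    to obtain 2D screen coordinates for a PCA Splat."""
    X = features - features.mean(axis=0)        # center the data
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:2]         # indices of the top two components
    coords = X @ eigvecs[:, idx]                # N x 2 projection
    return coords / eigvals[idx]                # normalize by the eigenvalues
```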

Display Optimization
However, one drawback of the PCA Splat is that some images can be partially or totally overlapped, which makes it difficult to view all the images at the same time. The overlap becomes even worse when the number of retrieved images grows larger, for example, beyond 50. To solve the overlapping problem between the retrieved images, a novel optimization technique is proposed in this section. Given a set of retrieved images and their corresponding sizes and positions, our optimizer tries to find a solution that places the images at appropriate positions while deviating as little as possible from their initial PCA Splat positions. Assume the number of images is N. The image positions are represented by their center coordinates (x_i, y_i), i = 1, ..., N, and the initial image positions are denoted (x_i^o, y_i^o), i = 1, ..., N. The minimum and maximum coordinates of the 2D screen are [x_min, x_max, y_min, y_max]. The image size is represented, for simplicity, by its radius r_i, i = 1, ..., N, and the maximum and minimum image sizes are r_max and r_min in radius, respectively. The initial image size is r_i^o, i = 1, ..., N. To minimize the overlap, the images can be automatically moved away from each other, but this increases the deviation of the images from their initial positions. Large deviation is certainly undesirable because the initial positions provide important information about the mutual similarities between images. So there is a trade-off between minimizing overlap and minimizing deviation. Without increasing the overall deviation, an alternative way to minimize the overlap is to simply shrink the image size as needed, down to a minimum size limit. The image size is not increased in the optimization process because this would always increase the overlap. For this reason, the initial image size r_i^o is assumed to be r_max. The total cost function is designed as a linear combination of the individual cost functions, taking into account two factors. The first factor is to keep the overall overlap between the images on the screen as small as possible. The second factor is to keep the overall deviation from the initial positions as small as possible.

J = F(p) + \lambda \, S \, G(p) \qquad (1)

where F(p) is the cost function of the overall overlap, G(p) is the cost function of the overall deviation from the initial image positions, and S is a scaling factor that brings the range of G(p) to the same range as F(p); S is chosen to be (N-1)/2. λ is a weight with λ ≥ 0. When λ is zero, the deviation of the images is not considered in the overlap minimization. When λ is less than one, minimizing the overall overlap is more important than minimizing the overall deviation, and vice versa when λ is greater than one.


Figure 5. Cost function of overlap function f(p)

The cost function of the overall overlap is designed as

F(p) = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} f(p) \qquad (2)

f(p) = \begin{cases} 1 - e^{-u^2 / \sigma_f} & u > 0 \\ 0 & u \le 0 \end{cases} \qquad (3)

where u = r_i + r_j - \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2} is a measure of overlapping. When u ≤ 0, there is no overlap between the i-th image and the j-th image, thus the cost is 0. When u > 0, there is partial overlap between the i-th image and the j-th image. When u = 2 r_max, the i-th image and the j-th image are totally overlapped. σ_f is a curvature-controlling factor. Figure 5 shows the plot of f(p). With increasing u (u > 0), the cost of overlap also increases. From Figure 5, σ_f in Equation (3) is calculated by setting T = 0.95 when u = r_max:

\sigma_f = \left. \frac{-u^2}{\ln(1 - T)} \right|_{u = r_{\max}} \qquad (4)

The cost function of the overall deviation is designed as

G(p) = \sum_{i=1}^{N} g(p) \qquad (5)


Figure 6. Cost function of deviation function g(p)

g(p) = 1 - e^{-v^2 / \sigma_g} \qquad (6)

where v = \sqrt{(x_i - x_i^o)^2 + (y_i - y_i^o)^2} is the measure of deviation of the i-th image from its initial position, and σ_g is a curvature-controlling factor. (x_i, y_i) and (x_i^o, y_i^o) are the optimized and initial center coordinates of the i-th image, respectively, i = 1, ..., N. Figure 6 shows the plot of g(p). With increasing v, the cost of deviation also increases. From Figure 6, σ_g in Equation (6) is calculated by setting T = 0.95 when v = maxsep. In our work, maxsep is set to 2 r_max.

\sigma_g = \left. \frac{-v^2}{\ln(1 - T)} \right|_{v = \mathrm{maxsep}} \qquad (7)
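For concreteness, a minimal sketch of the total cost in Equations (1)-(7) is given below. The function and parameter names (including lam for the weight λ) are our own; the implementation details of the authors' system may differ.

```python
import numpy as np

def layout_cost(pos, radii, pos0, r_max, lam, T=0.95):
    """Total cost J = F(p) + lam * S * G(p) from Equations (1)-(7).
    pos, pos0: (N, 2) arrays of current and initial image centers;
    radii: (N,) array of image radii; lam: the nonnegative weight."""
    N = len(pos)
    sigma_f = -r_max ** 2 / np.log(1 - T)          # Equation (4)
    maxsep = 2 * r_max
    sigma_g = -maxsep ** 2 / np.log(1 - T)         # Equation (7)

    # F(p): pairwise overlap cost, Equations (2)-(3)
    F = 0.0
    for i in range(N - 1):
        for j in range(i + 1, N):
            u = radii[i] + radii[j] - np.linalg.norm(pos[i] - pos[j])
            if u > 0:
                F += 1 - np.exp(-u ** 2 / sigma_f)

    # G(p): deviation cost, Equations (5)-(6)
    v = np.linalg.norm(pos - pos0, axis=1)
    G = np.sum(1 - np.exp(-v ** 2 / sigma_g))

    S = (N - 1) / 2.0                               # scales G to the range of F
    return F + lam * S * G
```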

The optimization process minimizes the total cost J by finding a (locally) optimal set of image sizes and positions. The nonlinear optimization was implemented with an iterative gradient descent method (with line search). Once converged, the images are redisplayed based on the new optimized sizes and positions. Figure 7 shows the optimized PCA Splat for Figure 3. The image with a yellow frame is the query image of Figure 3. Clearly, the overlap is minimized while the relevant images remain close to each other, allowing a global view. With such a display, the user can see the relations between the images, better understand how the query performed, and subsequently formulate future queries more naturally. Additionally, attributes such as contrast and brightness can be used to convey rank. We note that this additional visual aid is essentially a third dimension of information display. For example, images with higher rank could be displayed with larger size or increased brightness to make them stand out from the rest of the layout.


Figure 7. Optimized PCA Splat of Figure 3

An interesting example is to display time or timeliness by associating the size or brightness with how long ago the picture was taken, so that images from the past would appear smaller or dimmer than those taken recently. A full discussion of the resulting enhanced layouts is deferred to future work. We should also point out that, despite our ability to clean up layouts for maximal visibility with the optimizer we have designed, all subsequent figures in this chapter show Splats without any overlap minimization, because the absolute positions were necessary and important for illustrating (as well as comparing) the accuracy of the estimation results in subsequent sections.
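The optimizer described above uses gradient descent with line search on J; as an illustration only, the following sketch minimizes the layout_cost() function from the previous sketch with a fixed step size and finite-difference gradients, and keeps the image sizes fixed (whereas the full method also shrinks them).

```python
import numpy as np

def optimize_layout(pos0, radii, r_max, lam=1.0, steps=200, lr=0.5, eps=1e-3):
    """Numerical-gradient descent on the image positions only.
    Reuses layout_cost() from the sketch above; all parameter values are
    illustrative defaults, not the chapter's settings."""
    pos = pos0.astype(float).copy()
    for _ in range(steps):
        base = layout_cost(pos, radii, pos0, r_max, lam)
        grad = np.zeros_like(pos)
        for i in range(pos.shape[0]):
            for c in range(2):                     # finite-difference gradient
                trial = pos.copy()
                trial[i, c] += eps
                grad[i, c] = (layout_cost(trial, radii, pos0, r_max, lam) - base) / eps
        pos -= lr * grad                           # fixed step instead of line search
    return pos
```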

CONTEXT AND USER MODELING


Image content and meaning are ultimately based on semantics. The user's notion of content is a high-level concept, which is quite often removed by many layers of abstraction from simple low-level visual features. Even near-exhaustive semantic (keyword) annotations can never fully capture context-dependent notions of content. The same image can mean a number of different things depending on the particular circumstance. The visualization and browsing operation should be aware of which features (visual and/or semantic) are relevant to the user's current focus (or working set) and which should be ignored. In the space of all possible features for an image, this problem can be formulated as a subspace identification or feature weighting technique, which is described fully in this section.

Estimation of Feature Weights


By user modeling or context awareness we mean that our system must be constantly aware of, and adapting to, the changing concepts and preferences of the user. A typical example of this human-computer synergy is having the system learn from a user-generated layout in order to visualize new examples based on the identified relevant/irrelevant features. In other words, we design smart browsers that mimic the user and, over time, adapt to his or her style or preference for browsing and query display. Given information from the layout, for example, positions and mutual distances between images, a novel feature weight estimation scheme, denoted α-estimation, is proposed, where α is a weighting vector for the features, for example, color, texture and structure (and semantic keywords). We now describe the subspace estimation of α for visual features only, for example, color, texture, and structure, although it should be understood that the features could include visual, audio and semantic features or any hybrid combination thereof. In theory, the estimation of weights can be done for all the visual features if enough images are given in the layout. The mathematical formulation of this estimation problem follows. The weighting vector is α = {α_1, α_2, ..., α_L}, where L is the total length of the color, texture, and structure feature vector, for example, L = 37 in this chapter. The number of images in the preferred clustering is N, and X is an L × N matrix whose i-th column is the feature vector of the i-th image, i = 1, ..., N. The distance (for example, Euclidean) between the i-th image and the j-th image, for i, j = 1, ..., N, in the preferred clustering (the distance in 2D space) is d_ij. The weights α_1, α_2, ..., α_L are constrained such that they always sum to 1. We then define an energy term to minimize with an L_p norm (with p = 2). This cost function is defined in Equation (8). It is a nonnegative quantity that indicates how well mutual distances are preserved in going from the original high-dimensional feature space to 2D space. Note that this cost function is similar to MDS stress, but unlike MDS, the minimization seeks the optimal feature weights α. Moreover, the low-dimensional projections in this case are already known. The optimal weighting parameter α recovered is then used to weight the original feature vectors before applying a PCA Splat, which will result in the desired layout.

J = \sum_{i=1}^{N} \sum_{j=1}^{N} \left( d_{ij}^{\,p} - \sum_{k=1}^{L} \alpha_k^p \, \lvert X_i(k) - X_j(k) \rvert^p \right)^2 \qquad (8)

The global minimum of this cost function, corresponding to the optimal weight parameter α, is easily obtained using constrained (nonnegative) least squares. To minimize J, we take the partial derivatives of J with respect to α_l^p for l = 1, ..., L and set them to zero:
\frac{\partial J}{\partial \alpha_l^p} = 0, \quad l = 1, \ldots, L \qquad (9)


We thus have

\sum_{k=1}^{L} \alpha_k^p \sum_{i=1}^{N} \sum_{j=1}^{N} \lvert X_i(l) - X_j(l) \rvert^p \, \lvert X_i(k) - X_j(k) \rvert^p = \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{\,p} \, \lvert X_i(l) - X_j(l) \rvert^p, \quad l = 1, \ldots, L \qquad (10)

Define

R(l, k) = \sum_{i=1}^{N} \sum_{j=1}^{N} \lvert X_i(l) - X_j(l) \rvert^p \, \lvert X_i(k) - X_j(k) \rvert^p \qquad (11)

r(l) = \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}^{\,p} \, \lvert X_i(l) - X_j(l) \rvert^p \qquad (12)

and subsequently simplify Equation (10) to:

\sum_{k=1}^{L} R(l, k) \, \alpha_k^p = r(l), \quad l = 1, \ldots, L \qquad (13)

Using the following matrix/vector definitions

\tilde{\alpha} = \begin{bmatrix} \alpha_1^p \\ \alpha_2^p \\ \vdots \\ \alpha_L^p \end{bmatrix}, \quad r = \begin{bmatrix} r(1) \\ r(2) \\ \vdots \\ r(L) \end{bmatrix} \quad \text{and} \quad R = \begin{bmatrix} R(1,1) & R(1,2) & \cdots & R(1,L) \\ R(2,1) & R(2,2) & \cdots & R(2,L) \\ \vdots & \vdots & \ddots & \vdots \\ R(L,1) & R(L,2) & \cdots & R(L,L) \end{bmatrix}

Equation (13) is simplified to


R \, \tilde{\alpha} = r \qquad (14)

Subsequently, the vector of weights α_l^p is obtained as a constrained (nonnegative) linear least-squares solution of the above system. The weighting vector α is then simply determined, element by element, as the p-th root, where we typically use p = 2. We note that there is an alternative approach to estimating the subspace weighting vector in the sense of minimum deviation, which we have called deviation-based α-estimation. The cost function in this case is defined as follows:

J = \sum_{i=1}^{N} \left\lvert \, p^{(i)}(x, y) - \hat{p}^{(i)}(x, y) \, \right\rvert \qquad (15)


where p^(i)(x, y) and p̂^(i)(x, y) are the original and projected 2D locations of the i-th image, respectively. This formulation is a more direct approach to estimation, since it deals with the final positions of the images in the layout. Unfortunately, however, this approach requires the simultaneous estimation of both the weight vector and the projection basis, and consequently requires less accurate iterative re-estimation techniques (as opposed to the more robust closed-form solution possible with Equation (8)). A full derivation of the solution for our deviation-based estimation is shown in Appendix A. Comparing the two estimation methods, stress-based and deviation-based: the former is the more useful, robust, and identifiable, in the control-theory sense of the word, whereas the latter uses a somewhat unstable re-estimation framework and does not always give satisfactory results. However, we still provide a detailed description for the sake of completeness; the shortcomings of this latter method are immediately apparent from the solution requirements, and this discussion can be found in Appendix A. For the reasons mentioned above, in all the experiments reported in this chapter, we use only the stress-based method of Equation (8) for estimation. We note that in principle it is possible to use a single weight for each dimension of the feature vector. However, this would lead to a poorly determined estimation problem, since it is unlikely (and/or undesirable) to have that many sample images from which to estimate all the individual weights. Even with plenty of examples (an over-determined system), chances are that the estimated weights would generalize poorly to a new set of images; this is the same principle as in a modeling or regression problem, where the order of the model or number of free parameters should be less than the number of available observations. Therefore, in order to avoid the problem of over-fitting and the subsequent poor generalization to new data, it is best to use fewer weights. In this respect, the fewer weights (or the more subspace groupings) there are, the better the generalization performance. Since all 37 visual features originate from three different (independent) visual attributes, color, texture and structure, it seems prudent to use three weights corresponding to these three subspaces. Furthermore, this number is sufficiently small to almost guarantee that we will always have enough images in one layout from which to estimate these three weights. Therefore, in the remaining portion of the chapter, we estimate only a weighting vector α = {α_c, α_t, α_s}^T, where α_c is the weight for the color feature of length L_c, α_t is the weight for the texture feature of length L_t, and α_s is the weight for the structure feature of length L_s, respectively. These weights α_c, α_t, α_s are constrained such that they always sum to 1, and L = L_c + L_t + L_s. Figure 8 shows a simple user layout where three car images are clustered together despite their different colors. The same is done with three flower images (despite their differing texture/structure). These two clusters maintain a sizeable separation, suggesting two separate concept classes implied by the user's placement. Specifically, in this layout the user is clearly concerned with the distinction between car and flower, regardless of color or other possible visual attributes. Applying the α-estimation algorithm to Figure 8, the feature weights learned from this layout are α_c = 0.3729, α_t = 0.5269 and α_s = 0.1002.
This shows that the most important feature in this case is texture and not color, which is in accord with the concepts of car versus flower as graphically indicated by the user in Figure 8.
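A minimal sketch of the stress-based estimation (Equations (8)-(14)) for the three-subspace case is given below. It uses SciPy's nonnegative least-squares routine as a stand-in for the constrained least-squares step and renormalizes the recovered weights to sum to one (our assumption about how the constraint is imposed); function and variable names are ours.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_alpha(X, d2d, groups, p=2):
    """Recover subspace weights from a 2D layout.
    X: (L, N) feature matrix; d2d: (N, N) pairwise 2D layout distances;
    groups: list of index arrays for the color, texture and structure subspaces."""
    G = len(groups)
    N = X.shape[1]
    # Per-subspace aggregated distances: sum_k |X_i(k) - X_j(k)|^p over each group
    D = np.zeros((G, N, N))
    for g, idx in enumerate(groups):
        diff = X[idx][:, :, None] - X[idx][:, None, :]
        D[g] = np.sum(np.abs(diff) ** p, axis=0)
    # R and r as in Equations (11)-(12), restricted to the three subspaces
    R = np.array([[np.sum(D[l] * D[k]) for k in range(G)] for l in range(G)])
    r = np.array([np.sum((d2d ** p) * D[l]) for l in range(G)])
    alpha_p, _ = nnls(R, r)                 # solve R * alpha^p = r with alpha^p >= 0
    alpha = alpha_p ** (1.0 / p)            # p-th root
    return alpha / alpha.sum()              # impose the sum-to-one constraint
```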


Figure 8. An example of a user-guided layout

Figure 9. PCA Splat on a larger set of images using (a) estimated weights (b) arbitrary weights


Now that we have the learned feature weights (or have modeled the user), what can we do with them? Figure 9 shows an example of a typical application: automatic layout of a larger (more complete) set of images in the style indicated by the user. Figure 9(a) shows the PCA Splat using the learned feature weights for 18 cars and 19 flowers. It is obvious that the PCA Splat using the estimated weights captures the essence of the configuration layout in Figure 8. Figure 9(b) shows a PCA Splat of the same images but with a randomly generated α, denoting an arbitrary but coherent 2D layout, which in this case favors color (α_c = 0.7629). This comparison reveals that proper feature weighting is an important factor in generating user-desired and sensible layouts. We should point out that a random α does not generate a random layout, but rather one that is still coherent, displaying consistent groupings or clustering.

Figure 10. (a) An example layout. Computer-generated layout based on (b) reconstruction using learned feature weights, and (c) the control (arbitrary weights)

Here we have used such random layouts as substitutes for alternative (arbitrary) layouts that are nevertheless valid (differing only in the relative contribution of the three features to the final design of the layout). Given the difficulty of obtaining the hundreds (let alone thousands) of real user layouts needed for more complete statistical tests (such as those in the next section), random layouts are the only conceivable way of simulating a layout by a real user in accordance with familiar visual criteria such as color, texture or structure. Figure 10(a) shows an example of another layout. Figure 10(b) shows the corresponding computer-generated layout of the same images with their high-dimensional feature vectors weighted by the estimated α, which is recovered solely from the 2D configuration of Figure 10(a). In this instance the reconstruction of the layout is near perfect, thus demonstrating that our high-dimensional subspace feature weights can in fact be recovered from pure 2D information. For comparison, Figure 10(c) shows the PCA Splat of the same images with their high-dimensional feature vectors weighted by a random α. Figure 11 shows another example of a user-guided layout. Assume that the user is describing her family story to a friend. In order not to disrupt the conversational flow, she lays out only a few photos from her personal photo collection and expects the computer to generate a similar and consistent layout for a larger set of images from the same collection. Figure 11(b) shows the computer-generated layout based on the feature weights learned from the configuration of Figure 11(a). The computer-generated layout is achieved using the α-estimation scheme followed by a linear (for example, affine) or nonlinear transformation. Only the 37 visual features (nine color moments (Stricker & Orengo, 1995), 10 wavelet moments (Smith & Chang, 1994) and 18 water-filling features (Zhou et al., 1999)) were used for this PCA Splat.

Figure 11. User modeling for automatic layout: (a) a user-guided layout; (b) computer-generated layout for a larger set of photos (four classes and two photos from each class)


Clearly, the computer-generated layout is similar to the user layout, with the visually similar images positioned at the user-indicated locations. We should add that in this example no semantic features (keywords) were used, but it is clear that their addition would only enhance such a layout.

STATISTICAL ANALYSIS
Given the lack of a sufficiently large (and willing) pool of human subjects, we undertook a Monte Carlo approach to testing our user-modeling and estimation method. Monte Carlo simulation (Metropolis & Ulam, 1949) randomly generates values for uncertain variables over and over to simulate a model. We thereby simulated 1,000 computer-generated layouts (representing ground-truth values of α), which were meant to emulate 1,000 actual user layouts or preferences. In each case, α-estimation was performed to recover the original values as well as possible. Note that this recovery is only partially effective due to the information loss in projecting down to a 2D space. As a control, 1,000 randomly generated feature weights were used to see how well they could match the user layouts (i.e., by chance alone). Our primary test database consists of 142 images from the COREL database. It has seven categories: car, bird, tiger, mountain, flower, church and airplane. Each class has about 20 images. Feature extraction based on color, texture and structure has been done off-line and prestored. Although we report on this test data set due to its common use and familiarity within the CBIR community, we should emphasize that we have also successfully tested our methodology on larger and much more heterogeneous image libraries (for example, real personal photo collections of 500+ images, including family, friends, vacations, etc.). Depending on the particular domain, one can obtain different degrees of performance, but one thing is certain: for narrow application domains (for example, medical imagery, logos, trademarks, etc.) it is quite easy to construct systems which work extremely well by taking advantage of the limiting constraints in the imagery.

Figure 12. Scatter plot of α-estimation: estimated weights versus original weights

The following is the Monte Carlo procedure that was used for testing the significance and validity of user modeling with α-estimation:

1. Randomly select M images from the database.
2. Generate arbitrary (random) feature weights α in order to simulate a user layout.
3. Do a PCA Splat using this ground-truth α.
4. From the resulting 2D layout, estimate α and denote the estimate α̂.
5. Select a new, distinct (nonoverlapping) set of M images from the database.
6. Do PCA Splats on the second set using the original α, the estimated α̂, and a third random α' (as control).
7. Calculate the resulting stress (Equation (8)) and layout deviation (2D position error, Equation (15)) for the original, estimated and random (control) values α, α̂ and α', respectively. Repeat 1,000 times.
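As a rough illustration of this procedure, the sketch below strings together the pca_splat() and estimate_alpha() functions from the earlier sketches; the Dirichlet draw for the random ground-truth weights and the weighting of each subspace by simple scaling are our own assumptions, and the stress and deviation scoring of the returned layouts (Equations (8) and (15)) is left out for brevity.

```python
import numpy as np

def weighted(X, alpha, groups):
    """Scale each feature subspace by its weight (an illustrative choice)."""
    Xw = X.astype(float).copy()
    for a, idx in zip(alpha, groups):
        Xw[idx] *= a
    return Xw

def one_trial(X, groups, M, rng):
    """One Monte Carlo trial following steps 1-7 above; returns the three
    layouts on the second image set, from which stress and deviation scores
    would then be computed."""
    cols = rng.choice(X.shape[1], size=2 * M, replace=False)
    first, second = cols[:M], cols[M:]
    alpha_true = rng.dirichlet(np.ones(len(groups)))                   # step 2
    layout = pca_splat(weighted(X[:, first], alpha_true, groups).T)    # step 3
    d2d = np.linalg.norm(layout[:, None] - layout[None, :], axis=-1)
    alpha_hat = estimate_alpha(X[:, first], d2d, groups)               # step 4
    alpha_rand = rng.dirichlet(np.ones(len(groups)))                   # control
    return {name: pca_splat(weighted(X[:, second], a, groups).T)       # step 6
            for name, a in (("true", alpha_true),
                            ("estimated", alpha_hat),
                            ("random", alpha_rand))}
```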

The scatter plot of the α-estimation results is shown in Figure 12. Clearly, there is a direct linear relationship between the original weights and the estimated weights. Note that when the original weight is very small (< 0.1) or very large (> 0.9), the estimated weight is correspondingly zero or one. This means that when one particular feature weight is very large (or very small), the corresponding feature becomes the most dominant (or least dominant) feature in the PCA, and therefore the estimated weight for this feature will be either one or zero. This saturation phenomenon in Figure 12 is seen to occur more prominently for the case of structure (lower left of the rightmost panel), possibly because the structure feature vector is (relatively) so high dimensional. Additionally, structure features are not as well defined as color and texture (e.g., they have less discriminating power). In terms of actual measures of stress and deviation, we found that the α-estimation scheme yielded the smaller deviation 78.4% of the time and the smaller stress 72.9% of the time. The main reason these values are less than 100% is the nature of the Monte Carlo testing and the fact that, when working in low-dimensional (2D) spaces, random weights can be close to the original weights and hence can often generate similar user layouts (in this case apparently about 25% of the time).


Figure 13. Scatter plot of deviation scores (a) equal weights (y-axis) versus estimation weights (x-axis) (b) random weights (y-axis) versus estimated weights (x-axis)

(a) Equal versus estimated

(b) Random versus estimated

We should add that an alternative control, or null hypothesis, to that of random weights is that of fixed equal weights α = {1/3, 1/3, 1/3}^T. This weighting scheme corresponds to the assumption that there are to be no preferential biases in the subspace of the features, that is, that they should all count equally in forming the final layout (the default PCA). But the fundamental premise of this chapter is that there is a changing or variable bias in the relative importance of the different features, as manifested by different user layouts and styles. In fact, if there were no bias in the weights (i.e., they were set equal), then no user modeling or adaptation would be necessary, since there would always be just one type or style of layout (the one resulting from equal weights). In order to understand this question fully, we compared the results of random weights versus equal weights (each compared to the advocated estimation framework). In an identical set of experiments, replacing the random weights for the comparison layouts with equal weights α = {1/3, 1/3, 1/3}^T, we found a similar distribution of similarity scores. In particular, since the goal is obtaining accurate 2D layouts where positional accuracy is critical, we looked at the resulting deviation for both random weights and equal weights versus the estimated weights. We carried out a large Monte Carlo experiment (10,000 trials), and Figure 13 shows the scatter plot of the deviation scores. Points above the diagonal (not shown) indicate a deviation performance worse than that of weight estimation. As can be seen, the test results are roughly comparable for equal and random weights. In Figure 13, we note that the α-estimation scheme yielded the smaller deviation 72.6% of the time compared to equal weights (as opposed to the 78.4% compared with random weights). We therefore note that the results and conclusions of these experiments are consistent despite the choice of equal or random controls, and that ultimately direct estimation from a user layout is best.


Figure 14. Comparison of the distribution of α-estimation versus nearest-neighbor deviation scores

Finally, we should note that all weighting schemes (random or not) define sensible or coherent layouts. The only difference is in the amount by which color, texture and structure are emphasized. Therefore, even random weights generate nice or pleasing layouts; that is, random weights do not generate random layouts. Another control, other than random (or equal) weights, is to compare the deviation of an α-estimation layout generator to a simple scheme which assigns each new image to the 2D location of its (unweighted or equally weighted) 37-dimensional nearest neighbor (NN) from the set of images previously laid out by the user. This control scheme essentially operates on the principle that new images should be positioned on screen at the same location as their nearest neighbors in the original 37-dimensional feature space (the default similarity measure in the absence of any prior bias), and thus essentially ignores the operating subspace defined by the user in a 2D layout. The NN placement scheme places the test picture, regardless of its similarity score, directly on top of whichever image currently on the table it is closest to. To do otherwise, for example to place it slightly shifted away, and so forth, would simply imply the existence of a nondefault smart projection function, which would defeat the purpose of this control. The point of this particular experiment is to compare our smart scheme with one which has no knowledge of preferential subspace weightings, and to see how this subsequently maps to (relative) position on the display. The idea behind this is that a dynamic user-centric display should adapt to varying levels of emphasis on color, texture, and structure. The distributions of the outcomes of this Monte Carlo simulation are shown in Figure 14, where we see that the layout deviation using α-estimation (red: μ = 0.9691, σ = 0.7776) was consistently lower, by almost an order of magnitude, than that of the nearest-neighbor layout approach (blue: μ = 7.5921, σ = 2.6410). We note that, despite the noncoincident overlap of the distributions' tails in Figure 14, in every one of the 1,000
random trials, the α-estimation deviation score was found to be smaller than that of the nearest neighbor (a key fact not visible in such a plot).
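For concreteness, the nearest-neighbor control itself is only a few lines. The sketch below is a hypothetical illustration (the array shapes and the use of an unweighted Euclidean distance over the full 37-dimensional feature vector are assumptions consistent with the description above): each test image simply inherits the 2D screen position of its closest already-placed neighbor.

```python
import numpy as np

def nn_placement(test_features, table_features, table_positions):
    """Nearest-neighbor placement control, as described in the text.

    test_features  : (M, 37) features of the new images to be placed.
    table_features : (N, 37) features of images already laid out by the user.
    table_positions: (N, 2)  their 2D screen coordinates.
    Returns (M, 2) positions: each test image is put directly on top of its
    nearest neighbor in the unweighted 37-dimensional feature space.
    """
    # Pairwise squared Euclidean distances between test and table images
    diffs = test_features[:, None, :] - table_features[None, :, :]
    d2 = np.sum(diffs ** 2, axis=-1)
    nearest = np.argmin(d2, axis=1)          # index of the closest laid-out image
    return table_positions[nearest]
```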

USER PREFERENCE STUDY


In addition to the computer-generated simulations, we have in fact conducted a preliminary user study that has also demonstrated the superior performance of α-estimation over random feature weighting used as a control. The goal was to test whether the estimated feature weights would generate a better layout on a new but similar set of images than random weightings (used as control). The user interface is shown in Figure 15, where the top panel is a coherent layout generated by a random α on a reference image set. From this layout, an estimate of α was computed and used to redo the layout. A layout generated according to random weights was also generated and used as a control. These two layouts were then displayed in the bottom panels with randomized (A vs. B) labels (in order to remove any bias in the presentation). The user's task was to select which layout (A or B) was more similar to the reference layout in the top panel. In our experiment, six naïve users were instructed in the basic operation of the interface and given the following instructions: (1) both absolute and relative positions of images matter, (2) in general, similar images, like cars, tigers, and so forth, should cluster, and (3) the relative positions of the clusters also matter. Each user performed 50 forced-choice tests with no time limits. Each test set of 50 contained redundant (randomly recurring) tests in order to test the user's consistency. We specifically aimed at not priming the subjects with very detailed instructions (such as, "It's not valid to match a red car and a red flower because they are both red.").

Figure 15. α-estimation user test interface


Table 1. Results of user-preference study


          Preference for α estimates   Preference for random weights   User's consistency rate
User 1    90%                          10%                             100%
User 2    98%                           2%                              90%
User 3    98%                           2%                              90%
User 4    95%                           5%                             100%
User 5    98%                           2%                             100%
User 6    98%                           2%                             100%
Average   96%                           4%                              97%

In fact, the naïve test subjects were told nothing at all about the three feature types (color, texture, structure), the associated weights α or, obviously, the α-estimation technique. In this regard, the paucity of the instructions was entirely intentional: whatever mental grouping seemed valid to them was the key. In fact, this very same flexible association on the part of the user is what was specifically tested for in the consistency part of the study. Table 1 shows the results of this user study. The average preference indicated for the α-estimation-based layout was found to be 96%, and the average consistency rate of a user was 97%. We note that the α-estimation method of generating a layout in a similar style to the reference was consistently favored by the users. A similar experimental study has shown this to also be true even if the test layouts consist of different images than those used in the reference layout (i.e., similar but not identical images from the same categories or classes).

DISCUSSIONS AND FUTURE WORK


We have designed our system with general CBIR in mind, but more specifically for personalized photo collections. An optimized content-based visualization technique is proposed to generate a 2D display of the retrieved images for content-based image retrieval. We believe that both the computational results and the pilot user study support our claims of a more perceptually intuitive and informative visualization engine that not only provides a better understanding of query retrievals but also aids in forming new queries. The proposed content-based visualization method can easily be applied to project the images from the high-dimensional feature space to a 3D space for more advanced visualization and navigation. Features can be multimodal: individual visual features (for example, color alone), audio features, semantic features (for example, keywords), or any combination of the above. The proposed layout optimization technique is also quite general and can be applied to avoid overlapping of any type of images, windows, frames or boxes. The PDH project is at an initial stage. We have just begun our work on both the user interface design and the photo visualization and layout algorithms. The final visualization and retrieval interface can be displayed on a computer screen, large panel projection
screens, or, for example, on embedded tabletop devices (Shen et al., 2001; Shen et al., 2003) designed specifically for purposes of storytelling or multiperson collaborative exploration of large image libraries. Many interesting questions remain for our future research in the area of content-based information visualization and retrieval. The next task is to carry out an extended user-modeling study by having our system learn the feature weights from various sample layouts provided by the user. We have already developed a framework to incorporate visual features with semantic labels for both retrieval and layout. Another challenging area is automatic summarization and display of large image collections. Since summarization is implicitly defined by user preference, α-estimation for user modeling will play a key role in this and other high-level tasks where context is defined by the user. Finally, incorporation of relevance feedback for content-based image retrieval based on the visualization of the optimized PCA Splat seems very intuitive and is currently being explored. A dynamic user-modeling technique, in which the user manually groups the relevant images together at each relevance feedback step, will be proposed.

ACKNOWLEDGMENTS
This work was supported in part by Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, and National Science Foundation Grant EIA 99-75019.

REFERENCES
Balabanovic, M., Chu, L., & Wolff, G. (2000). Storytelling with digital photographs. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, The Hague, The Netherlands (pp. 564-571).
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
Brand, M. (2003). Charting a manifold. Mitsubishi Electric Research Laboratories (MERL), TR2003-13.
Dietz, P., & Leigh, D. (2001). DiamondTouch: A multi-user touch technology. Proceedings of the 14th ACM Symposium on User Interface Software and Technology, Orlando, Florida (pp. 219-226).
Horoike, A., & Musha, Y. (2000). Similarity-based image retrieval system with 3D visualization. Proceedings of IEEE International Conference on Multimedia and Expo, New York, New York (Vol. 2, pp. 769-772).
Jolliffe, I. T. (1996). Principal component analysis. New York: Springer-Verlag.
Kang, H., & Shneiderman, B. (2000). Visualization methods for personal photo collections: Browsing and searching in the PhotoFinder. Proceedings of IEEE International Conference on Multimedia and Expo, New York, New York.
Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247), 335-341.
Moghaddam, B., Tian, Q., & Huang, T. S. (2001). Spatial visualization for content-based image retrieval. Proceedings of IEEE International Conference on Multimedia and Expo, Tokyo, Japan.
Moghaddam, B., Tian, Q., Lesh, N., Shen, C., & Huang, T. S. (2002). PDH: A human-centric interface for image libraries. Proceedings of IEEE International Conference on Multimedia and Expo, Lausanne, Switzerland (Vol. 1, pp. 901-904).
Moghaddam, B., Tian, Q., Lesh, N., Shen, C., & Huang, T. S. (2004). Visualization and user-modeling for browsing personal photo libraries. International Journal of Computer Vision, Special Issue on Content-Based Image Retrieval, 56(1-2), 109-130.
Nakazato, M., & Huang, T. S. (2001). 3D MARS: Immersive virtual reality for content-based image retrieval. Proceedings of IEEE International Conference on Multimedia and Expo, Tokyo, Japan.
Popescu, M., & Gader, P. (1998). Image content retrieval from image databases using feature integration by Choquet integral. Proceedings of SPIE Conference on Storage and Retrieval for Image and Video Databases VII, San Jose, California.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
Rubner, Y. (1999). Perceptual metrics for image database navigation. Doctoral dissertation, Stanford University.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (1999). Similarity measures. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(9), 871-883.
Santini, S., & Jain, R. (2000, July-December). Integrated browsing and querying for image databases. IEEE Multimedia Magazine, 26-39.
Shen, C., Lesh, N., & Vernier, F. (2003). Personal digital historian: Story sharing around the table. ACM Interactions, March/April (also MERL TR2003-04).
Shen, C., Lesh, N., Moghaddam, B., Beardsley, P., & Bardsley, R. (2001). Personal digital historian: User interface design. Proceedings of Extended Abstract of SIGCHI Conference on Human Factors in Computing Systems, Seattle, Washington (pp. 29-30).
Shen, C., Lesh, N., Vernier, F., Forlines, C., & Frost, J. (2002). Sharing and building digital group histories. ACM Conference on Computer Supported Cooperative Work.
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A., & Jain, R. (2000). Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1349-1380.
Smith, J. R., & Chang, S. F. (1994). Transform features for texture classification and discrimination in large image database. Proceedings of IEEE International Conference on Image Processing, Austin, TX.
Squire, D. M., Müller, H., & Müller, W. (1999). Improving response time by search pruning in a content-based image retrieval system using inverted file techniques. Proceedings of IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL), Fort Collins, CO.
Stricker, M., & Orengo, M. (1995). Similarity of color images. Proceedings of SPIE Storage and Retrieval for Image and Video Databases, San Diego, CA.
Swets, D., & Weng, J. (1999). Hierarchical discriminant analysis for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), 396-401.
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319-2323.
Tian, Q., Moghaddam, B., & Huang, T. S. (2001). Display optimization for image browsing. The Second International Workshop on Multimedia Databases and Image Communications, Amalfi, Italy (pp. 167-173).
Tian, Q., Moghaddam, B., & Huang, T. S. (2002). Visualization, estimation and user-modeling for interactive browsing of image libraries. International Conference on Image and Video Retrieval, London (pp. 7-16).
Torgerson, W. S. (1998). Theory and methods of scaling. New York: John Wiley & Sons.
Vernier, F., Lesh, N., & Shen, C. (2002). Visualization techniques for circular tabletop interface. Proceedings of Advanced Visual Interfaces (AVI), Trento, Italy (pp. 257-266).
Zhou, S. X., Rui, Y., & Huang, T. S. (1999). Water-filling algorithm: A novel way for image feature extraction based on edge maps. Proceedings of IEEE International Conference on Image Processing, Kobe, Japan.
Zwillinger, D. (Ed.) (1995). Affine transformations, §4.3.2 in CRC standard mathematical tables and formulae (pp. 265-266). Boca Raton, FL: CRC Press.


APPENDIX A
Let $P_i = p^{(i)}(x,y) = \begin{bmatrix} x_i \\ y_i \end{bmatrix}$ and $\hat{P}_i = \hat{p}^{(i)}(x,y) = \begin{bmatrix} \hat{x}_i \\ \hat{y}_i \end{bmatrix}$, and Equation (15) is rewritten as

$$J = \sum_{i=1}^{N} \| P_i - \hat{P}_i \|^2 \tag{A.1}$$

Let $X_i$ be the column feature vector of the $i$th image, where

$$X_i = \begin{bmatrix} X_c^{(i)} \\ X_t^{(i)} \\ X_s^{(i)} \end{bmatrix}, \qquad i = 1, \ldots, N.$$

$X_c^{(i)}$, $X_t^{(i)}$ and $X_s^{(i)}$ are the corresponding color, texture and structure feature vectors of the $i$th image, and their lengths are $L_c$, $L_t$ and $L_s$, respectively. Let

$$X'_i = \begin{bmatrix} \alpha_c X_c^{(i)} \\ \alpha_t X_t^{(i)} \\ \alpha_s X_s^{(i)} \end{bmatrix}$$

be the weighted high-dimensional feature vector. These weights $\alpha_c$, $\alpha_t$, $\alpha_s$ are constrained such that they always sum to 1.

$\hat{P}_i$ is estimated by linearly projecting the weighted high-dimensional features to 2D. Let $X' = [X'_1, X'_2, \ldots, X'_N]$; it is an $L \times N$ matrix, where $L = L_c + L_t + L_s$. $\hat{P}_i$ is estimated by

$$\hat{P}_i = U^T (X'_i - X_m), \qquad i = 1, \ldots, N \tag{A.2}$$

where $U$ is an $L \times 2$ projection matrix and $X_m$ is the $L \times 1$ mean column vector of the $X'_i$, $i = 1, \ldots, N$. Substituting $\hat{P}_i$ from Equation (A.2) into Equation (A.1), the problem is therefore one of seeking the optimal feature weights $\alpha$, projection matrix $U$, and column vector $X_m$ such that $J$ in Equation (A.3) is minimized, given $X_i$, $P_i$, $i = 1, \ldots, N$:

$$J = \sum_{i=1}^{N} \| U^T (X'_i - X_m) - P_i \|^2 \tag{A.3}$$

In practice, it is almost impossible to estimate the optimal $\alpha$, $U$ and $X_m$ simultaneously based on the limited available data $X_i$, $P_i$, $i = 1, \ldots, N$. We thus make some modifications. Instead of estimating $\alpha$, $U$ and $X_m$ simultaneously, we modify the estimation process to be a two-step re-estimation procedure. We first estimate the projection matrix $U$ and column vector $X_m$, then estimate the feature weight vector $\alpha$ based on the computed $U$ and $X_m$, and iterate until convergence.

Let $U^{(0)}$ be the eigenvectors corresponding to the largest two eigenvalues of the covariance matrix of $X$, where $X = [X_1, X_2, \ldots, X_N]$, and let $X_m^{(0)}$ be the mean vector of $X$. We have

$$P_i^{(0)} = (U^{(0)})^T (X_i - X_m^{(0)}) \tag{A.4}$$

$P_i^{(0)}$ is the projected 2D coordinate of the unweighted high-dimensional feature vector of the $i$th image. Ideally its target location is $P_i$. To consider the alignment correction, a rigid transform (Zwillinger, 1995) is applied:

$$\tilde{P}_i^{(0)} = A \, P_i^{(0)} + T \tag{A.5}$$

where $A$ is a $2 \times 2$ matrix and $T$ is a $2 \times 1$ vector. $A$ and $T$ are obtained by minimizing the $L_2$-norm of $P_i - \tilde{P}_i^{(0)}$. Therefore $J$ in Equation (A.3) is modified to

$$J = \sum_{i=1}^{N} \| A (U^{(0)})^T (X'_i - X_m^{(0)}) - (P_i - T) \|^2 \tag{A.6}$$

Letting $U^T = A (U^{(0)})^T$, $X_m = X_m^{(0)}$ and $\tilde{P}_i = P_i - T$, we still have the form of Equation (A.3). Let us rewrite

$$U^T = \begin{bmatrix} U_{11} & \cdots & U_{1(L_c+L_t+L_s)} \\ U_{21} & \cdots & U_{2(L_c+L_t+L_s)} \end{bmatrix}.$$

After some simplifications of Equation (A.3), we have

$$J = \sum_{i=1}^{N} \| \alpha_c A_i + \alpha_t B_i + \alpha_s C_i - D_i \|^2 \tag{A.7}$$

where

$$A_i = \begin{bmatrix} \sum_{k=1}^{L_c} U_{1k} X_c^{(k)}(i) \\ \sum_{k=1}^{L_c} U_{2k} X_c^{(k)}(i) \end{bmatrix}, \quad
B_i = \begin{bmatrix} \sum_{k=1}^{L_t} U_{1(k+L_c)} X_t^{(k)}(i) \\ \sum_{k=1}^{L_t} U_{2(k+L_c)} X_t^{(k)}(i) \end{bmatrix}, \quad
C_i = \begin{bmatrix} \sum_{k=1}^{L_s} U_{1(k+L_c+L_t)} X_s^{(k)}(i) \\ \sum_{k=1}^{L_s} U_{2(k+L_c+L_t)} X_s^{(k)}(i) \end{bmatrix}$$

and $D_i = U^T X_m + \tilde{P}_i$. $A_i$, $B_i$, $C_i$ and $D_i$ are $2 \times 1$ feature vectors, respectively. To minimize $J$, we take the partial derivatives of $J$ with respect to $\alpha_c$, $\alpha_t$, $\alpha_s$ and set them to zero:

$$\frac{\partial J}{\partial \alpha_c} = 0, \qquad \frac{\partial J}{\partial \alpha_t} = 0, \qquad \frac{\partial J}{\partial \alpha_s} = 0 \tag{A.8}$$

We thus have

$$E \alpha = f \tag{A.9}$$

where

$$E = \begin{bmatrix}
\sum_{i=1}^{N} A_i^T A_i & \sum_{i=1}^{N} A_i^T B_i & \sum_{i=1}^{N} A_i^T C_i \\
\sum_{i=1}^{N} B_i^T A_i & \sum_{i=1}^{N} B_i^T B_i & \sum_{i=1}^{N} B_i^T C_i \\
\sum_{i=1}^{N} C_i^T A_i & \sum_{i=1}^{N} C_i^T B_i & \sum_{i=1}^{N} C_i^T C_i
\end{bmatrix}, \qquad
f = \begin{bmatrix} \sum_{i=1}^{N} A_i^T D_i \\ \sum_{i=1}^{N} B_i^T D_i \\ \sum_{i=1}^{N} C_i^T D_i \end{bmatrix}$$

and $\alpha$ is obtained by solving the linear Equation (A.9).
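As a purely numerical illustration, the linear system (A.9) is easy to assemble and solve once $U$, $X_m$ and the alignment-corrected targets $\tilde{P}_i$ are available. The Python sketch below is not the authors' implementation; the function name, the final renormalization of the weights so that they sum to one, and the assumption that a single pass of the two-step procedure is being performed are choices made for the example.

```python
import numpy as np

def estimate_alpha(Xc, Xt, Xs, P_tilde, U, X_m):
    """Minimal sketch of solving Equation (A.9) for the feature weights.

    Xc, Xt, Xs : (N, Lc), (N, Lt), (N, Ls) color/texture/structure features.
    P_tilde    : (N, 2) target 2D positions (already alignment-corrected, P_i - T).
    U          : (L, 2) projection matrix, L = Lc + Lt + Ls.
    X_m        : (L,)  mean feature vector.
    """
    Lc, Lt = Xc.shape[1], Xt.shape[1]
    U1, U2, U3 = U[:Lc], U[Lc:Lc + Lt], U[Lc + Lt:]   # rows of U per feature block

    A = Xc @ U1                  # (N, 2): color contributions, the A_i of (A.7)
    B = Xt @ U2                  # (N, 2): texture contributions, B_i
    C = Xs @ U3                  # (N, 2): structure contributions, C_i
    D = (U.T @ X_m) + P_tilde    # (N, 2): D_i = U^T X_m + P~_i

    blocks = [A, B, C]
    E = np.array([[np.sum(u * v) for v in blocks] for u in blocks])  # 3x3 normal matrix
    f = np.array([np.sum(u * D) for u in blocks])                    # right-hand side

    alpha = np.linalg.solve(E, f)
    return alpha / alpha.sum()   # renormalize so the weights sum to 1
```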


Chapter 10

Multimedia Authoring: Human-Computer Partnership for Harvesting Metadata from the Right Sources

Brett Adams, Curtin University of Technology, Australia
Svetha Venkatesh, Curtin University of Technology, Australia

ABSTRACT

This chapter takes a look at the task of creating multimedia authoring tools for the amateur media creator, and the problems unique to the undertaking. It argues that a deep understanding of the media creation process, together with insight into the precise nature of the relative strengths of computers and users given the domain of application, is needed before this gap can be bridged by software technology. These issues are further demonstrated within the context of a novel media collection environment, including a real-world example of an occasion filmed in order to automatically create two movies of distinctly different styles. The authors hope that such tools will enable amateur videographers to produce technically polished and aesthetically effective media, regardless of their level of expertise.

INTRODUCTION
Accessibility to the means of authoring multimedia artifacts has expanded to envelop the majority of desktops and homes of the industrialized world in the last decade. Forrester Research predicts that by 2005, 92% of online consumers will create personal
multimedia content at least once a month (Casares et al., 2002). To take the crude analogy of the written word, we all now have the paper and pencils at hand, the means to author our masterpieces or simply communicate. Or rather, we would, if only we knew how to write. Simply adding erasers, coloured pencils, sharpeners, scissors, and other such items to the writing desk doesn't help us write War and Peace, or a Goosebumps novel, nor even a friendly epistle. Similarly, software tools that allow us to cut, copy, and paste video do not address the overarching difficulty of forming multimedia artifacts that effectively (and affectively) achieve the desired communication or aesthetic integrity. There is a large and diverse research community, utilizing techniques from a far-flung variety of fields including human-computer interaction, signal processing, linguistic analysis, computer graphics, video and image databases, information sciences and knowledge representation, computational media aesthetics, and so forth, which has grown up around this problem, offering solutions and identifying problems that are similarly varied. Issues and questions pertinent to the problem include, but are not limited to: What are we trying to help the user do? What is the user's role in the multimedia authoring process? Who should carry the burden for which parts of the media creation process? What is the nature and choice of metadata? What is the purpose and power of that metadata, and what does it enable? What are the computer and user roles in determining what to capture or generate?

The objective of this chapter is to emphasize the importance of clearly defining the domain and nature of the creative/authoring activities of the user that we are seeking to support with technology: the flow-on effects of this decision are vitally important to the whole authoring endeavour and impact all consequent stages of the process. Simply put, definition of the domain of application for our technology (the user, audience, means, mood, and so forth of the authoring situation) enables us to more precisely define the lack our technology is seeking to supply, and in turn the nature and extent of metadata or semantic information necessary to achieve the result, as well as the best way of going about getting it. We will use our existing media creation framework, aimed at amateur videographers, to help demonstrate the principles and possible implementations.

The structure of the remainder of the chapter is as follows: We first explore related work with reference to the traditional three-part media creation process. This alerts us to the relative density and location of research efforts, notes the importance of a holistic approach to the media creation process, and helps define the questions we need to answer when building our own authoring systems. We next examine those questions in more detail. In particular, we note the importance of defining the domain of the technology, which has flow-on implications regarding the nature of the gap that our technology is trying to close, and the best way of doing so. Finally, we present an example system in order to offer further insight into the issues discussed.

The objective of this chapter is to emphasize the importance of clearly defining the domain and nature of the creative/authoring activities of the user that we are seeking to support with technology: The flow on effects of this decision are vitally important to the whole authoring endeavour and impact all consequent stages of the process. Simply put, definition of the domain of application for our technology the user, audience, means, mood, and so forth of the authoring situation enables us to more precisely define the lack our technology is seeking to supply, and in turn the nature and extent of metadata or semantic information necessary to achieve the result, as well as the best way of going about getting it. We will use our existing media creation framework, aimed at amateur videographers, to help demonstrate the principles and possible implementations. The structure of the remainder of the chapter is as follows: We first explore related work with reference to the traditional three-part media creation process. This alerts us to the relative density and location of research efforts, notes the importance of a holistic approach to the media creation process, and helps define the questions we need to answer when building our own authoring systems. We next examine those questions arising in more detail. In particular, we note the importance of defining the domain of the technology, which has flow on implications regarding the nature of the gap that our technology is trying to close, and the best way of doing so. Finally, we present an example system in order to offer further insight into the issues discussed.

BACKGROUND
Computer technology, like any other technology, is applied to a problem in order to make the solution easier or possible. Questions that we should ask ourselves include:
What is the potential user trying to do? What is the lack that we need to supply? What materials or information do we need to achieve this? Let us consider solutions and ongoing research in strictly video-related endeavours, a subset of all multimedia authoring-related research, for the purpose of introducing some of the terms and issues pertinent to the more detailed discussion that follows. We will, at times, stray outside this to a consideration of related domains (e.g., authoring of computer-generated video) for the purpose of illuminating ideas and implications with the potential for beneficial cross-pollination. We will not be considering research aimed at abstracting media as opposed to authoring it (e.g., Informedia Video Skims (Wactlar et al., 1999) or the MoCA group's Video Abstracting (Pfeiffer & Effelsberg, 1997)), although it could be considered authoring in one sense.

Traditionally, the process of creating a finished video presentation contains something like the following three phases: preproduction, production and postproduction. For real video (as opposed to entirely computer-generated media), we could be more specific and label the phases Scripting/Storyboarding (both aural and visual), Capture, and Editing; that is, decide what footage to capture and how, capture the footage, and finally compose by selection, ordering and fitting together of footage, with all of the major and minor crosstalk and feedback present in any imperfect process. In abstract terms, the three phases partition the process into Planning, Execution and Polishing, and from this we can see that this partitioning is not the exclusive domain of multimedia authoring, but is indeed appropriate, even necessary, for any creative or communicative endeavour. Of course, the relative size of each stage, and the amount and nature of feedback or revisitation for each stage, will be different depending on the particular kind of multimedia being authored and the environment in which it is being created, but they nevertheless remain a useful partitioning of this process.1 Figure 1 is a depiction of the media creation workflow with the three phases noted. One way of classifying work in the area of multimedia authoring technology is to consider which stage(s) of this authoring process the work particularly targets or emphasizes. Irrespective of the precise domain of application, where do they perceive the most problematic lack to be?

Figure 1. Three authoring phases in relation to the professional film creation workflow
[Figure 1 depicts the chain Author (Planning) → Screenwriter → Director/Cameraman (Execution) → Editor (Polishing) → Viewer, with the intermediate artifacts Novel → Raw footage → Movie → Experience. Each role draws on creativity and style plus its own knowledge (language rules and genre conventions; adaptation constraints and screenplay rules; film grammar and cinematography; film grammar and montage; perception and other movies) and on the intent of the preceding role.]

Edit
We will start from the end, the editing phase, and work back toward the beginning. Historically, this seems to have received the most attention in the development of software solutions. Technologies to aid the low-level operations of movie editing are now so firmly established as to be the staples of commercial software. Software including iMovie, Pinnacle Studio, Power Director, Video Wave and a host of others provides the ability to cut and paste or drag and drop footage, and align video and audio tracks, usually by means of some flavour of timeline representation for the video under construction. Additionally, the user is able to create complex transitions between shots and add titles and credits. The claims about iMovie even run to it having single-handedly "made cinematographers out of parents, grandparents, students" (iMovie, 2003). Although the materials have changed somewhat (scissors and film to mouse and bytes), conceptually the film is still treated as an opaque entity, the operations blind to any aesthetic or content attributes intrinsic to the material under the scalpel. All essentially provide help for doing a mechanical process. They are the hammer and drill of the garage. As with the professional version, what makes for the quality of the final product is the skill of the hand guiding the scissors.

The next rung up the ladder of solution power sees a more complex approach to adding smarts to the editing phase. It is here that we still find a lot of interest in the research community. Girgensohn et al. (2000) describe their semiautomatic video editor, Hitchcock, which uses an automatic suitability metric based upon how erratic the camera work is deemed to be for a given piece of footage. They also employ a spring model, based upon the unsuitability metric, in order to provide a more global mechanism for manipulating the total output video length. There are a number of similar applications, both commercial and research, that follow this approach of bringing such smarts to the editing phase, including SILVER (Casares et al., 2002), muvee autoProducer and ACD Video Magic. Muvee autoProducer is particularly interesting in that it allows the user to specify a style, including such options as "Chaplinesque," "Fifties TV" and "Cinema." These presumably translate to allowable clip lengths, image filters and transitions, an example of broad genre conventions influencing automatic authoring, which are themselves the end product of cinesthetic considerations. It provides an extreme example of low user burden. All that is required of the user is that they provide a video file, a song/music selection, and select a style.

Roughly, these approaches all infer something about the suitability or desirability or otherwise of a given piece of film based on a mapping between that property and a low-level feature of the video signal. This knowledge is then used to relieve the user of the burden of actually having to carry out the mechanical operations of footage cut and paste described above. The foundations of these approaches are as strong or weak as the link between low-level feature and inferred cinematic property. They offer the equivalent of a spellchecker for our text. These approaches attempt to automatically generate simple metadata about video footage, a term which is becoming increasingly prominent. Up to this point we can observe that none of these approaches either demand or make use of information about footage related to its meaning, its semantics.
They might be able to gauge that a number of frames are the result of erratic camera work, but they
are not able to tell us that the object being filmed so poorly is the user's daughter, who also happened to be the subject of the preceding three shots. Lindley et al. (2001) present work on interactive and adaptive generation of news video presentations, using Rhetorical Structure Theory (RST) as an aid to construction. Video segments, if labeled with a rhetorical functional role, such as elaboration or motivation, may be combined automatically into a hierarchical structure of RST relations, and thence into one or more coherent linear or interactive video presentations by traversing that structure. Here the semantic information relates to rhetorical role, and the emphasis is on the coherency of the presentation. Lindley et al. (2001, p. 8) note that "content representation scheme[s] that can indicate subject matter and bibliographical material such as the sources and originating dates of the video contents of the database" is necessary supplemental metadata for their chosen genre of news presentation generation. Additionally, they suggest that narrative or associative/categorical techniques may help provide an algorithmic basis for sequencing material so as to avoid problems such as continuity within subtopics. This introduces the potential need for narrative theory or, more broadly, some sort of discoursive theory in addition to the necessary intrinsic (denotative and connotative) content semantic information.

Of interest is the work of Nack and Parkes (1995) and Nack (1996), who propose a film editing model and attempt to automatically compile humorous scenes from existing footage. Their system, AUTEUR, aims to achieve a video sequence that realises an overall thematic specification. They define the two prime problems of the video editing process: composing the film such that it is perceptible in its entirety, and in a manner that engages the viewer emotionally and intellectually. The presentation should be understandable and enthralling, or at least kind of interesting. In addition to its own logic of story generation, the editor draws upon a knowledge base which is labeled as containing "World, Common sense, Codes, Filmic representations, Individual." Such knowledge is obviously hard to come by, but illustrates the type of metadata that needs to be brought to bear upon the undertaking. An interesting tangent to this work is the search for systems that resolve the age-old dilemma of the interactive narrative. For example, see Skov and Andersen (2001), Lang (1999), and Sack and Davis (1994).

It is apparent that these last few approaches can be casually lumped together by virtue of their revolving around (a) some theory of presentation, be it narrative, rhetorical or whatever is appropriate for the particular domain of application, coupled with (b) some knowledge about the available or desirable raw material by which inferences relating to their function in terms of that theory may be made. It would also be fair to say that the tacit promise is: the more you tell me about the footage, the more I can do for you.

Capture
Capture refers to the process of actually creating the raw media. For a home movie, it means the time when the camera is being toted, gathering all of those waves and embarrassed smiles. Unlike the process of recording incident sound, where the creative input into the capture perhaps runs to adjusting the overall volume or frequency filter, there is a whole host of parameters that come into play, which the average home videographer is unaware of (due in part to the skill of the professional filmmaker, no
doubt; Walter Murch, a professional film editor, states that the best cut is an "invisible" one). Any given shot capture attempt, where a shot is defined to be the contiguous video captured between when the start and stop buttons are pressed for a recording, or its analog in simulated camera footage, be it computer generated or traditionally animated, veritably bristles with parameters: light, composition, camera and object motion, camera mounting, z-axis blocking, duration, focal length, and so forth all impact greatly on the final captured footage, its content and aesthetic potential.

So, given this difficulty for the amateur (or even professional) camera operator, what can we do? To belabour our writing analogy, what do we do when we recognize that the one holding the pen has little idea about what is required of them? We give them a form. Or we provide friendly staff to answer their queries or offer suggestions. There is research which focuses on the difficulties of this stage of the multimedia authoring process. Bobick and Pinhanez (1995) describe work focusing on smart cameras, able to follow simple framing requests originating with the director of a production within a constrained environment (their example is a cooking show). This handles the translation of cinematic directives to physical camera parameters, no mean contribution on its own, but is reliant on the presence of a knowledgeable director.

There also exists a large literature that addresses the problem of capturing shots in a virtual environment, all the more applicable in these days of purely computer-generated media. The system described by He et al. (1996) takes a description of the events taking place in the virtual world and uses simple film idioms, where an idiom might be an encoded rule as to how to capture a two-way conversation, to produce camera placement specifications in order to capture the action in a cinematically pleasing manner. Or see Tomlinson et al. (2000), who attempt automated cinematography via a camera-creature that maps viewed agent emotions to cinematic techniques that best express those emotions. These approaches, however, are limited to highly constrained environments, real or otherwise, which rules out the entire domain of the home movie.

Barry and Davenport (2003) describe interesting work aimed at transforming the role of the camera from tool to creative partner. Their approach aims to merge subject sense knowledge, everyday common sense knowledge stored in the Openmind Commonsense database, and formal sense knowledge, the sort of knowledge gleaned from practiced videographers, in order to provide on-the-spot shot suggestions. The aim is to help during the capture process such that the resulting raw footage has the potential to be sculpted into an engaging narrative come composition time. If taken, shot suggestions retain their own metadata about the given shot. The idea here is to influence the quality of the footage that will be presented to the editing phase for those tools to take advantage of because, from the point of view of the editor, garbage in, garbage out.

Planning
Well, if that old adage is just as applicable here, why not shift the quality control even farther upstream? Instead of the just-in-time (JIT) approach, why not plan for a good harvest of footage right from the beginning? There are some who take this approach. There are also tools for mocking up visualizations of the presentation in a variety of manifestations. The storyboard is a popular representation, a series of panels on
which sketches depicting important shots or scenes are arranged, and it has been a staple of professional filmmaking for a long time. Productions today often take advantage of three-dimensional (3D) modeling techniques to get an initial feel for appropriate shots and possible difficulties. Baecker et al. (1996) provide an example in their Movie [and lecture presentation] Authoring and Design (MAD) system. Aimed at a broad range of users, it uses a variety of metaphors and views, allowing top-down and bottom-up structuring of ideas, and even the ability to preview the production as it begins to take shape. They note that an 11-year-old girl was able to create a two-minute film about herself using a rough template for an autobiography provided by the system designers, and this without prior experience using the software. Bailey et al. (2001) present a multimedia storyboarding tool targeted at exploring numerous behavioral design ideas early in the development of an interactive multimedia application. One goal of the editor is to help authors determine narration and visual content length and synchronization in order to achieve a desirable pace to the presentation.

In the field of (semi)automated media production, Kennedy and Mercer (2001) state that "there is a rich environment for automated reasoning and planning about cinematographic knowledge" (p. 1), referring to the multitude of possibilities available to the cinematographer for mapping high-level concepts, such as mood, into decisions regarding which cinematic techniques to use. They present a semiautomated planning system that aids animators in presenting intentions via cinematographic techniques. Instead of limiting themselves to a specific cinematic technique, they operate at the meta level, focusing on animator intentions for each shot. The knowledge base that they refer to is part of the conventions of film making (e.g., see Arijon, 1976, or Monaco, 1981), including lighting, colour choice, framing, and pacing to enhance expressive power. This is an example of a planning tool that leverages a little knowledge about the content of the production. This technology, where applicable, moves the problem of getting decent footage to the editing phase one step earlier: it is no longer simply impromptu help at capture time, but involves prior cognition.

Holistic: Planning to Editing


If any one of the above parts of the media authoring process is able to result in better multimedia artifacts, then does it follow that considering the whole process within a single clearly defined framework will result in larger gains? Some work in the area appears to place certain demands on all phases of the media creation process, to be overtly conscious of the relative place and importance of each, resulting in interdependencies each with the other. Agamanolis and Bove (2003) present interesting work aimed at video productions that can re-edit themselves. Although their emphasis is on the re-editable aspect of the production, they are nevertheless decidedly whole-process conscious. Rather than the typical linear video stream produced by conventional systems, the content here is dynamic, able to be altered client-side in response to a viewers profile. Their system, VIPER, is designed to be a common framework, able to support responsive video applications of differing domain. The purpose of the production, for example, educational or home movie, dictates the nature and scope of manual annotations attached to the video
clips, which in turn support the editing model. Editing models are constituted by rules written in a computer language specifically for the production at hand. They leverage designer-declared, viewer-settable response variables, which are parameterizations that dictate the allowable final configurations of the presented video. An example might be a "tell me the same story faster" button.

Davis (2003) frames the major problems of multimedia creation to be solved as enabling mass customization of media presentations and making media creation more accessible for the average home user. He calls into question the tripartite media production process outlined above as being inappropriate to the defined goals. That issue aside for now, it is interesting to note the specific deficiencies that his Media Streams system seeks to address. In calling for a new paradigm for media creation, the needs illuminated include: (1) capture of guaranteed-quality reusable assets and rich metadata by means of an Active Capture model, and (2) the redeploying of (a) domain knowledge to the representation in software of media content and structure and the software functions that dictate recombination and adaptation of those artifacts, and (b) the creative roles to the designers of adaptive media templates, which are built of those functions in order to achieve a purpose whilst allowing the desired personalization of that message (or whatever). He uses two analogies to illustrate the way the structure is fixed in one sense whilst customisable in another: Lego provides a building block, a fixed interface, simple parts from which countless wholes may be made; this is analogous to syntagmatic substitution. Mad Libs, a game involving blind word substitution into an existing sentence template, is an example of paradigmatic substitution, where the user shapes the (probably nonsensical, but nevertheless amusing) meaning of the sentence within the existing syntactical bounds of the sentence. The title, "Editing out editing," alludes to the desire to enable the provider or user to push a button and have the media automatically assembled in a well-formed manner: movies as programs.

These approaches rely on the presence of strong threads running through the production process from beginning to end. The precise nature of those threads varies. It may be a clearly defined production purpose (this will be an educational presentation, describing the horrors of X), or a guaranteed chain of consistent annotation (e.g., content expressed reliably in terms of the ontology in play: this family member is present in this shot), or assumptions about the context of captured material and so forth, or a combination of these, but it is this type of long-range coherency and consistency of assumptions and information transmission that enables a quality media artifact in the final analysis, by whatever criteria quality is judged in the particular instance.

There is another aspect to the multimedia creation process, which we have neglected thus far. It has to do with reusing or repurposing media artifacts, and the corresponding phase might be called packaging for further use. The term reuse is getting a lot of press these days in connection with multimedia. It is a complex topic and we will not deal with it here, except to say that some of the issues that come into play have already cropped up in the preceding discussion.
For example, those systems that automatically allow for multiple versions of the same presentation contain a sort of reuse. In order to do that, our system needs to know something about the data and about the context into which we are seeking to insert it.

ISSUES IN DESIGNING MULTIMEDIA AUTHORING TOOLS


Defining Your Domain
The preceding, somewhat loosely grouped, treatment of research related to multimedia authoring serves to highlight an important issue, one so obviously necessary as to be often neglected: namely, definition of the domain of application of the ideas or system being propounded (do we assume our audience is well aware of it? ours is the most interesting after all, isn't it?). By domain is meant the complex of assumed context (target user, physical limitations, target audience(s), etc.) and the rules, conventions or proprieties operating (including all levels of genre as commonly understood, and possibly even intrinsic genre à la Hirsch (1967)), which together constitute the air in which the solution lives and becomes efficient in achieving its stated goals. Consider, just briefly, a few of the possibilities: Intended users can be reluctantly involved hobbyists, amateurs or professionals. Grasp of, and access to, technology may range from knowing where the record button is on the old low-resolution, mono camera, to being completely comfortable with multiple handheld devices and networks of remote computing power. Some things might come easily to the user, while others are a struggle. The setting might be business or pleasure, the environment solo or collaborative. The user might have a day to produce the multimedia artifact, or a year. The intended audience may be self, family and friends, the unknown interested, the unknown uninterested, or all of the above, each with his or her own constraints. That is not to say that each and every aspect must be explicitly enumerated and instantiated, nor that they should all be specified to a fine point. Genres and abstractions, catchalls, exist precisely because of their ability to specify ranges or sets that are more easily handled. In one sense, they are the result of a consideration of many of the above factors and serve to funnel all of those concerns into a set of manageable conventions. What is important, though, is that the bounds you are assuming are made explicit.

Defining the Gap


But why is domain important to know? Can't we forget issues of context and purpose (they are kind of hard to determine sometimes anyway) and concentrate on simple concretes, such as the type of data that we're dealing with? Video capture and manipulation: how hard can that be? Such a data-type-centric approach is appealing, but it doesn't help define the contours of the chief problem of technology highlighted above: the lack our technology is trying to supply. What is the nature of the gap that we are trying to bridge? Consider an example by way of illustration: In the editing room of a professional feature film, the editor often needs to locate a piece of footage known to exist to fill a need; the problem is retrieval, and the gap is information regarding the whereabouts of the desired artifact. For example, in making the movies of Lord of the Rings, literally hours of footage were chiseled to seconds in some scenes. For the computer-generated movie Final Fantasy, the problem was locating resources in the vast web of remote repositories, versions and stages of processing. Contrast that situation with a home videographer trying to assemble a
vacation movie from assembled clips. How much of what and which should go where? The problem in this case is which sequence makes for an interesting progression, and the corresponding gap is an understanding of film grammar and narrative principles (or some other theory of discourse structure). The difference is between finding a citation and knowing the principles of essay writing.

Closing the Gap


Having named the gap, there are, no doubt, a few possible solutions which offer themselves. In the case of the amateur videographer above, which we will take for our example from here on in, the solution might be to either educate the user or lower his expectations. (Education of expectations might be needful prior to education in the actual craft of film making, as sometimes the problem is that the user does not know what is possible!) Dismissing the latter as embarrassingly defeatist, we may set ourselves to educate the user. In the final analysis, this would result in the highest quality multimedia artifact, but, short of sending the user to film school, and noting that there is probably a reason that the user is an amateur in the first place ("I don't have the time or energy for a full-blown course on this stuff; no, not even a book"), what is possible?

What would we do for the writer of our analogy? We can provide a thesaurus and dictionary, or put squiggly lines under the offending text (there are way too many under these words). We could go a step further and provide grammar checking. By doing these things we are offering help regarding the isolated semantics (partial semantics really, as linguists tell us semantics come in sentences; Hirsch, 1967, p. 232: "The sentence is the fundamental unit of speech."), in the case of the spellcheck, dictionary, and thesaurus, and syntactical aid via the grammar checker. But this essentially doesn't help us to write our essay! How do we connect our islands of meaning into island chains, then into a cogent argument or message (call it discourse), and finally express that discourse well in the chosen medium and genre? Let us consider the problem as one of first formulating the discourse we wish to convey (e.g., this scene goes before this one and supports this idea), and secondly expressing that discourse in a surface manifestation (e.g., video) using the particular powers of the medium to heighten its effectiveness.2

Formulate the Discourse


Let us delve a level deeper. Continuing with the example of the amateur videographer, what can we fairly assume that he or she does know? Humans are good at knowing the content (this shot footage contains my son and me skiing) and, at least intuitively, the relationships among the parts (sons have only one father, and skiing can be fun and painful). This is precisely the sort of knowledge that we find difficult to extract from media, model and ply with computation. Capturing this sense common to us all, and the explosion of connotational relationships among its constituents, is an active and, to put it mildly, daunting area of research in the knowledge modeling community. Humans, on the other hand, do this well (is there another standard by which to make the comparison?). But it is the knitting together of these parts, film clips in this case, which is the first part of the media authoring process the amateur finds difficult. We can liken the selection and ordering of footage to the building of an argument. Knowledge about the isolated pieces (sons and fathers) isn't sufficient to help us form a reasoned chain of logic. We
might sense intuitively where we want to end up (an enjoyable home movie), but we don't know how to get there. Those very same pieces of knowledge (sons and fathers) do not tell us anything about how they should be combined when plucked from their semantic webs and put to the purpose of serving the argument. But this is the sort of thing we have had success at getting computers to do. Enumerating possibilities from a given set of rules and constraints is something that we find easy to express algorithmically.

The particular model that helps us generate our discourse, its level of complexity and emphasis, will vary. But deciding upon an appropriate model is the necessary first step. If the target is an entertaining home movie, one choice is some form of narrative model: it may be simple (resolutions follow climaxes, and so forth) or more involved, like Dramatica (www.dramatica.com), involving a highly developed view of story as argument, with its Story Mind and four throughlines (e.g., the Impact Character throughline, which represents an alternate approach to the main character). Following Dramatica's cues is meant to help develop stories without holes in the argument. If the target is verity, something like RST may be more appropriate, given its emphasis on determining an objective basis for the coherency of a document. The nature of its relations seems to be appropriate for a genre like news, which purports to deal in fact, and we have already seen Lindley et al. (2001) use it for this purpose.

Given that we have a discourse laid out for us in elements, how do we fill them in with content? Lang (1999) uses a generative grammar to produce terminals that are first-order predicate calculus schemas about events, states, goals and beliefs. The story logic has helped us to define what is needed at this abstract level, but how do we instantiate the terminal with a piece of living, breathing content that matches the abstract contract but is nuanced by the superior human world model?
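To make the point about rule-driven enumeration concrete, here is a small, purely illustrative Python sketch. The grammar, its symbols and the `expand` helper are invented for the example (they are not drawn from Dramatica, RST or Lang's system): given a toy narrative grammar in which, for instance, resolutions follow climaxes, the computer can enumerate candidate discourse skeletons that a human would then fill with content.

```python
from itertools import product

# A toy narrative grammar: each nonterminal maps to alternative expansions.
# Terminals (lowercase) are abstract discourse slots to be filled with clips.
GRAMMAR = {
    "STORY":      [["SETUP", "CONFLICT", "RESOLUTION"]],
    "SETUP":      [["introduce_characters"], ["introduce_characters", "establish_place"]],
    "CONFLICT":   [["build_tension", "CLIMAX"]],
    "CLIMAX":     [["climax"]],
    "RESOLUTION": [["resolution"], ["resolution", "epilogue"]],
}

def expand(symbol):
    """Enumerate all terminal sequences derivable from a grammar symbol."""
    if symbol not in GRAMMAR:            # terminal: a single discourse slot
        return [[symbol]]
    results = []
    for alternative in GRAMMAR[symbol]:
        # Expand each child symbol, then take every combination of the children
        child_expansions = [expand(child) for child in alternative]
        for combo in product(*child_expansions):
            results.append([slot for part in combo for slot in part])
    return results

for skeleton in expand("STORY"):
    print(" -> ".join(skeleton))
```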

Express the Discourse


In addition to not knowing how to build the discourse, the user doesn't understand how to express those discourse elements (now instantiated to content) using the medium of choice, film in this case. We desire to produce a surface manifestation that uses the particular properties of the medium and genre to heighten the effectiveness of the underlying discourse. Our story formulation might have generated a point of conflict. Fear can produce conflict, and perhaps we instantiate this with an eye to our child's first encounter with a swimming pool. But how should I film the child? One way of supporting the idea that the child is fearful is to use the well-known high camera angle, thus shrinking him in proportion to the apparently hostile environment and exacerbating that fear in the viewer's eyes. The amateur videographer cannot be expected to know this. We noted earlier that there are many parameters to be considered for a given video clip capture. However, there are conventions known to the educated user (i.e., professional filmmakers) that define appropriate parameterizations for given discourse-related goals: for example, conventions relating to emphasis, such as shot duration patterning and audio volume; to emotional state, such as colour atmosphere; to spatial location, such as focal depth and audio cues; and to temporal location, such as shot transition type and textual descriptions. In other words, there exists the possibility of encoding a mapping of discourse goals to cinematic parameters.
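As a purely hypothetical illustration of such an encoding (the goal names and parameter values below are invented for the example rather than taken from any film-grammar reference), a discourse goal could be looked up and translated into a bundle of capture and edit parameters:

```python
from dataclasses import dataclass

@dataclass
class ShotDirective:
    """A bundle of cinematic parameters suggested for one discourse goal."""
    camera_angle: str       # e.g. "high", "eye-level", "low"
    framing: str            # e.g. "close-up", "medium", "wide"
    max_shot_seconds: float
    colour_atmosphere: str  # e.g. "cool", "warm", "neutral"
    transition: str         # e.g. "cut", "dissolve", "fade"

# Hypothetical mapping from discourse-level goals to cinematic parameterizations.
DISCOURSE_TO_CINEMA = {
    "character_is_fearful": ShotDirective("high", "medium", 6.0, "cool", "cut"),
    "emphasize_moment":     ShotDirective("eye-level", "close-up", 3.0, "warm", "cut"),
    "establish_location":   ShotDirective("eye-level", "wide", 8.0, "neutral", "dissolve"),
    "reflective_ending":    ShotDirective("eye-level", "medium", 10.0, "warm", "fade"),
}

def directive_for(goal: str) -> ShotDirective:
    """Translate a discourse goal into concrete capture/edit parameters."""
    return DISCOURSE_TO_CINEMA[goal]

print(directive_for("character_is_fearful"))
```

A real system would, of course, derive such directives from codified film conventions and temper them by what the impromptu videographer can actually capture.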
Figure 2. Role of human and computer in the generation of quality media


[Figure 2 sketches the division of labour: the computer proposes the structure of the discourse; the human supplies content; the computer issues manifestation directives; the human produces the instantiated manifestation (constrained by the success of instantiation, i.e., the captured footage); and the computer assembles the final (re-)manifestation.]

Where would this leave us in relation to the respective roles of human and computer in the generation of quality media which communicates or entertains, and does so well? An important point is that the degree of freedom of the manifestation parameterization asked for (for example, give me a medium shot of your son next to his friend) is not unlimited, given the constraints that attend the home movie maker: the context may be impromptu, or the videographer may be purely an observer who cannot affect the environment being filmed (unlike, for example, automated text generation).

What about Metadata?


It could be argued that if the process under discussion proceeds from the conception of the multimedia creation to the capturing of all manifestation directives perfectly, there need not be any metadata attached to those manifestations. The job is done. But if our domain specifies that it is a particularly noisy process, liable to missed or skewed directive following (and this is the case with the amateur videographer), or if we desire the ability to make changes to the presentation at the discourse level after the capturing of the manifestation (or indeed we seek a new home for the media via reuse at a later date), then we need enough semantic information about the clips in order to reconnect them with the discourse logic and recompile them into new statements. What do we need to know about the captured media in order to generate a longer version of the holiday with the kids, if there is footage, for the grandparents, and a shorter version for our friends? Or maybe we want a more action-packed version for the kids themselves, and a more thoughtful version for the parents to watch.

The first point would be to not attempt to record everything. Metadata by definition is data about data. But where does it stop? Surely we could talk about data about data about data, ad infinitum. Who can circumnavigate this semantic web? This forces upon us the need for a relevant scope of metadata. We suggest that the domain and genre provide the scope and nature of the metadata required. Obviously, we must record metadata related to discoursive function, for example, this shot is part of the act climax and contains the protagonist. But we must also record manifestation-related metadata (e.g., this is a high-angle, medium shot of three actors). Indeed, these representations are needed in order to generate the manifestation directives in the first place, but also to regenerate them in reaction to changes at the discourse level. They should be as domain
because without precise terminology, appropriate representations and structures, we can only make imprecise generative statements about the media. For example, in the movie domain, representations such as shots, scenes, and elements of cinematography such as framing type and motion type are appropriate, as they form the vocabulary of film production, theory, and criticism. They must also provide a sound basis for inferring other, higher-order properties of the media, and at differing resolutions, depending upon whether more or less detail can be called for or reliably known. For example, representations like shot and motion allow us to infer knowledge of the film tempo (Adams et al., 2002), of which they are constituents. Certain types of inference may require a degree of orderliness in the representations used, such as that imparted by hierarchical systems of classification among parts: taxonomic, partonomic, ontological. For example, knowledge that a particular location forms the setting for a scene implies that it is always present, always part of the scene (even if it is not visible due to other filming parameters), whereas actors may enter and leave the scene. Another example: a partonomy of three-act narrative structure allows us to infer that any dramatic events within the first act are likely to carry the dramatic function of setup; that is, some aspect of the characters or situation is being introduced. Finally, some forethought as to how easy the user will find the metadata to understand and verify is pertinent also. Asking an amateur videographer to film a shot with a cluttered z-axis (a term associated with the placement of objects or actors along an imaginary line running from the camera's point of view into the scene) probably isn't all that wise. However, creative ways of conveying ideas not familiar to a user will be necessary in any case. A 3D mockup might help in this case. In short, the discussion could be reduced to one principle: Get humans to do what they do well and computers to do what they do well. Our multimedia authoring tool may well need to be a semiautomatic solution, depending on your domain, of course.
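As a sketch of what such a dual clip record, combining discourse-level and manifestation-level metadata, might contain, and of a higher-order inference built on top of it, consider the following. The field names and the simple tempo heuristic are illustrative assumptions, not the chapter's actual representations (for a principled treatment of tempo, see Adams et al., 2002):

```python
from dataclasses import dataclass, field

@dataclass
class ShotMetadata:
    # Discourse-level (discursive) metadata
    narrative_role: str                     # e.g., "act_climax", "setup"
    actors: list = field(default_factory=list)
    # Manifestation-level (cinematic) metadata
    framing: str = "medium"                 # e.g., "close_up", "medium", "long"
    angle: str = "eye_level"                # e.g., "high", "low", "oblique"
    motion: str = "static"                  # e.g., "pan", "track", "static"
    duration_s: float = 0.0

def rough_tempo(shots: list) -> float:
    """Crude stand-in for tempo: more motion and shorter shots => higher tempo."""
    if not shots:
        return 0.0
    motion_weight = sum(0.0 if s.motion == "static" else 1.0 for s in shots)
    avg_duration = sum(s.duration_s for s in shots) / len(shots)
    return (1.0 + motion_weight / len(shots)) / max(avg_duration, 0.1)
```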

EXAMPLE SOLUTION TO THE RAISED ISSUES


In this section, in order to further concretize the issues raised, we will consider a specific example of a multimedia authoring system which endeavours to address them.

Our Domain
The domain of this system is the amateur/home user seeking to make media to be shared with friends and family, with potentially enough inherent interest to be sharable with a wider audience for pleasure. The planning level is assumed to be anything up to last minute, with obvious benefits from more prior notice. Some mechanism for adapting to time constraints, such as "I only have an hour of filming left," is desirable. At present we assume only a single camera is available and, ideally, a handheld personal computer (PC). The assumed cinematographic skill of the user ranges from point-and-click upward. Scope for personal reuse and multiple views of the same discourse, parameterized by genre, is a desired goal. A generally impromptu context is assumed, with some ability to manipulate scene existents possible although not necessary.

The user in this context generally wants to communicate better with their home movies, and implicitly wants something a bit more like what can be seen on TV or at the movies. We can visualize what amateur videographers produce on a scale running from a largely unmediated record, easily obtained by simply pointing the camera at anything vaguely interesting to the user (what they have), to a movie, edited well, expressing a level of continuity and coherency, with a sense of narrative (implicit or even overt), and using the particulars of the film medium (camera angle, motion, focal distance, etc.) to heighten expression of content (what they want). It is instructive to consider the genres on this scale more closely.

Record: This is what we are calling the most unmediated of all footage which the home user is likely to collect. It may be a stationary camera capturing anything happening within its field of view (golf swing, party, etc.). We note that many of the low-level issues, such as adjustment for lighting conditions and focus, are dealt with automatically by the hardware. But that is all you get. That means the resulting footage, left as is, is only effective for a limited range of purposes, mainly as information, as hinted at by the label record. In other words, "this is what my golf swing looks like," where the purpose might be to locate problems. In the case of the party, the question might be "who was there?"

Moving photo-album: This is where the video camera is being used something like a still camera but with the added advantage of movement and sound. The user typically walks around and snaps content of interest. As with still images, he may more or less intuitively compose the scene for heightened mediation (for example, close-ups of the faces of kids having fun at the sea), but there is little thought of clip-to-clip continuity. The unit of coherence is the clip.

Thematic or revelatory narrative: Here is where we first find loose threads of coherence between clips/shots, and that coherence lies in an overarching theme or subject. In the case of thematic continuity, an example might be a montage sequence of a baby. By revelatory narrative, we mean one concentrated on simply observing the different aspects and nuances of a situation as it stands, unconcerned with a logical or emotional progression of any sort. There are ties that bind shots together which need to be observed, but they are by no means stringent.

Traditional narrative home movie: By traditional narrative we mean a story in the generally conceived sense, displaying a movement toward some sort of resolution, where "There is a sense of problem-solving, of things being worked out in some way, of a kind of ratiocinative or emotional teleology" (Chatman, 1978, p. 48). The units of semantics are larger than shots, which are subordinated to the larger structures of scenes and sequences. Greater demands are placed upon the humble shot, as it must now snap into scenes. The bristle of shot parameters (framing type, motion, angle) now feeds into threads, such as continuity, running between shots that must be observed, which makes shots more difficult to place. Therefore we need more forethought so that when we get to the editing stage we have the material we need.

How Do We Close the Gap?


How do we move the user and his authored media up this scale? What does the user lack in order to move up it? The gap is the understanding of discourse, in this case
narrative, and the transformation of discourse into well-formed surface manifestations, in this case the actual video footage with its myriad parameters. That is, we note that what got harder above was determining good sequences as well as the links we require between shots. Let's consider a solution aimed at helping our user achieve something like a traditional narrative in their home movies, and leave realization of the other genres for another time. Incidentally, there is a lot of support for the notion of utilizing narrative ideas as a means to improving the communicative and expressive properties of home movies. Schultz and Schultz (1972, p. 16) note that "The essential trouble with home movies ... is the lack of a message," and Beal (1974, p. 60) comments that the raison d'être of amateur film, the family film, needs, at least, the tenuous thread of a story. We need (a sketch of the first two requirements follows this list):

- A method of generating simple narratives, say three-act narratives built of turning points, climaxes, and resolutions, as used by many feature films, with reference to user-provided constraints (e.g., there are n actors).
- A method of transforming a generated story structure into manifestation directives, specifically shot directives, with reference to a user-parameterizable style or genre.
- A way for the user to verify narrative and cinematic metadata of captured footage which adds little extra burden to the authoring process.
- A method for detecting poorly captured shot directives, assessing their impact on local and global film aesthetics, such as continuity or tempo, and transformations to recover from this damage where possible, without requiring that the shot directives be refilmed.
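To make the first two requirements concrete, here is a deliberately simplified sketch of generating a three-act template and expanding its events into shot directives. The event names, the genre table, and the expansion rules are hypothetical stand-ins for the narrative templates and aesthetic structuralizing agents described below, not the system's actual representations:

```python
# Hypothetical three-act template generator and shot-directive expander.
def three_act_template(occasion: str, actors: list) -> list:
    """Return an ordered list of narrative events (a toy narrative template)."""
    protagonist = actors[0] if actors else "subject"
    return [
        {"act": 1, "event": "setup",         "subjects": actors,        "occasion": occasion},
        {"act": 2, "event": "turning_point", "subjects": [protagonist], "occasion": occasion},
        {"act": 3, "event": "climax",        "subjects": [protagonist], "occasion": occasion},
        {"act": 3, "event": "resolution",    "subjects": actors,        "occasion": occasion},
    ]

# Genre-dependent expansion of narrative events into shot directives.
GENRE_STYLE = {
    "action":      {"climax_shots": 4, "framing": "close_up"},
    "documentary": {"climax_shots": 2, "framing": "medium"},
}

def expand_to_shot_directives(events: list, genre: str) -> list:
    style = GENRE_STYLE.get(genre, GENRE_STYLE["documentary"])
    directives = []
    for ev in events:
        count = style["climax_shots"] if ev["event"] == "climax" else 1
        for _ in range(count):
            directives.append({
                "event": ev["event"],
                "subjects": ev["subjects"],
                "framing": style["framing"],
                "angle": "eye_level",
            })
    return directives

storyboard = expand_to_shot_directives(
    three_act_template("wedding", ["bride", "groom"]), genre="action")
```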

System Overview
We have implemented a complete video production system that attempts to achieve these goals, constituted by a storyboard, direct, and edit life cycle analogous to the professional film production model. Space permits only a high-level treatment of the system. The salient components of the framework are:

- a narrative template: an abstraction of an occasion, such as a wedding or birthday party, viewed in narrative terms (climaxes, etc.), which may be easily built for a given occasion or obtained from a preexisting library;
- a user-specified creative purpose, in terms of recognizable genres, such as action or documentary, which are in turn mapped to affective goals;
- a battery of aesthetic structuralizing agents that take up those goals and automatically produce a shooting-script or storyboard from the narrative template and user purpose;
- a directed interactive capture process guided by that storyboard, resulting in footage that may be automatically edited into a film or altered for affective impact.

Figure 3 to Figure 5 present different views of the media creation process. We will now discuss what happens at each stage of the workflow depicted in Figure 5. Note that, while similar to Figure 1, Figure 5 is the amateur workflow in this case.

Figure 3. Overview of media creation framework (storyboarding, directing, and editing stages; components include the shooting-scripter, the aesthetic structuralizers, a shot-to-footage map, a capture record, and a film assembler, operating over the narrative template, the storyboard, and the final movie)

Figure 4. Functional transformation view of media creation framework (roles of author, screenwriter, director/cameraman, and editor across the author, mediate, affect, capture, align, and redress stages; the media progress from a potential "video" through events or scenes and shot directives to raw footage, a "rough cut," and the movie)

Author: The purpose of the first stage is to create the abstract, media-nonspecific story for the occasion (wedding, party, anything) that is to be the object of the home movie. That is to say, the given occasion is separated into parts, selected and ordered, and thus made to form the content of a narrative. It culminates in a narrative template, which is the deliverable passed to the next stage. Events may have a relative importance attached to them, which may be used when time constraints come into play. Templates may be created by the user, either from scratch or through composition, specialization, or generalization of existing templates; built with the aid of a wizard using a rule-based engine or generative grammar seeded with user input; or else simply selected as is from a library.

Figure 5. Workflow of amateur media creation (planning, execution, and polishing phases involving the author, screenwriter, director/cameraman, editor, and viewer roles, which exchange story, cinesthetic, and cinematic metadata on the way from story idea through raw footage to the movie and the viewing experience)

The user is offered differing levels of input, allowing for his creativity or lack thereof. We have our discourse.

Mediate: The purpose of this stage is to apply or specialize the narrative template obtained in the author stage to our specific media and domain of the home movie. This stage encapsulates the knowledge required to manifest the abstract events of the discourse in a concrete surface manifestation, in this case the pixels and sound waves of video.

Affect: The purpose of this stage is to transform the initial media-specific directives produced by the mediate stage into directives that maintain correct or well-formed use of the film medium, such as observing good film conventions like continuity, and that also better utilize the particular expressive properties of film in relation to the story, such as raising the tempo toward a climax, and this with reference to the style or genre chosen by the user; for example, a higher tempo is allowed if the user wants an action flick. The end result is a storyboard of shot directives for the user to attempt to capture. Shot directives are the small circles below the larger scene squares of the storyboard in Figure 3. They can also be seen as small circles in the rectangular storyboards of Figure 4. The affect and mediate stages taken together achieve the transformation from story structure to surface manifestation.

Capture: The purpose of this stage is simply to realize all shot directives with actual footage. The user attempts to capture the shots in the storyboard. He is allowed to do this in any order, and may attempt a given shot directive any number of times if unhappy with it. A capture is deemed a success or failure with respect to the shot directive, consisting of all of its cinematic parameters. For example, a shot directive might require the user to capture the bride and groom in medium shot, at a different angle than the previous shot. Some parameters are harder to capture than others, and the level of difficulty may be thresholded by the user. But this does affect which metadata are attached to the footage when the user verifies it as a success. For example, if the user is currently not viewing the camera angle parameter of the shot directive, it is not marked as having been achieved in the given footage. This simple success/failure protocol with respect to a stated target avoids burdensome annotation but is deceptively powerful.

Align: The purpose of this stage is to check (where possible) whether the footage captured for a given shot directive has indeed been captured according to the directive and, in the case where more than the required duration has been captured, to select the optimal footage according to the shot directive. The shot directive, plus the user's verification of it as applying to the footage, provides a context that can be used to seed and improve algorithms which normally have low reliability in a vacuum, such as face recognition. It can also potentially point out inconsistencies in the user's understanding of the meaning of shot directive parameters. For example: "You consider these three shots to be close-ups, yet the face in the third is twice as large as the other two; do you understand what framing type is?"

Redress: Following the align stage, we may have affect goals, that is, metashot properties (a tempo ramp turned into a plateau, or broken continuity), gone awry due to failures in actual capture. Therefore, this stage attempts to achieve, or get closer to, the original affect goals using the realized shot directives. For example, can we borrow footage from another shot to restore the tempo ramp?
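The capture-time success/failure verification protocol can be pictured roughly as follows. The rule that only currently displayed parameters are marked as achieved follows the description above, while the function and parameter names are invented for illustration:

```python
# Rough sketch of the capture-time success/failure verification protocol.
def verify_capture(shot_directive: dict, displayed_params: set,
                   user_accepts: bool) -> dict:
    """Return the metadata attached to a clip after the user verifies a take.

    Only parameters the user is currently viewing (and has implicitly
    checked) are marked as achieved; the rest remain unknown.
    """
    if not user_accepts:
        return {"status": "failure", "achieved": {}}
    achieved = {name: value for name, value in shot_directive.items()
                if name in displayed_params}
    return {"status": "success", "achieved": achieved}

# e.g., a novice viewing only subject and framing while capturing:
clip_meta = verify_capture(
    {"subjects": ["bride", "groom"], "framing": "medium", "angle": "high"},
    displayed_params={"subjects", "framing"},
    user_accepts=True,
)
# clip_meta["achieved"] will not claim that the high angle was achieved.
```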

Discussion of Problems that Surfaced


We will now highlight two problems that directly impinge on the crucial issue of leveraging human knowledge to verify metadata in a manner which puts little extra burden on the user.

Misunderstanding of Directives
The unit of instruction for the user is the shot directive. Each shot directive consists of a number of cinematic primitives which the user is to attempt to achieve in the shot. The problem lies in the fact that even these terms are subject to misinterpretation by our average user, remembering that the user may have little grasp of cinematic concepts. The system has allowances for differing degrees of user comfort with cinematic directives, a requirement stemming from the definition of our target user as average, that is, variable in skill. Shot directive parameters may be thresholded in number and difficulty. For example, a novice might only want to know what to shoot and whether to use camera motion, whereas someone more comfortable with the camera might additionally want directives concerning angle and aspect, and perhaps even the motivation for the given shot in terms of high-level movie elements, such as tempo. But this still doesn't solve the problem of when a user thinks he understands a shot directive parameter, but in actual fact does not. In the thumbnails of Figure 6, taken from a recent home movie of a holiday in India built with the authoring system, we see that the same shot directive parameter (framing type, in this case calling for a close-up in two shots) has been shot once correctly as a close-up and then as something more like a medium shot. The incorrectly filmed shot will undoubtedly impact negatively on whatever aesthetic or larger scene orchestration goals for which it was intended. One possible solution to this problem would be to use a face detection algorithm to cross-check the user's understanding. The algorithm may be improved with reference to the other cinematic primitives of the shot directive in question (e.g., the subject is at an oblique angle) as well as footage resulting from similar shot directives.

Figure 6. Thumbnails of incorrectly and correctly shot close-ups, respectively

Obviously this solution is only applicable to shots where the subject is human. The goal is to detect this inconsistency and alert the user to the possibility that they have a misunderstanding about the specific parameter request. Of course, good visualization, 3D or otherwise, of what the system is requesting goes a long way in helping the user understand.
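A crude version of such a cross-check might compare the detected face size against what the directive's framing type implies. The thresholds below are illustrative guesses, and any face detector returning a bounding box height would do; nothing here is the system's actual algorithm:

```python
from typing import Optional

# Hypothetical framing-type cross-check based on relative face size.
FRAMING_FACE_RATIO = {      # face height as a fraction of frame height
    "close_up": (0.30, 1.00),
    "medium":   (0.10, 0.30),
    "long":     (0.00, 0.10),
}

def framing_consistent(directive_framing: str, face_height: float,
                       frame_height: float) -> bool:
    """Check whether the detected face size matches the requested framing."""
    low, high = FRAMING_FACE_RATIO[directive_framing]
    ratio = face_height / frame_height
    return low <= ratio < high

def check_user_understanding(directive_framing: str, face_heights: list,
                             frame_height: float) -> Optional[str]:
    """Return a gentle warning if verified shots contradict the directive."""
    mismatches = [h for h in face_heights
                  if not framing_consistent(directive_framing, h, frame_height)]
    if mismatches:
        return (f"{len(mismatches)} of {len(face_heights)} shots marked as "
                f"'{directive_framing}' look inconsistent; do you understand "
                f"what framing type is?")
    return None
```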

Imprecision in Metadata Verification


A second problem has to do with metadata imprecision. The system uses humans to verify content metadata, and that metadata in turn is coupled with the narrative metadata in order to enable automated (re)sequencing and editing of footage via inferences about high-level film elements (e.g., continuity, visual approach, movie tempo and rhythm). Some of these inferences rely on an assumed level of veracity in the metadata values. An example will help. Figure 7 contains thumbnails from three shots intended to serve the scene function of the familiar image. A familiar image shot is a view that is returned to again and again throughout a scene or section of a scene in order to restabilize the scene for the viewer after the new information of intervening shots.

Figure 7. Thumbnails from three shots intended to serve the scene function of Familiar Image

It cues the viewer back into the spatial layout of the scene, and provides something similar to the reiteration of points covered so far in an effective lecturing style. The two shots on the bottom are shot in a way that the intended familiar image function is achieved; they are similar enough visually. The first shot, however, is not. It was deemed close enough by the user, with reference to the shot directive, but in terms of achieving the familiar image function for which the footage is intended, it fails. In actual fact, the reason for the difference stemmed from an uncontrollable element: a crowd gathered at the first stall. Here, given the impromptu context of the amateur videographer, part of the solution lies in stressing the importance of the different shot directive parameters. There is provision in the system for ascribing differing levels of importance to shot parameters on a shot-by-shot basis. For a shot ultimately supposed to provide the familiar image function, this would amount to raising the importance of all parameters related to achieving a visual shot composition similar to another shot (aspect, angle, etc., included). This in turn requires that the system prioritize from the discourse level down: At this point, is the stabilizing function of the familiar image more important, impacting on the clarity of the presentation, or is precise subject matter more needful?
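One way to picture the per-shot importance weighting is as a weighted comparison of a captured shot against a reference shot. The weights, parameter names, and scoring rule below are illustrative assumptions rather than the system's actual mechanism:

```python
# Hypothetical importance-weighted check of the "familiar image" function.
def weighted_similarity(reference: dict, candidate: dict,
                        importance: dict) -> float:
    """Score how well a candidate shot matches a reference shot,
    weighting each cinematic parameter by its stated importance."""
    total = sum(importance.values()) or 1.0
    matched = sum(w for param, w in importance.items()
                  if candidate.get(param) == reference.get(param))
    return matched / total

reference_shot = {"framing": "long", "angle": "eye_level", "aspect": "frontal"}
# Importance raised for composition-related parameters of a familiar image shot.
importance = {"framing": 3.0, "angle": 2.0, "aspect": 2.0}

candidate = {"framing": "long", "angle": "eye_level", "aspect": "oblique"}
if weighted_similarity(reference_shot, candidate, importance) < 0.8:
    print("Shot may fail its familiar-image function; consider a retake.")
```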

FUTURE FOR MULTIMEDIA AUTHORING


If we had to guess at a single factor that will cause the greatest improvement of multimedia authoring technology, it would be the continued drawing together of different fields of knowledge over all of the phases of authoring, from purpose to repackaging. It should never be easy to say, "Not my problem." We have seen discourse theory, media aesthetics, and human-computer interaction issues, all the way through to content description standards (e.g., MPEG-7), impact authoring technology. But none of these fields has the ultimate answer, none is stagnant, and each must therefore continually be queried for new insights that may impact the authoring endeavour. We use the field of applied media aesthetics to structure our surface manifestations in order to achieve a certain aesthetic goal. A question we encountered when designing the system was: How do we resolve conflict among agents representative of different aesthetic elements (e.g., movie clarity and complexity) that desire mutually conflicting shot directive parameterizations? We prioritized in a linear fashion, first come first served, after first doing our best to separate the shot directive parameters that each agent was able to configure. Is this the only way to do it? Is it the best way? We can ask our friends in media aesthetics, and in so doing we might even benefit their field. It is interesting to note that Rhetorical Structure Theory was originally developed as part of studies of computer-based text generation, but now has a status in linguistics that is independent of its computational uses (Mann, 1999). Improvements in hardware, specifically camera technology, will hopefully allow us to provide one point of reference, the camera itself, for the storyboard and its shot directives, rather than the current situation which requires the videographer to have both the camera and a handheld device. This may seem a trivial addition, but it helps cross the burden line for the user, and is therefore very much part of the problem. Undoubtedly, technology that sees humans and computers doing what they each do best will be the most effective.
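The linear, first-come-first-served prioritization we adopted among aesthetic agents can be sketched as follows; the agent and parameter names are invented, and this is only one of several possible resolution strategies:

```python
# Sketch of linear, first-come-first-served conflict resolution among
# aesthetic agents that each want to set some shot directive parameters.
def resolve_shot_parameters(agent_requests: list) -> dict:
    """agent_requests: ordered list of (agent_name, {param: value}) pairs.
    Earlier agents win any conflict over the same parameter."""
    directive = {}
    for _agent, params in agent_requests:
        for param, value in params.items():
            directive.setdefault(param, value)  # first writer wins
    return directive

requests = [
    ("clarity_agent",    {"framing": "medium", "motion": "static"}),
    ("complexity_agent", {"framing": "close_up", "angle": "oblique"}),
]
print(resolve_shot_parameters(requests))
# -> framing from the clarity agent, angle from the complexity agent
```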

CONCLUSION
We have considered the problem of creating effective multimedia authoring tools. This need has been created by the increased ease with which we generate raw media artifacts (images, sound, video, text), a situation resulting from the growing power of multimedia-enabling hardware and software, coupled with an increasing desire by would-be authors to create and share their masterpieces. In surveying existing approaches to the problem, we considered the emphases of each in relation to the traditional process of media creation: planning, execution, and polishing. We finished with a treatment of some examples of research with a particular eye to the whole process. We then identified some of the key issues to be addressed in developing multimedia authoring tools. They include definition of the target domain, recognition of the nature of the gap our technology is trying to bridge, and the importance of considering both the deeper structures relating to content and how it is sequenced and the surface manifestations in media to which they give rise. Additionally, we highlighted the issue of deciding upon the scope and nature of metadata, and the question of how it is instantiated.
We then presented an implementation of a multimedia authoring system for building home movies, in order to demonstrate the issues raised. In simple terms, we found it effective to algorithmically construct the underlying discourse, humanly fill metadata, and algorithmically shape the raw media to the underlying discourse and given genre by means of that same metadata.

REFERENCES
Adams, B., Dorai, C., & Venkatesh, S. (2002). Towards automatic extraction of expressive elements from motion pictures: Tempo. IEEE Transactions on Multimedia, 4(4), 472-481.
Agamanolis, S., & Bove, V., Jr. (2003). Viper: A framework for responsive television. IEEE MultiMedia, 10(1), 88-98.
Arijon, D. (1976). Grammar of the film language. Silman-James Press.
Baecker, R., Rosenthal, A., Friedlander, N., Smith, E., & Cohen, A. (1996). A multimedia system for authoring motion pictures. In Proceedings of ACM Multimedia (pp. 31-42).
Bailey, B., Konstan, J., & Carlis, J. (2001). DEMAIS: Designing multimedia applications with interactive storyboards. In the Ninth ACM International Conference on Multimedia (pp. 241-250).
Barry, B., & Davenport, G. (2003). Documenting life: Videography and common sense. In the 2003 International Conference on Multimedia and Expo, Baltimore, MD.
Beal, J. (1974). Cine craft. London: Focal Press.
Bobick, A., & Pinhanez, C. (1995). Using approximate models as source of contextual information for vision processing. In Proceedings of the ICCV'95 Workshop on Context-Based Vision (pp. 13-21).
Casares, J., Myers, B., Long, A., Bhatnagar, R., Stevens, S., Dabbish, L., et al. (2002). Simplifying video editing using metadata. In Proceedings of Designing Interactive Systems (DIS 2002) (pp. 157-166).
Chatman, S. (1978). Story and discourse: Narrative structure in fiction and film. Ithaca, NY: Cornell University Press.
Davis, M. (2003). Editing out editing. IEEE Multimedia Magazine (Special Edition on Computational Media Aesthetics), 54-64. IEEE Computer Society.
Girgensohn, A., Boreczky, J., Chiu, P., Doherty, J., Foote, J., Golovchinsky, G., et al. (2000). A semi-automatic approach to home video editing. In Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology (pp. 81-89).
He, L.-W., Cohen, M., & Salesin, D. (1996). The virtual cinematographer: A paradigm for automatic real-time camera control and directing. Computer Graphics, 30(Annual Conference Series), 217-224.
Hirsch, E., Jr. (1967). Validity in interpretation. New Haven, CT: Yale University Press.
iMovie. (2003). The new iMovie: Video and audio snap into place [Brochure]. Apple.
Kennedy, K., & Mercer, R. E. (2001). Using cinematography knowledge to communicate animator intentions. In Proceedings of the First International Symposium on Smart Graphics, Hawthorne, New York (pp. 47-52).
Lang, R. (1999). A declarative model for simple narratives. In AAAI Fall Symposium on Narrative Intelligence (pp. 134-141).
Lindley, C. A., Davis, J., Nack, F., & Rutledge, L. (2001). The application of rhetorical structure theory to interactive news program generation from digital archives. CWI technical report INS-R0101.
Mann, B. (1999). An introduction to rhetorical structure theory (RST). Retrieved from http://www.sil.org/linguistics/rst/rintro99.htm
Monaco, J. (1981). How to read a film: The art, technology, language, history and theory of film and media. Oxford, UK: Oxford University Press.
Nack, F. (1996). AUTEUR: The application of video semantics and theme representation for automated film editing. Doctoral dissertation, Lancaster University, UK.
Nack, F., & Parkes, A. (1995). Auteur: The creation of humorous scenes using automated video editing. In IJCAI-95 Workshop on AI Entertainment and AI/Alife.
Pfeiffer, S., Lienhart, R., & Effelsberg, W. (1997). Video abstracting. Communications of the ACM, 40(12), 54-63.
Sack, W., & Davis, M. (1994). IDIC: Assembling video sequences from story plans and content annotations. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems (pp. 30-36).
Schultz, E., & Schultz, D. (1972). How to make exciting home movies and stop boring your friends and relatives. London: Robert Hale.
Skov, M., & Andersen, P. (2001). Designing interactive narratives. In COSIGN 2001 (pp. 59-66).
Tomlinson, B., Blumberg, B., & Nain, D. (2000). Expressive autonomous cinematography for interactive virtual environments. In Proceedings of the Fourth International Conference on Autonomous Agents (AGENTS 2000), Barcelona, Spain (pp. 317-324).
Wactlar, H., Christel, M., Gong, Y., & Hauptmann, A. (1999). Lessons learned from building a terabyte digital video library. IEEE Computer Magazine, 32, 66-73.

ENDNOTES
1. In the literary world, surveys turn up a remarkable variety of authoring styles and an interesting analogy for us, in that they, nevertheless, generally still evince these three distinct creative phases.
2. The reader might note the parallels in this view to the text planning stage and surface realisation stage of a natural language generator.


Chapter 11

MM4U: A Framework for Creating Personalized Multimedia Content

Ansgar Scherp, OFFIS Research Institute, Germany
Susanne Boll, University of Oldenburg, Germany

ABSTRACT

In the Internet age and with the advent of digital multimedia information, we succumb to the possibilities that the enchanting multimedia information seems to offer, but end up almost drowning in the multimedia information: too much information at the same time, so much information that is not suitable for the current situation of the user, too much time needed to find information that is really helpful. The multimedia material is there, but the issues of how the multimedia content is found, selected, assembled, and delivered such that it is most suitable for the user's interest and background, the user's preferred device, network connection, location, and many other settings are far from being solved. In this chapter, we are focusing on the aspect of how to assemble and deliver personalized multimedia content to the users. We present the requirements and solutions of multimedia content modeling and multimedia content authoring as we find them today. Looking at the specific demands of creating personalized multimedia content, we come to the conclusion that a dynamic authoring process is needed in which the individual multimedia content is created just in time for a specific user or user group. We designed and implemented an extensible software framework, MM4U (short for MultiMedia for you), which provides generic functionality for typical tasks of a dynamic multimedia content personalization process.
With such a framework at hand, an application developer can concentrate on creating personalized content in the specific domain and at the same time is relieved from the basic tasks of selecting, assembling, and delivering personalized multimedia content. We present the design of the MM4U framework in detail, with an emphasis on personalized multimedia composition, and illustrate the framework's usage in the context of our prototypical applications.

INTRODUCTION

Multimedia content today can be considered as the composition of different media elements, such as images and text, audio, and video, into an interactive multimedia presentation like a guided tour through our hometown Oldenburg. Features of such a presentation are typically the temporal arrangement of the media elements in the course of the presentation, the layout of the presentation, and its interaction features. Personalization of multimedia content means that the multimedia content is targeted at a specific person and reflects this person's individual context, specific background, interest, and knowledge, as well as the heterogeneous infrastructure of end devices to which the content is delivered and on which it is presented. The creation of personalized multimedia content means that for each intended context a custom presentation needs to be created. Hence, multimedia content personalization is the shift from one-size-fits-all to a very individual and personal one-to-one provision of multimedia content to the users. This means in the end that the multimedia content needs to be prepared for each individual user. However, if there are many different users that find themselves in very different contexts, it soon becomes obvious that a manual creation of different content for all the different user contexts is not feasible, let alone economical (see André & Rist, 1996). Instead, a dynamic, automated process of selecting and assembling personalized multimedia content depending on the user context seems to be reasonable.

The creation of multimedia content is typically subsumed under the notion of multimedia authoring. However, such authoring today is seen as the static creation of multimedia content. Authoring tools with graphical user interfaces (GUIs) allow us to manually create content that is targeted at a specific user group. If the content created is personalizable at all, then only within a very limited scope. First research approaches in the field of dynamic creation of personalized multimedia content are promising; however, they are often limited to certain aspects of the personalization of content to the individual user. Especially when the content personalization task is more complex, these systems need to employ additional programming. As we observe that programming is needed in many cases anyway, we follow this observation to its conclusion and propose MM4U (short for MultiMedia for you), a component-based, object-oriented software framework to support the software development process of multimedia content personalization applications. MM4U relieves application developers from general tasks in the context of multimedia content personalization and lets them concentrate on the application domain-specific tasks. The framework's components provide generic functionality for typical tasks of the multimedia content personalization process. The design of the framework is based on a comprehensive analysis of the related approaches in the fields of user profile modeling, media data modeling, multimedia composition, and multimedia presentation formats.
We identify the different tasks that arise in the context of creating personalized multimedia content. The different components of the framework support these different tasks for creating user-centric multimedia content: They integrate generic access to user profiles, media data, and associated meta data; provide support for personalized multimedia composition and layout; and create the context-aware multimedia presentations. With such a framework, the development of multimedia applications becomes easier and much more efficient for different users with their different (semantic) contexts. On the basis of the MM4U framework, we are currently developing two sample applications: a personalized multimedia sightseeing tour and a personalized multimedia sports news ticker. The experiences we gain from the development of these applications give us important feedback for the evaluation and continuous redesign of the framework.

The remainder of this chapter is organized as follows: To review the notion of multimedia content authoring, in Multimedia Content Authoring Today we present the requirements of multimedia content modeling and the authoring support we find today. Setting off from this, Dynamic Authoring of Personalized Content introduces the reader to the tasks of creating personalized multimedia content and explains why such content can be created only in a dynamic fashion. In Related Approaches, we address the related approaches we find in the field before we present the design of our MM4U framework in The Multimedia Personalization Framework section. As the personalized creation of multimedia content is a central aspect of the framework, Creating Personalized Multimedia Content presents the multimedia personalization features of the framework in detail. Impact of Personalization to the Development of Multimedia Applications shows how the framework supports application developers and multimedia authors in their effort to create personalized multimedia content. The implementation and first prototypes are presented in Implementation and Prototypical Applications before we come to our summary and conclusion in the final section.

MULTIMEDIA CONTENT AUTHORING TODAY


In this section, we introduce the reader to current notions and techniques of multimedia content modeling and multimedia content authoring. An understanding of requirements and approaches in modeling and authoring of multimedia content is a helpful prerequisite to our goal, the dynamic creation of multimedia content. For the modeling of multimedia content we present our notion of multimedia content, documents, and presentation and describe the central characteristics of typical multimedia document models in the first subsection. For the creation of multimedia content, we give a short overview of directions in multimedia content authoring today in the second subsection.

Multimedia Content
Multimedia content today is seen as the result of a composition of different media elements (media content) into a continuous and interactive multimedia presentation. Multimedia content builds on the modeling and representation of the different media elements that form the building bricks of the composition.
A multimedia document represents the composition of continuous and discrete media elements into a logically coherent multimedia unit. A multimedia document that is composed in advance of its rendering is called preorchestrated, in contrast to compositions that take place just before rendering, which are called live or on-the-fly. A multimedia document is an instantiation of a multimedia document model that provides the primitives to capture all aspects of a multimedia document. The power of the multimedia document model determines the degree of multimedia functionality that documents following the model can provide. Representatives of (abstract) multimedia document models in research can be found with CMIF (Bulterman et al., 1991), Madeus (Jourdan et al., 1998), the Amsterdam Hypermedia Model (Hardman, 1998; Hardman et al., 1994a), and ZYX (Boll & Klas, 2001).

A multimedia document format or multimedia presentation format determines the representation of a multimedia document for the document's exchange and rendering. Since every multimedia presentation format implicitly or explicitly follows a multimedia document model, it can also be seen as a proper means to serialize the multimedia document's representation for the purpose of exchange. Multimedia presentation formats can either be standardized, such as the W3C standard SMIL (Ayars et al., 2001), or proprietary, such as the widespread Shockwave file format (SWF) of Macromedia (Macromedia, 2004). A multimedia presentation is the rendering of a multimedia document. It comprises the continuous rendering of the document in the target environment, the (pre)loading of media data, realizing the temporal course, the temporal synchronization between continuous media streams, the adaptation to different or changing presentation conditions, and the interaction with the user.

Looking at the different models and formats we find, and also at the terminology in the related work, there is not necessarily a clear distinction between multimedia document models and multimedia presentation formats, nor between multimedia documents and multimedia presentations. In this chapter, we understand a multimedia document model as the definition of the abstract composition capabilities of the model; a multimedia document is an instance of this model. The term multimedia content or content representation is used to abstract from existing formats and models, and generally addresses the composition of different media elements into a coherent multimedia presentation. Independent of the actual document model or format chosen for the content, one can say that a multimedia content representation has to realize at least three central aspects: the temporal, spatial, and interactive characteristics of a multimedia presentation (Boll et al., 2000). However, as many of today's concrete multimedia presentation formats can be seen as representing both a document model and an exchange format for the final rendering of the document, we use these as an illustration of the central aspects of multimedia documents. We present an overview of these characteristics in the following listing; for a more detailed discussion of the characteristics of multimedia document models we refer the reader to Boll et al. (2000) and Boll and Klas (2001).

A temporal model describes the temporal dependencies between media elements of a multimedia document. With the temporal model, the temporal course, such as the parallel presentation of two videos or the end of a video presentation on a mouse-click event, can be described.
One can find four types of temporal models: point-based temporal models, interval-based temporal models (Little & Ghafoor, 1993; Allen, 1983),
enhanced interval-based temporal models that can handle time intervals of unknown duration (Duda & Keramane, 1995; Hirzalla et al., 1995; Wahl & Rothermel, 1994), event-based temporal models, and script-based realization of temporal relations. The multimedia presentation formats we find today realize different temporal models; for example, SMIL 1.0 (Bugaj et al., 1998) provides an interval-based temporal model only, while SMIL 2.0 (Ayars et al., 2001) also supports an event-based model.

For a multimedia document, not only the temporal synchronization of these elements is of interest but also their spatial positioning on the presentation medium, for example, a window, and possibly the spatial relationship to other visual media elements. The positioning of a visual media element in the multimedia presentation can be expressed by the use of a spatial model. With it one can, for example, place an image above a caption or define the overlapping of two visual media. Besides the arrangement of media elements in the presentation, the visual layout or design is also defined in the presentation. This can range from a simple setting for background colors and fonts up to complex visual designs and effects. In general, three approaches to spatial models can be distinguished: absolute positioning, directional relations (Papadias et al., 1995; Papadias & Sellis, 1994), and topological relations (Egenhofer & Franzosa, 1991). With absolute positioning we subsume both the placement of a media element at an absolute position with respect to the origin of the coordinate system and the placement at an absolute position relative to another media element. The absolute positioning of media elements can be found, for example, with Flash (Macromedia, 2004) and the Basic Language Profile of SMIL 2.0, whereas relative positioning is realized, for example, by SMIL 2.0 and SVG 1.2 (Andersson et al., 2004b).

A very distinct feature of a multimedia document model is the ability to specify user interaction in order to let a user choose between different presentation paths. Multimedia documents without user interaction are not very interesting, as the course of their presentation is exactly known in advance and, hence, could be recorded as a movie. With interaction models a user can, for example, select or repeat parts of presentations, speed up a movie presentation, or change the visual appearance. For the modeling of user interaction, one can identify at least three basic types of interaction: navigational interactions, design interactions, and movie interactions. Navigational interaction allows the selection of one out of many presentation paths and is supported by all the considered multimedia document models and presentation formats.
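As a toy illustration of how temporal, spatial, and interaction primitives might be captured together in a composition model (this is a generic sketch for the reader, not the ZYX model or MM4U's actual internal format), consider:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MediaElement:
    uri: str
    kind: str                       # "image", "video", "audio", "text"
    region: Optional[tuple] = None  # spatial model: (x, y, width, height)

@dataclass
class Composite:
    # Temporal model: interval-based "par"/"seq" composition
    operator: str                   # "par" or "seq"
    children: List[object] = field(default_factory=list)
    duration_s: Optional[float] = None

@dataclass
class Interaction:
    # Interaction model: navigational choice between presentation paths
    label: str
    alternatives: List[Composite] = field(default_factory=list)

# e.g., an image shown in parallel with a caption, followed by a user choice
intro = Composite("par", [
    MediaElement("church.jpg", "image", region=(0, 0, 320, 240)),
    MediaElement("caption.txt", "text", region=(0, 240, 320, 40)),
])
tour = Composite("seq", [intro, Interaction("continue?", alternatives=[])])
```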

Looking at existing multimedia document models and presentation formats both in industry and research, one can see that these aspects of multimedia content are implemented in two general ways: The standardized formats and research models typically implement these aspects in different variants in a structured (XML) fashion, as can be found with SMIL 2.0, HTML+TIME (Schmitz et al., 1998), SVG 1.2, Madeus, and ZYX. Proprietary approaches, however, represent or program these aspects in an adequate internal model, such as Macromedia's Shockwave format. Independent of the actual multimedia document model, support for the creation of these documents is needed: multimedia content authoring.

We will look at the approaches we find in the field of multimedia content authoring in the next section.

Multimedia Authoring
While multimedia content represents the composition of different media elements into a coherent multimedia presentation, multimedia content authoring is the process in which this presentation is actually created. This process involves parties from different fields, including media designers, computer scientists, and domain experts: Experts from the domain provide their knowledge in the field; this knowledge forms the input for the creation of a storyboard for the intended presentation. Such a storyboard often forms the basis on which creators and directors plan the implementation of the story with the respective media and with which writers, photographers, and camerapersons acquire the digital media content. Media designers edit and process the content for the targeted presentation. Finally, multimedia authors compose the preprocessed and prepared material into the final multimedia presentation. Even though we described this as a sequence of steps, the authoring process typically includes cycles. In addition, the expertise for some of the different tasks in the process can also be held by one single person.

In this chapter, we are focusing on the part of the multimedia content creation process in which the prepared material is actually assembled into the final multimedia presentation. This part is typically supported by professional multimedia development programs, so-called authoring tools or authoring software. Such tools allow the composition of media elements into an interactive multimedia presentation via a graphical user interface. The authoring tools we find here range from domain expert tools to general purpose authoring tools.

Domain expert tools hide as much as possible of the technical details of content authoring from the authors and let them concentrate on the actual creation of the multimedia content. The tools we find here are typically very specialized and targeted at a very specific domain. An example of such a tool has been developed in the context of our previous research project Cardio-OP (Klas et al., 1999) in the domain of cardiac surgery. The content created in this project is an interactive multimedia book about topics in the specialized domain of cardiac surgery. Within the project context, an easy-to-use authoring wizard was developed to allow medical doctors to easily create pages of a multimedia book in cardiac surgery. The Cardio-OP-Wizard guides the domain experts through the authoring process by a digital storyboard for a multimedia book on cardiac surgery. The wizard hides as much technical detail as possible.

On the other end of the spectrum of authoring tools we find highly generalized tools such as Macromedia Director (Macromedia, 2004). These tools are independent of the domain of the intended presentation and let the authors create very sophisticated multimedia presentations. However, the authors typically need to have high expertise in using the tool. Very often, programming in an integrated programming language is needed to achieve special effects or interaction patterns. Consequently, the multimedia authors need programming skills and, along with this, some experience in software development and software engineering.

Whereas a multimedia document model has to represent the different aspects of time, space, and interaction, multimedia authoring tools must allow the authors to actually assemble the multimedia content. However, the authors are normally experts from a specific domain. Consequently, the only authoring tools that are practicable for creating multimedia content for a specific domain are those that are highly specialized and easy to use.

DYNAMIC AUTHORING OF PERSONALIZED CONTENT


The authoring process described above so far represents a manual authoring of multimedia content, often with high effort and cost involved. Typically, the result is a multimedia presentation targeted at a certain user group in a specific technical context. However, the one-size-fits-all fashion of the multimedia content created does not necessarily satisfy different users' needs. Different users may have different preferences concerning the content and may also access the content over different networks and on different end devices. For wider applicability, the authored multimedia content needs to carry some alternatives that can be exploited to adapt the presentation to the specific preferences of the users and their technical settings.

Figure 1 shows an illustration of the variation possibilities that a simple personalized city guide application can possess. The root of the tree represents the multimedia presentation for the personalized city tour. If this presentation is intended for both a desktop PC and a PDA, this results in two variants of the presentation. If some tourists are then interested only in churches, museums, or palaces and would like to receive the content in either English or German, this already sums up to 12 variants. If the multimedia content should then be available in different presentation formats, the number of variation possibilities within a personalized city tour increases again. Even though different variants are not necessarily entirely different and may have overlapping content, the example is intended to illustrate that the flexibility of multimedia content to be personalized to different user contexts quickly leads to an explosion of different options. And still the content can only be personalized within the flexibility range that has been anchored in the content.

From our point of view, an efficient and competitive creation of personalized multimedia content can only come from a system approach that supports the dynamic authoring of personalized multimedia content. A dynamic creation of such content allows for a selection and composition of just those media elements that are targeted at the user's specific interests and preferences. Generally, dynamic authoring comprises the steps and tasks that occur also with static authoring, but with the difference that the creation process is postponed to the time when the targeted user context is known and the presentation is created for this specific context. To be able to efficiently create presentations for (m)any given contexts, a manual authoring of a presentation meeting the user needs is not an option; instead, a dynamic content creation is needed. As we look into the process of dynamic authoring of personalized multimedia content, it is apparent that this process involves different phases and tasks. We identify the central tasks in this process that need to be supported by a suitable solution for personalized content creation.
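The growth of variants is simply the product of the independent dimensions; a short enumeration makes the explosion explicit. The dimensions mirror the example above, and the concrete list of three presentation formats is an assumption matching the 36 variants of Figure 1:

```python
from itertools import product

devices   = ["Desktop PC", "PDA"]
interests = ["churches", "museums", "palaces"]
languages = ["English", "German"]
formats   = ["SMIL", "SVG", "HTML"]   # assumed format set

variants = list(product(devices, interests, languages, formats))
print(len(variants))   # 2 * 3 * 2 * 3 = 36 pre-produced presentations
```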

Figure 1. Example of the variation possibilities within a personalized city guide application (Desktop PC or Pocket PC; churches, museums, or palaces; English or German; and different presentation formats such as SMIL, SVG, and HTML, yielding 2, 6, 12, and finally 36 variants)
Figure 2 depicts the general process of creating personalized multimedia content. The core of this process is an application we call the personalization engine. The input parameters to this engine can be characterized by three groups: The first group of input parameters is the media elements with the associated meta data that constitute the content from which the personalized multimedia presentations are selected and assembled. The second group comprises the user's personal and technical context. The user profile includes information about, for example, the user's current task, the location and environment (like weather and loudness), his or her knowledge, goals, preferences and interests, abilities and disabilities, as well as demographic data. The technical context is described by the type of the user's end device and its hardware and software characteristics, for example the available amount of memory and the media players, as well as the possible network connections and input devices. The third group of input parameters influences the general structure of the resulting personalized multimedia presentation and subsumes other preferences a user could have for the multimedia presentation.

Within the personalization engine, these input parameters are now used to author the personalized multimedia presentation. First, the personalization engine exploits all available information about the user's context and his or her end device to select, by means of media meta data, those media elements that are of most relevance according to the user's interests and preferences and best meet the characteristics of the end device. In the next step, the selected media elements are assembled and arranged by the personalization engine, again with regard to the user profile information and the characteristics of the end device, into the personalized multimedia content, represented in an internal document model (Scherp & Boll, 2004b).

Figure 2. General process of personalizing multimedia content (media data and meta data, the user profile, the technical environment, the document structure, rules and constraints, and layout and style feed a "personalization engine" that performs context-dependent selection of multimedia content, context-dependent composition in an internal format, and transformation of the internal format to a concrete presentation format such as SMIL, SVG, or Flash, which is then presented on the end device)
This content is represented in an internal document model (Scherp & Boll, 2004b). The internal document model abstracts from the different characteristics of today's multimedia presentation formats and, hence, forms the greatest common denominator of these formats. Even though our abstract model does not reflect the fancy features of some of today's multimedia presentation formats, it supports the very central multimedia features of modeling time, space, and interaction. It is designed to be efficiently transformed into the concrete syntax of the different presentation formats. For the assembly, the personalization engine uses the parameters for document structure, the layout and style parameters, and other rules and constraints that describe the structure of the personalized multimedia presentation, to determine, among others, the temporal course and spatial layout of the presentation. The center of Figure 2 sketches this temporal and spatial arrangement of the selected media elements over time in a spatial layout following the document structure and other preferences. Only then, in the transformation phase, is the multimedia content in the internal document model transformed into a concrete presentation format. Finally, the newly generated personalized multimedia presentation is rendered and displayed by the actual end device.
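The following sketch illustrates this select, assemble, transform, and present flow as it could look in code. It is a minimal, hypothetical outline; the interface and method names are our own illustration and do not correspond to MM4U's actual API.

    import java.util.List;

    // Hypothetical sketch only: the type and method names are illustrative.
    interface MediaElement { String uri(); }
    interface UserProfile { }
    interface DeviceProfile { }
    interface InternalDocument { }   // format-independent internal document model

    interface MediaStore {
        // context-dependent selection of media elements via their meta data
        List<MediaElement> select(UserProfile user, DeviceProfile device);
    }

    interface FormatGenerator {
        // transformation of the internal model into a concrete format, e.g., SMIL or SVG
        String transform(InternalDocument document);
    }

    class PersonalizationEngine {
        private final MediaStore store;
        private final FormatGenerator generator;

        PersonalizationEngine(MediaStore store, FormatGenerator generator) {
            this.store = store;
            this.generator = generator;
        }

        // select -> assemble -> transform; rendering is left to the player on the end device
        String createPresentation(UserProfile user, DeviceProfile device) {
            List<MediaElement> media = store.select(user, device);
            InternalDocument document = assemble(media, user, device);
            return generator.transform(document);
        }

        private InternalDocument assemble(List<MediaElement> media, UserProfile user, DeviceProfile device) {
            // arrange the selected media in time and space; the composition operators
            // that would do this are discussed later in the chapter
            return new InternalDocument() { };
        }
    }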

RELATED APPROACHES
In this section we present the related approaches in the field of personalized multimedia content creation. We first discuss the creation of personalizable multimedia content with today's authoring environments before we come to research approaches that address a dynamic composition of adapted or personalized multimedia content. Multimedia authoring tools like Macromedia Director (Macromedia, 2004) today require high expertise from their users and create multimedia presentations that are targeted only at a specific user or user group.


Everything personalizable needs to be programmed or scripted within the tool's programming language. Early work in the field of creating advanced hypermedia and multimedia documents can be found, for example, with the Amsterdam Hypermedia Model (Hardman, 1998; Hardman et al., 1994b) and the authoring system CMIFed (van Rossum, 1993; Hardman et al., 1994a), as well as with the ZYX multimedia document model (Boll & Klas, 2001) and a domain-specific authoring wizard (Klas et al., 1999). In the field of standardized models, the declarative description of multimedia documents with SMIL allows for the specification of adaptive multimedia presentations by defining presentation alternatives using the switch element. A manual authoring of such documents that are adaptable to many different contexts is too complex; also, the existing authoring tools such as the GRiNS editor for SMIL from Oratrix (Oratrix, 2004) are still tedious to handle. Some SMIL tools provide support for the switch element to define presentation alternatives; a comfortable interface for editing the different alternatives for many different contexts, however, is not provided. Consequently, we have been working on an approach in which a multimedia document is authored for one general context and is then automatically enriched by the different presentation alternatives needed for the expected user contexts in which the document is to be viewed (Boll et al., 1999). However, this approach is reasonable only for a limited number of presentation alternatives and limited presentation complexity in general.

Approaches that dynamically create personalized content are typically found on the Web, for example, Amazon.com (Amazon, 1996-2004) or MyYahoo (Yahoo!, 2002). However, these systems remain text-centric and do not address the complex composition of media data in time and space into real multimedia presentations. On the pathway to an automatic generation of personalized multimedia presentations, we primarily find research approaches that address personalized media presentations only: For example, the home-video editor Hyper-Hitchcock (Girgensohn et al., 2003; Girgensohn et al., 2001) provides a preprocessing of a video such that users can interactively select clips to create their personal video summary. Other approaches create summaries of music or video (Kopf et al., 2004; Agnihotri et al., 2003). However, these systems provide intelligent and intuitive access to large sets of (continuous) media rather than a dynamic creation of individualized content. An approach that addresses personalization for videos can be found, for example, with IBM's Video Semantic Summarization System (IBM Corporation, 2004a), which, however, still concentrates on a single media type.

Towards personalized multimedia, we find interesting work in the area of adaptive hypermedia systems, which has been going on for quite some years now (Brusilovsky, 1996; Wu et al., 2001; De Bra et al., 1999a, 2000, 2002b; De Carolis et al., 1998, 1999). The adaptive hypermedia system AHA! (De Bra et al., 1999b, 2002a, 2003) is a prominent example here, which also addresses the authoring aspect (Stash & De Bra, 2003), for example, in adaptive educational hypermedia applications (Stash et al., 2004). However, though these and further approaches integrate media elements in their adaptive hypermedia presentations, synchronized multimedia presentations are not in their focus.
Personalized or adaptive user interfaces allow the navigation and access of information and services in a customized or personalized fashion. For example, work done in the area of personalized agents and avatars considers presentation generation exploiting natural language generation and visual media elements to animate the agents and avatars (de Rosis et al., 1999). These approaches address the human-computer interface; the general issue of dynamically creating arbitrary personalized multimedia content that meets the user's information needs is not in their research focus.

A very early approach towards the dynamic creation of multimedia content is the Coordinated Multimedia Explanation Testbed (COMET), which is based on an expert system and different knowledge databases and uses constraints and plans to actually generate the multimedia presentations (Elhadad et al., 1991; McKeown et al., 1993). Another interesting approach to automate the multimedia authoring process has been developed at the DFKI in Germany with the two knowledge-based systems WIP (Knowledge-based Presentation of Information) and PPP (Personalized Plan-based Presenter). WIP is a knowledge-based presentation system that automatically generates instructions for the maintenance of technical devices by plan generation and constraint solving. PPP enhances this system by providing a lifelike character to present the multimedia content and by considering the temporal order in which a user processes a presentation (André, 1996; André & Rist, 1995, 1996). Also a very interesting research approach towards the dynamic generation of multimedia presentations is the Cuypers system (van Ossenbruggen et al., 2000) developed at the CWI. This system employs constraints for the description of the intended multimedia presentation and logic programming for the generation of the multimedia document (CWI, 2004). The multimedia document group at INRIA in France developed, within the Opéra project, a generic architecture for the automated construction of multimedia presentations based on transformation sheets and constraints (Villard, 2001). This work is continued within the succeeding project Web, Accessibility, and Multimedia (WAM) with a focus on a negotiation and adaptation architecture for multimedia services for mobile devices (Lemlouma & Layaïda, 2003, 2004).

However, we find limitations with existing systems when it comes to their expressiveness and flexible support for personalized content creation. Many approaches for personalization are targeted at a specific application domain in which they provide a very specific content personalization task. The existing research solutions typically use a declarative description like rules, constraints, style sheets, configuration files, and the like to express the dynamic, personalized multimedia content creation. However, they can solve only those presentation generation problems that can be covered by such a declarative approach; whenever a complex and application-specific personalization generation task is required, the systems find their limit and need additional programming to solve the problem. Additionally, the approaches we find usually rely on fixed data models for describing user profiles, structural presentation constraints, technical infrastructure, rhetorical structure, and so forth, and use these data models as input to their personalization engine. The latter evaluates the input data, retrieves the most suitable content, and tries to compose the media as intelligently as possible into a coherent, aesthetic multimedia presentation. A change of the input data models, as well as an adaptation of the presentation generator to more complex presentation generation tasks, is difficult if not infeasible. Additionally, for these approaches the border between the declarative descriptions for describing content personalization constraints and the additional programming needed is not clear and differs from solution to solution. This leads us to the development of a software framework that supports the development of personalized multimedia applications.


MULTIMEDIA PERSONALIZATION FRAMEWORK


Most of the research approaches presented above apply to text-centered information only, are limited in their personalizability, or are targeted at very specific application domains. As discussed in the previous section, existing research solutions in the field of multimedia content personalization typically rely on declarative descriptions such as style sheets, transformation rules, presentation constraints, and configuration files, and reach their limits whenever a complex, application-specific personalization task requires additional programming. To provide application developers with general, domain-independent support for the creation of personalized multimedia content, we pursue a software engineering approach: the MM4U framework. With MM4U, we propose a component-based, object-oriented software framework that relieves application developers of the general tasks in the context of multimedia content personalization and lets them concentrate on the application-domain-specific tasks. It supports the dynamic generation of arbitrary personalized multimedia presentations and therewith provides substantial support for the development of personalized multimedia applications. The framework does not reinvent multimedia content creation but incorporates existing research in the field and can also be extended by domain- and application-specific solutions. In the following subsection, we identify the general design goals of this framework, derived from an extensive study of related work and our own experiences. We then present the general design of the MM4U framework and, in the last subsection, a detailed insight into the framework's layered architecture.

General Design Goals for the MM4U Framework


The overall goal of MM4U is to simplify and to reduce the costs of the development process of personalized multimedia applications. Therefore, the MM4U framework has to provide the developers with support for the different tasks of the multimedia personalization process as shown in Figure 2. These tasks comprise assistance for the access to media data and associated meta data as well as to user profile information and the technical characteristics of the end device. The framework must also provide for the selection and composition of media elements into a coherent multimedia presentation. Finally, the personalized multimedia content must be created for delivery and rendering on the user's end device. In regard to these different tasks, we conducted an extensive study of related work: In the area of user profile modeling we considered, among others, the Composite Capability/Preference Profile (Klyne et al., 2003), the FIPA Device Ontology Specification (Foundation for Intelligent Physical Agents, 2002), the User Agent Profile (Open Mobile Alliance, 2003), the Customer Profile Exchange (Bohrer & Holland, 2004), as well as Fink et al. (1997) and Chen & Kotz (2000). In regard to meta data modeling, we studied different approaches to modeling meta data and meta data standards for multimedia.

These include, for example, Dublin Core (Dublin Core Metadata Initiative, 1995-2003), the Dublin Core Extensions for Multimedia Objects (Hunter, 1999), the Resource Description Framework (Beckett & McBride, 2003), and the MPEG-7 multimedia content description standard (ISO/IEC JTC 1/SC 29/WG 11, 1999, 2001a-e). For multimedia composition we analyzed the features of multimedia document models, including SMIL (Ayars et al., 2001), SVG (Andersson et al., 2004b), Macromedia Flash (Macromedia, 2004), Madeus (Jourdan et al., 1998), and ZYX (Boll & Klas, 2001). For the presentation of multimedia content, we considered the respective multimedia presentation frameworks, including the Java Media Framework (Sun Microsystems, 2004), MET++ (Ackermann, 1996), and PREMO (Duke et al., 1999). Furthermore, we considered other existing systems and general approaches for creating personalized multimedia content, including the Cuypers engine (van Ossenbruggen et al., 2000) and the Standard Reference Model for Intelligent Multimedia Presentation Systems (Bordegoni et al., 1997). We also derived design requirements for the framework from first prototypes of personalized multimedia applications we developed in different fields, such as a personalized sightseeing tour through Vienna (Boll, 2003), a personalized mobile paper chase game (Boll et al., 2003), and a personalized multimedia music newsletter.

From the extensive study of related work and the first experiences and requirements we gained from our prototypical applications, we developed the individual layers of the framework. We also derived three general design goals for MM4U:

(1) The framework is to be designed such that it is independent of any special application domain, that is, it can be used to generate arbitrary personalized multimedia content. Therefore, it provides general multimedia composition and personalization functionality and is flexible enough to be adapted and extended concerning the particular requirements of the concrete personalization functionalities a personalized application needs.

(2) The access to user profile information and to media data with its associated meta data must be independent of the particular solutions for the storage, retrieval, and processing of such data. Rather, the framework should provide a unified interface for the access to existing solutions. With distinct interfaces for the access to user profile information and to media data with associated meta data, it is the framework's task to use and exploit existing (research) profile and media storage systems for the personalized multimedia content creation.

(3) The third design goal for the framework is what we call presentation independence. The framework is to be independent of, for example, the technical characteristics of the end devices, their network connection, and the different multimedia output formats that are available. This means that the framework can be used to generate equivalent multimedia content for the different users and output channels and their individual characteristics. This multichannel usage implies that the personalized multimedia content generation task is to be partitioned into a composition of the multimedia content in an internal representation format and its later transformation into arbitrary (preferably standardized) presentation formats that can be rendered and displayed by the end devices.

These general design goals have a crucial impact on the structure of the multimedia personalization framework, which we present in the following section.


General Design of the MM4U Framework


A software framework like MM4U is a semifinished software architecture, providing a software system as a generic application for a specific domain (Pree, 1995). The MM4U framework comprises components, which are bound together by their interaction (Szyperski et al., 2002), and realizes generic support for personalized multimedia applications. Each component is realized as an object-oriented framework and consists of a set of abstract and concrete classes. Depending on the usage of a framework, so-called white-box and black-box frameworks can be distinguished (respectively, white-box and gray-box reuse). A framework is used as a black-box if the concrete application that uses the framework adapts its functionality by different compositions of the framework's classes. In this case the concrete application uses only the built-in functionality of the framework, that is, those modules with which the framework is already equipped. In contrast, the functionality of a white-box framework is refined or extended by a concrete application by adding additional modules through inheritance of (abstract) classes. Between these two extremes, arbitrary shades of gray are possible (Szyperski et al., 2002).

The design of the MM4U framework lies somewhere in the middle between pure black-box and pure white-box. Being a domain-independent framework, MM4U needs to be configured and extended to meet the specific requirements of a concrete personalized multimedia application. The framework provides many modules that can be reused for different application areas (black-box usage), for example, for accessing media data and associated meta data and user profile information, and for generating the personalized multimedia content in a standardized output format. For very application-specific personalization functionality, the framework can be extended correspondingly (white-box usage). The usage of the MM4U framework by a concrete personalized multimedia application is illustrated schematically in Figure 3. The personalized multimedia application uses the functionality of the framework to create personalized multimedia content and integrates it in whatever application-dependent functionality is needed, either by using the already built-in functionality of the framework or by extending it for the specific requirements of the concrete personalized multimedia application.

With respect to the multimedia software development process, the MM4U framework assists the computer scientists during the design and implementation phase. It relieves them of the time-consuming multimedia content assembly task and lets them concentrate on the development of the actual application. The MM4U framework provides functionality for the single tasks of the personalization engine as described in the section on Dynamic Authoring of Personalized Content. It offers the computer scientists support for integrating and accessing user profile information and media data, selecting media elements according to the user's profile information, composing these elements into coherent multimedia content, and generating this content in standardized multimedia document formats to be presented on the user's end device. When designing a framework, the challenge is to identify the points where the framework should be flexible, that is, to identify the semantic aspects of the framework's application domain that have to be kept flexible.
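As a hedged illustration of the two reuse styles just described, the following sketch contrasts black-box usage (composing ready-made framework classes) with white-box usage (extending an abstract framework class). All class names are invented for this example and are not MM4U classes.

    // White-box reuse: the application refines an abstract framework class by inheritance.
    abstract class CompositionOperator {
        abstract void compose(Presentation target);
    }

    class CityTourOperator extends CompositionOperator {   // application-specific extension
        @Override
        void compose(Presentation target) {
            // domain-specific composition logic would go here
        }
    }

    // Black-box reuse: the application only combines modules the framework already ships with.
    class Presentation { }

    class SlideshowOperator extends CompositionOperator {  // assumed built-in operator
        @Override
        void compose(Presentation target) { /* built-in behaviour */ }
    }

    class BlackBoxUsage {
        Presentation build() {
            Presentation presentation = new Presentation();
            new SlideshowOperator().compose(presentation);  // used as-is, no subclassing
            return presentation;
        }
    }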
These points are the so-called hot spots and represent the points or sockets of the intended flexibility of a framework (Pree, 1995). Each hot spot constitutes a well-defined interface where proper modules can be plugged in.

Figure 3. Usage of the MM4U framework by a personalized multimedia application


[Figure 3: the personalized multimedia application, with its application-dependent functionality, builds on four framework hot spots: generation of the multimedia presentation, personalized multimedia composition, access to media objects and meta data, and access to user profile information.]

When designing the MM4U framework, we identified hot spots where adequate modules for supporting the personalization task can be plugged in that provide the required functionality. As depicted in Figure 3, the MM4U framework provides four types of such hot spots, where different types of modules can be plugged in. Each hot spot represents a particular task of the personalization process. The hot spots can be realized by plugging in a module that implements the hot spot's functionality for a concrete personalized multimedia application. These modules can be both application-dependent and application-independent. For example, the access to media data and associated meta data is not necessarily application-dependent, whereas the composition of personalized multimedia content can be heavily dependent on the concrete application. After the general design of the framework, we take a closer look at the concrete architecture of MM4U and its components in the next section.

Design of the Framework Layers


The different tasks of the multimedia personalization process are the access to user profile information and media data, the selection and composition of media elements into a coherent presentation, and the rendering and display of the multimedia presentation on the end device. To support these tasks, a layered architecture seems best suited for MM4U. The layered design of the framework is illustrated in Figure 4. Each layer provides modular support for one of the tasks of the multimedia personalization process. The access to user profile information and media data is realized by layers (1) and (2), followed by the two layers (3) and (4) in the middle for the composition of the multimedia presentation in an internal object-oriented representation and its later transformation into a concrete presentation output format. Finally, the top layer (5) realizes the rendering and display of the multimedia presentation on the end device. To be most flexible for the different requirements of concrete personalized multimedia applications, the framework's layers allow extending the functionality of MM4U by embedding additional modules, as indicated in the figure by the empty boxes with dots. In the following, the features of the framework are described along its different layers, starting from the bottom of the architecture and ending with the top layer.

Figure 4. Overview of the multimedia personalization framework MM4U


[Figure 4: the five layers of the framework, from top to bottom: (5) Multimedia Presentation; (4) Presentation Format Generators (e.g., SMIL 2.0, SMIL 2.0 BLP, SVG 1.2, Mobile SVG); (3) Multimedia Composition (e.g., Parallel, Sequential, Slideshow, Citytour); (2) User Profile Accessor and Media Pool Accessor; (1) User Profile Connectors (e.g., URI, CC/PP profile storage) and Media Data Connectors (e.g., URI, IR system).]

(1) Connectors: The User Profile Connectors and the Media Data Connectors bring the user profile data and media data into the framework. They integrate existing systems for user profile stores as well as media storage and retrieval solutions. As there are many different systems and formats available for user profile information, the User Profile Connectors abstract from the actual access to and retrieval of user profile information and provide a unified interface to the profile information. With this component, the different formats and structures of user profile models can be made accessible via a unified interface. For example, a flexible URIProfileConnector we developed for our demonstrator applications provides access to user profiles over the Internet. These user profiles are described as hierarchically ordered key-value pairs. This is a quite simple model but already powerful enough to allow effective pattern-matching queries on the user profiles (Chen & Kotz, 2000). However, as shown in Figure 4, a User Profile Connector for the access to, for example, a Composite Capability/Preference Profile (CC/PP) server could also be plugged into the framework. On the same level, the Media Data Connectors abstract, with a unified interface, from the access to media elements in the different media storage and retrieval solutions available today. The different systems for storage and content-based retrieval of media data are interfaced by this component. For example, the URIMediaConnector we developed for our demonstrator applications provides flexible access to media objects and their associated meta data from the Internet via the HTTP or FTP protocols. The meta data is stored in a single index file.


This index file describes not only the technical characteristics of the media elements and the location where to find them on the Internet, but also comprises additional information about them, for example, a short description of what is shown in a picture or keywords for which one can search. By analogy with the access to user profile information, another Media Data Connector plugged into the framework could provide access to other media and meta data sources, for example, an image retrieval (IR) system like IBM's QBIC (IBM Corporation, 2004b). The Media Data Connector supports the querying of media elements by the client application (client-pull) as well as the automatic notification of the personalized application when a new media object arises in the media database (server-push). The latter is required, for example, by the personalized multimedia sports news ticker (see the section about Sports4U), which is based on a multimedia event space (Boll & Westermann, 2003).

(2) Accessors: The User Profile Accessor and the Media Pool Accessor provide the internal data model of the user profiles and the media data information within the system. Via this layer, the user profile information and media data needed for the desired content personalization are accessible and processable for the application. The Connectors and Accessors are designed such that they do not reinvent existing systems for user modeling or multimedia content management. Rather, they provide a seamless integration of these systems by distinct interfaces and comprehensive data models. In addition, when a personalized multimedia application uses more than one user profile database or media database, the Accessor layer encapsulates the resources so that the access to them is transparent to the client application.

While the following layers (3) to (5) each constitute a single component within the MM4U framework, the Accessors layer and the Connectors layer do not. Instead, the left and right sides of layers (1) and (2), that is, the User Profile Accessor together with the User Profile Connectors as well as the Media Pool Accessor together with the Media Data Connectors, each form one component in MM4U.

(3) Multimedia Composition: The Multimedia Composition component comprises abstract operators in compliance with the composition capabilities of multimedia composition models like SMIL, Madeus, and ZYX, which provide complex multimedia composition functionality. It employs the data from the User Profile Accessor and the Media Pool Accessor for the multimedia composition task. The Multimedia Composition component is designed such that additional, possibly more complex or application-specific composition operators can be developed and seamlessly plugged into the framework. The result of the multimedia composition is an internal object-oriented representation of the personalized multimedia content that is independent of the different presentation formats.

(4) Presentation Format Generators: The Presentation Format Generators work on the internal object-oriented data model provided by the Multimedia Composition component and convert it into a standardized presentation format that can be displayed by the corresponding multimedia player on the client device. In contrast to the multimedia composition operators, the Presentation Format Generators are completely independent of the concrete application domain and only rely on the targeted output format.


In MM4U, we have already developed Presentation Format Generators for SMIL 2.0, the Basic Language Profile (BLP) of SMIL 2.0 for mobile devices (Ayars et al., 2001), SVG 1.2 and Mobile SVG 1.2 (Andersson et al., 2004a), comprising SVG Tiny for multimedia-ready mobile phones and SVG Basic for pocket computers like Personal Digital Assistants (PDAs) and Handheld Computers (HHCs), as well as HTML (Raggett et al., 1998). We are currently working on Presentation Format Generators for Macromedia Flash (Macromedia, 2004) and other multimedia document formats, including HTML+TIME, the 3GPP SMIL Language Profile (3rd Generation Partnership Project, 2003b), which is a subset of SMIL used for scene description within the Multimedia Messaging Service (MMS) interchange format (3rd Generation Partnership Project, 2003a), and XMT-Omega, a high-level abstraction of MPEG-4 based on SMIL (Kim et al., 2000).

(5) Multimedia Presentation: The Multimedia Presentation component on top of the framework realizes the interface for applications to actually play presentations in the different multimedia presentation formats. The goal here is to integrate existing presentation components for the common multimedia presentation formats like SMIL, SVG, or HTML+TIME, which the underlying Presentation Format Generators produce. The developers thus benefit from the fact that only players for standardized multimedia formats need to be installed on the user's end device and that they need not spend time and resources developing their own rendering and display engine for their personalized multimedia application.

The layered architecture of MM4U permits easy adaptation to the particular requirements that can occur in the development of personalized multimedia applications. Special user profile connectors as well as media database connectors can be embedded into the Connectors layer of the MM4U framework to integrate the most diverse and individual solutions for the storage, retrieval, and gathering of user profile information and media data. With the ability to extend the Multimedia Composition layer by complex and sophisticated composition operators, arbitrary personalization functionality can be added to the framework. The Presentation Format Generators component allows integrating any output format into the framework to support the most diverse multimedia players available for the different end devices. The personalized selection and composition of media elements and operators into a coherent multimedia presentation is the central task of the multimedia content creation process, which we present in more detail in the following section.
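To make the plug-in idea more tangible, the following sketch shows how connector and generator modules could be registered at the framework's hot spots. The interfaces and the registry class are assumptions made for this illustration and are not MM4U's actual classes.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative plug-in interfaces for the Connectors and Generators layers.
    interface UserProfileConnector { Map<String, String> loadProfile(String userId); }
    interface MediaDataConnector   { Iterable<String> queryMedia(Map<String, Object> context); }
    interface PresentationFormatGenerator { String generate(Object internalDocument); }

    // A simple registry standing in for the framework's hot spots.
    class FrameworkRegistry {
        private final Map<String, PresentationFormatGenerator> generators = new HashMap<>();
        private UserProfileConnector profileConnector;
        private MediaDataConnector mediaConnector;

        void plugProfileConnector(UserProfileConnector connector) { this.profileConnector = connector; }
        void plugMediaConnector(MediaDataConnector connector)     { this.mediaConnector = connector; }
        void plugGenerator(String format, PresentationFormatGenerator generator) {
            generators.put(format, generator);        // e.g., "SMIL 2.0", "SVG 1.2", "HTML"
        }

        PresentationFormatGenerator generatorFor(String format) {
            return generators.get(format);
        }
    }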

CREATING PERSONALIZED MULTIMEDIA CONTENT


The MM4U framework provides the general functionality for the dynamic composition of media elements and composition operators into a coherent personalized multimedia presentation. Having presented the framework layers in the previous section, we now look in more detail at how the layers contribute to the different tasks in the general personalization process as shown in Figure 2. The Media Data Accessor layer provides the personalized selection of media elements by their associated meta data and is described in the next subsection. The Multimedia Composition layer supports the composition of media elements in time and space in the internal multimedia representation format in three different manners, which are presented in detail in the next three subsections. The final subsection describes the last step, the transformation of the multimedia content from the internal document model into an output format that is actually delivered to and rendered by the client devices; this is supported by the Presentation Format Generators layer.

Personalized Multimedia Content Selection


For creating personalized multimedia content, first those media elements have to be selected from the media databases that are most relevant to the user's request. This personalized media selection is realized by the Media Data Accessor and Media Data Connector components of the framework. For the actual personalized selection of media elements, a context object is created within the Multimedia Composition layer carrying the user profile information, the technical characteristics of the end device, and further application-specific information. With this context object, the unified interface of the Media Data Accessor for querying media elements is called. The context object is handed over to the concrete Media Data Connector of the connected media database. Within the Media Data Connector, the context object is mapped to the meta data associated with the media elements in the database, and those media elements are determined that best match the request, that is, the given context object. It is important to note that the Media Data Accessor and Media Data Connector layers integrate and embrace existing multimedia information systems and modern content-based multimedia retrieval solutions. This means that the retrieval of the best match can only be left to the underlying storage and management systems. The framework can only provide comprehensive and lean interfaces to these systems. This can be our own URIMediaServer accessed by the URIMediaConnector, but also other multimedia databases or (multi)media retrieval solutions. The result set of the query is handed back by the Accessor to the composition layer. For example, the context object for our mobile tourist guide application carries information about user interests and preferences with respect to the sights of the city, the display size of the end device, and the location for the tourist guide. The Media Data Connector, realized in this case by our URIMediaConnector, processes this context object and returns images and videos of those sights in Oldenburg that match both the user's interests and preferences and the limited display size of the mobile device. Based on the personalized selection of media elements, the Multimedia Composition layer provides the assembly of these media elements in three different manners, namely the basic, complex, and sophisticated composition of multimedia content, which are described in the following sections.
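As a hedged illustration of this selection step, the following sketch builds a context object for the tourist guide example and queries a media accessor with it; the keys and the accessor interface are assumptions for illustration and not the framework's real API.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A minimal context object: profile and device information as key-value entries.
    class ContextObject {
        final Map<String, Object> entries = new HashMap<>();
        ContextObject put(String key, Object value) { entries.put(key, value); return this; }
    }

    // Unified query interface of the (assumed) media accessor; a concrete connector
    // maps the context entries to the meta data of the stored media elements.
    interface MediaPoolAccessor {
        List<String> query(ContextObject context);   // returns URIs of matching media elements
    }

    class SelectionExample {
        List<String> selectSights(MediaPoolAccessor accessor) {
            ContextObject context = new ContextObject()
                    .put("interests", List.of("churches", "museums"))
                    .put("location", "Oldenburg")
                    .put("displayWidth", 240)        // limited display of the mobile device
                    .put("displayHeight", 320);
            return accessor.query(context);
        }
    }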

Basic Composition Functionality


With the basic composition functionality the MM4U framework provides the basic bricks for composing multimedia content. It forms the basis for assembling the selected media elements into personalized multimedia documents and provides the means for realizing the central aspects of multimedia document models, that is, the temporal model, the spatial layout, and the interaction possibilities of the multimedia presentation. The temporal model of the multimedia presentation is determined by the temporal relationships between the presentations media elements formed by the composition operators.

The spatial layout expresses the arrangement and style of the visual media elements in the multimedia presentation. Finally, the interaction model determines the user interaction of the multimedia presentation, in order to let the user choose between different paths of a presentation. For the temporal model, we selected an interval-based approach as found in Duda & Keramane (1995). The spatial layout is realized by a hierarchical model for media positioning (Boll & Klas, 2001). For interaction with the user, navigational and decision interaction are supported, as can be found with SMIL (Ayars et al., 2001) and MHEG-5 (Echiffre et al., 1998; International Organisation for Standardization, 1996). A basic composition operator, or basic operator, can be regarded as an atomic unit for multimedia composition that cannot be further broken down. Basic operators are quite simple but applicable to any application area and therefore most flexible. Basic temporal operators realize the temporal model, and basic interaction operators realize the interaction possibilities of the multimedia presentation, as specified above. The two basic temporal operators Sequential and Parallel, for example, can be used to present media elements one after the other in a sequence or, respectively, to present media elements in parallel at the same time. With basic temporal operators and media elements, the temporal course of the presentation can be determined, for example as the slideshow depicted in Figure 5. The operators are represented by white rectangles and the media elements by gray ones. The relation between the media elements and the basic operators is shown by the edges, beginning with a filled circle at an operator and ending with a filled rhombus (diamond) at a media element or another operator. The semantics of the slideshow shown in Figure 5 are that it starts with the presentation of the root element, which is the Parallel operator. The semantics of the Parallel operator are that it shows the operators and media elements attached to it at the same time. This means that the audio file starts to play while simultaneously the Sequential operator is presented. The semantics of the Sequential operator are to show the attached media elements one after another, so while the audio file is played in the background, the four slides are presented in sequence. Besides the basic composition operators, the so-called projectors are part of the Multimedia Composition layer. Projectors can be attached to operators and media elements to define, for example, the visual and acoustical layout of the multimedia presentation. Figure 6 shows the slideshow example from above with projectors attached. The spatial position as well as the width and height of the single slide media elements are determined by the corresponding SpatialProjectors. The volume, treble, bass, and balance of the audio medium are determined by the attached AcousticProjector.

Figure 5. Slideshow as an example of assembled multimedia content


Figure 6. Adding layout to the slideshow example

Besides temporal operators, the Multimedia Composition component offers basic operators for specifying the interaction possibilities of the multimedia presentation. Interaction can be added, for example, by using the basic operator InteractiveLink. It defines a link, represented by a single media element or a fragment of a multimedia presentation that is clickable by the user, and a target presentation the user receives if he or she clicks on the link. The description above presents some examples of the basic composition functionality the MM4U framework offers. The framework comprises customary composition operators for creating multimedia content as provided by modern multimedia presentation formats like SMIL and SVG. Even though the basic composition functionality does not reflect the fancy features of some of today's multimedia presentation formats, it supports the very central multimedia features of modeling time, space, and interaction. This allows the transformation of the internal document model into many different multimedia presentation formats for different end devices. With the basic multimedia composition operators the framework offers, arbitrary multimedia presentations can be assembled. However, so far the MM4U framework provides just basic multimedia composition functionality. In the same way that one would use an authoring tool to create SMIL presentations, for example, the GRiNS editor (Oratrix, 2004), one can also use a corresponding authoring tool for the basic composition operators the MM4U framework offers to create multimedia content. For reasons of reusing parts of the created multimedia presentations, for example, a menu bar or a presentation's layout, and for convenience, there is a need for more complex and application-specific composition operators that provide more convenient support for creating the multimedia content.
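The following sketch outlines how the slideshow of Figures 5 and 6 could be assembled from such basic operators and projectors. The class names mirror the operators named in the text, but the code is an illustrative assumption, not MM4U's actual implementation.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal document-tree node with optional layout and audio projectors attached.
    abstract class Node {
        final List<Node> children = new ArrayList<>();
        SpatialProjector spatial;      // position and size of visual elements (optional)
        AcousticProjector acoustic;    // volume settings of audio elements (optional)
        Node add(Node child) { children.add(child); return this; }
    }
    class Parallel extends Node { }    // children are presented at the same time
    class Sequential extends Node { }  // children are presented one after another
    class Media extends Node { final String uri; Media(String uri) { this.uri = uri; } }

    class SpatialProjector { final int x, y, width, height;
        SpatialProjector(int x, int y, int width, int height) {
            this.x = x; this.y = y; this.width = width; this.height = height; } }
    class AcousticProjector { final int volume; AcousticProjector(int volume) { this.volume = volume; } }

    class SlideshowExample {
        Node build() {
            Sequential slides = new Sequential();
            for (int i = 1; i <= 4; i++) {
                Media slide = new Media("slide" + i + ".jpg");
                slide.spatial = new SpatialProjector(0, 0, 320, 240);  // layout per slide
                slides.add(slide);
            }
            Media audio = new Media("background.mp3");
            audio.acoustic = new AcousticProjector(80);                // background volume
            return new Parallel().add(audio).add(slides);              // audio plays while slides run
        }
    }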


Complex Composition Functionality


For creating presentations that are more complex, the Multimedia Composition layer provides the ability to abstract from basic to complex operators. A complex composition operator encapsulates the composition functionality of an arbitrary number of basic operators and projectors and provides the developers with a more complex and application-specific building block for creating the multimedia content. Complex composition operators are composed of basic and other complex operators. As complex composition operators not only embed basic but also other complex operators, they provide for a reuse of composition operators. In contrast to the basic operators, the complex composition operators can be dismantled into their individual parts. Figure 7 depicts a complex composition operator for our slideshow example. It encapsulates the media elements, operators, and projectors of the slideshow (the latter are omitted in the diagram to reduce complexity). The complex operator Slideshow, indicated by a small c symbol in the upper right corner, represents an encapsulation of the former slideshow presentation in a complex object and itself forms a building block for more complex multimedia composition. Complex operators, as described above, define fixed encapsulated presentations. Their temporal flow, spatial layout, and the media elements used cannot be changed subsequently. However, a complex composition operator does not necessarily need to specify all media elements, operators, and projectors of the respective multimedia document tree. Instead, to be more flexible, some parts can be intentionally left open. These parts constitute the parameters of a complex composition operator and have to be filled in for the concrete usage of the operator. Such parameterized complex composition operators are one means to define multimedia composition templates within the MM4U framework. However, only prestructured multimedia content can be created with these templates, since the complex composition operators can only encapsulate presentations of a fixed structure.

Figure 7. The slideshow example as a complex composition operator


Figure 8 shows the slideshow example as a parameterized complex composition operator. In this case, the complex operator Slideshow comprises the two basic operators Parallel and Sequential. The Slideshow's parameters are the placeholders for the single slides and have to be instantiated when the operator is used within the multimedia composition. The slideshow's audio file is already preselected. In addition, the parameters of a complex composition operator can be typed, that is, they expect a specific type of operator or media element. The Slideshow operator would expect visual media elements for the parameters Slide 1 to Slide 4. To indicate the complex operator's parameters, they are visualized by rectangles with dotted lines. The preselected audio file is already encapsulated in the complex operator as illustrated in Figure 7. In the same way that projectors are attached to basic operators in the section on basic composition functionality, they can also be attached to complex operators. The SpatialProjector attached to the Slideshow operator shown in Figure 8 determines that the slideshow's position within a multimedia presentation is x = 100 pixels and y = 50 pixels relative to the position of its parent node. With basic and complex composition operators one can build multimedia composition functionality that is equivalent to the composition functionality of advanced multimedia document models like Madeus (Jourdan et al., 1998) and ZYX (Boll & Klas, 2001). Though complex composition operators can have an arbitrary number of parameters and can be configured individually each time they are used, the internal structure of complex operators is still static. Once a complex operator is defined, the number of parameters and their type are fixed and cannot be changed.

Figure 8. The slideshow example as parameterized complex composition operator
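A hedged sketch of such a parameterized complex operator is given below. It reuses the illustrative Node, Parallel, Sequential, Media, and SpatialProjector classes from the earlier slideshow sketch and is not MM4U's actual Slideshow operator.

    import java.util.List;

    // Parameterized complex operator: the four slide slots are parameters, the audio
    // track and the operator structure are fixed inside the template.
    class ParameterizedSlideshow {
        private final Media audio = new Media("background.mp3");   // preselected in the template
        private final SpatialProjector position;                    // e.g., x = 100, y = 50 relative to the parent

        ParameterizedSlideshow(SpatialProjector position) { this.position = position; }

        // Fills the typed slots (visual media only) and expands to the fixed tree of Figure 7.
        Node instantiate(List<Media> slides) {
            if (slides.size() != 4) {
                throw new IllegalArgumentException("Slideshow expects exactly four slides");
            }
            Sequential sequence = new Sequential();
            slides.forEach(sequence::add);
            Parallel root = new Parallel();
            root.spatial = position;
            return root.add(audio).add(sequence);
        }
    }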


Using a complex composition operator can thus be regarded as filling in a fixed composition template with suitable media elements. Personalization can only take place in selecting those media elements that best fit the user profile information. For the dynamic creation of personalized multimedia content, even more sophisticated composition functionality is needed that allows the composition operators to change the structure of the generated multimedia content at runtime. To realize such sophisticated composition functionality, additional composition logic needs to be included in the composition operators, which can no longer be expressed even by the advanced document models mentioned above.

Sophisticated Composition Functionality


With basic and complex composition functionality, we already provide the dynamic composition of prestructured multimedia content by parameterized multimedia composition templates. However, such templates are only flexible concerning the selection of the concrete composition parameters. To achieve an even more flexible dynamic content composition, the framework provides sophisticated composition operators, which allow determining the document structure and layout at creation time by additional composition logic. Multimedia composition templates defined by using such sophisticated composition operators are no longer limited to creating prestructured multimedia content only, but determine the document structure and layout of the multimedia content on-the-fly, depending on the user profile information, the characteristics of the end device used, and any other additional information. The latter can be, for example, a database containing sightseeing information. Such sophisticated composition operators exploit the basic and complex composition operators the MM4U framework offers but allow more flexible, possibly application-specific multimedia composition and personalization functionality with their additional composition logic. This composition logic can be realized by using document structures, templates, constraints and rules, or by plain programming. Independent of how the sophisticated multimedia content composition functionality is actually realized, the result of this composition process is always a multimedia document tree that consists of basic and complex operators, projectors, as well as media elements. In our graphical notation, sophisticated composition operators are represented in the same way as complex operators, but are labeled with a small s symbol in the upper right corner. Figure 9 shows an example of a parameterized sophisticated composition operator, the CityMap. This operator provides the generation of a multimedia presentation containing a city map image together with a set of available sightseeing spots on it. The parameters of this sophisticated operator are the city map image of arbitrary size and a spot image used for presenting the sights on the map. Furthermore, the CityMap operator reads out the positions of the sights on a reference map (indicated by the table on the right of the figure) and automatically recalculates the positions depending on the size of the actual city map image. Which spots are selected and actually presented on the city map depends on the user profile, in particular the types of sights he or she is interested in and the categories a sight belongs to. In addition, the size of the city map image is selected to best fit the display of the end device. The CityMap operator is used within our personalized city guide prototype presented in the section on Sightseeing4U and serves there for desktop PCs as well as mobile devices.

Figure 9. Insight into the sophisticated composition operator CityMap

The multimedia document tree generated by the CityMap operator is shown in the bottom part of Figure 9. Its root element is the Parallel operator. Attached to it are the image of the city map and a set of InteractiveLink operators. Each InteractiveLink represents a spot on the city map, instantiated by the spot image. The user can click on the spots to receive multimedia presentations with further information about the sights. The positions of the spot images on the city map are determined by the SpatialProjectors. The personalized multimedia presentations about the sights are represented by the sophisticated operators Target 1 to Target N. The CityMap operator is one example of extending the personalization functionality of the MM4U framework by a sophisticated, application-specific multimedia composition operator, here in the area of (mobile) tourism applications. This operator, for example, was developed by programming the required dynamic multimedia composition functionality. However, the realization of the internal composition logic of sophisticated operators is independent of the technology and programming language used. The same composition logic could also be realized by using a different technology, for example, a constraint-based approach. Though the actual realization of the personalized multimedia composition functionality would be different, the multimedia document tree generated by such a rule-based sophisticated operator would be the same as depicted in Figure 9.
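The sketch below illustrates, under stated assumptions, how a sophisticated operator in the spirit of CityMap could compute its document tree at creation time from the user profile; the condensed stub classes and all names are our own illustration, not the operator's real implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Condensed stubs of the document-tree classes used in the earlier sketches.
    class Node { final List<Node> children = new ArrayList<>(); SpatialProjector spatial;
        Node add(Node child) { children.add(child); return this; } }
    class Parallel extends Node { }
    class Media extends Node { final String uri; Media(String uri) { this.uri = uri; } }
    class InteractiveLink extends Node { final Node target; InteractiveLink(Node target) { this.target = target; } }
    class SpatialProjector { final int x, y; SpatialProjector(int x, int y) { this.x = x; this.y = y; } }

    class Sight { String name; String category; double refX, refY; }   // relative position on the reference map

    class CityMapOperator {
        // Builds the map presentation for exactly those sights matching the user's interests.
        Node compose(Media cityMap, String spotImageUri, int mapWidth, int mapHeight,
                     List<Sight> sights, List<String> interests, Map<String, Node> targets) {
            Parallel root = new Parallel();
            root.add(cityMap);
            for (Sight sight : sights) {
                if (!interests.contains(sight.category)) continue;      // personalized selection of spots
                Media spot = new Media(spotImageUri);
                // Recalculate the reference-map position for the actual map size.
                spot.spatial = new SpatialProjector((int) (sight.refX * mapWidth),
                                                    (int) (sight.refY * mapHeight));
                InteractiveLink link = new InteractiveLink(targets.get(sight.name));
                link.add(spot);                                          // the spot image represents the link
                root.add(link);
            }
            return root;
        }
    }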

Sophisticated composition operators allow embracing very different solutions for realizing personalized multimedia composition functionality. This can be plain programming as in the example of the CityMap operator, document structures and templates that are dynamically selected according to the user profile information and filled in with media elements relevant to the user, or systems describing the multimedia content generation by constraints and rules. The core MM4U framework might not offer all kinds of personalized multimedia composition functionality one might require, since the personalization functionality always depends on the actual application to be developed and thus can be very specific. Instead, the framework provides the basis to develop sophisticated personalized multimedia composition operators, such that every application can integrate its own personalization functionality into the framework. So, every sophisticated composition operator can be seen as a small application in itself that conducts a particular multimedia personalization functionality. This small application can be reused within others and thus extends the functionality of the framework. The Multimedia Composition component allows a seamless plug-in of arbitrary sophisticated composition operators into the MM4U framework. This enables even the most complex personalized multimedia composition task to be simply plugged into the system and used by a concrete personalized multimedia application. With the sophisticated composition operators, the MM4U framework provides its most powerful and flexible functionality to generate arbitrary personalized multimedia content. However, this multimedia content is still represented in an internal document model and has to be transformed into a presentation format that can be rendered and displayed by multimedia players on the end devices.

From Multimedia Composition to Presentation


In the last step of the personalization process, the personalized multimedia content represented in our internal document model is transformed by the Presentation Format Generators component into one of the supported standardized multimedia presentation formats, which can be rendered and displayed on the client device. The output format of the multimedia presentation is selected according to the user's preferences and the capabilities of the end device, that is, the available multimedia players and the multimedia presentation formats they support. The Presentation Format Generators adapt the characteristics and facilities of the internal document model provided by the Multimedia Composition layer, with regard to the time model, spatial layout, and interaction possibilities used, to the particular characteristics and syntax of the concrete presentation format. For example, the spatial layout of our internal document model is realized by a hierarchical model that supports the positioning of media elements in relation to other media elements. This relative positioning is supported by most of today's presentation formats, for example, SMIL, SVG, and XMT-Omega. However, there exist multimedia presentation formats that do not support such a hierarchical model and only allow an absolute positioning of visual media elements with regard to the presentation's origin, as, for example, the Basic Language Profile of SMIL, 3GPP SMIL, and Macromedia's Flash. In this case, the Presentation Format Generators component transforms the hierarchically organized spatial layout of the internal document model into a spatial layout of absolute positioning.

How the transformation of the spatial model is actually performed, and how the temporal model and interaction possibilities of the internal document model are transformed into the characteristics and syntax of the concrete presentation formats, is intentionally omitted in this book chapter due to its focus on the composition and assembly of the personalized multimedia content; it is described in Scherp and Boll (2005).
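As one small, hedged example of such a generator concern, the sketch below flattens a hierarchical, relative spatial layout into absolute coordinates, as a format supporting only absolute positioning would require; the classes are condensed stand-ins, and the real transformation is described in Scherp and Boll (2005).

    import java.util.ArrayList;
    import java.util.List;

    // A region positioned relative to its parent, as in the hierarchical internal model.
    class Region {
        final String id; final int relX, relY; final List<Region> children = new ArrayList<>();
        Region(String id, int relX, int relY) { this.id = id; this.relX = relX; this.relY = relY; }
    }

    // A region with coordinates absolute to the presentation's origin.
    class AbsoluteRegion {
        final String id; final int absX, absY;
        AbsoluteRegion(String id, int absX, int absY) { this.id = id; this.absX = absX; this.absY = absY; }
    }

    class SpatialFlattener {
        List<AbsoluteRegion> flatten(Region root) {
            List<AbsoluteRegion> result = new ArrayList<>();
            collect(root, 0, 0, result);
            return result;
        }
        private void collect(Region region, int parentX, int parentY, List<AbsoluteRegion> out) {
            int x = parentX + region.relX;                   // accumulate offsets along the hierarchy
            int y = parentY + region.relY;
            out.add(new AbsoluteRegion(region.id, x, y));
            for (Region child : region.children) collect(child, x, y, out);
        }
    }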

IMPACT OF PERSONALIZATION ON THE DEVELOPMENT OF MULTIMEDIA APPLICATIONS


The multimedia personalization framework MM4U presented so far provides support for developing sophisticated personalized multimedia applications. The parties involved in the development of such applications are typically a heterogeneous team of developers from different fields, including media designers, computer scientists, and domain experts. In this section, we describe what challenges personalization brings to the development of personalized multimedia applications and how and where the MM4U framework can support the developer team in accomplishing their job. In the next subsection, the general software engineering issues with regard to personalization are discussed. We describe how personalization affects the individual members of the heterogeneous developer team and how the MM4U framework supports the development of personalized multimedia applications. The challenges that arise when domain experts create personalized multimedia content using an authoring tool are presented in the subsequent subsection. We also introduce how the MM4U framework can be used to develop a domain-specific authoring tool in the field of e-learning content, which aims to hide the technical details of content authoring from the authors and lets them concentrate on the actual creation of the personalized multimedia content.

Influence of Personalization on Multimedia Software Engineering


We observe that software engineering support for multimedia applications, such as proper process models and development methodologies, is still hard to find. Furthermore, the existing process models and development methodologies for multimedia applications, for example Rout and Sherwood (1999) and Engels et al. (2003), do not address personalization aspects. Personalization requirements, however, complicate the software development process even further and increase the development costs, since every individual alternative and variant has to be anticipated, considered, and actually implemented. Therefore, there is a high demand for support of the development process of such applications. In the following paragraphs, we first introduce how personalization affects the software development process with respect to the multimedia content creation process in general. Then we identify what support the developers of personalized multimedia applications need and consider where the MM4U framework supports the development process. Since the meaning of the term personalization depends profoundly on the application's context, it has to be reconsidered whenever a personalized application is developed for
a new domain. Rossi et al. (2001) claim that personalization should be considered right from the beginning, when a project is conceived. Therefore, the first activity when developing a personalized multimedia application is to determine the personalization requirements, that is, which aspects of personalization are to be supported by the actual application. For example, in the case of an e-learning application the personalization aspects concern the automatic adaptation to the students' different learning styles and their prior knowledge about the topic. In addition, different degrees of difficulty should be supported by a personalized e-learning application. In the case of a personalized mobile tourism application, however, the user's location and his or her surroundings would be of interest for personalization instead. These personalization aspects must be kept in mind during every activity throughout the whole development process. The decision regarding which personalization aspects are to be supported has to be incorporated in the analysis and design of the personalized application and should result in a flexible and extensible software design. However, this increases the overall complexity of the application to be developed and automatically leads to a higher development effort, including a longer development duration and higher costs. Therefore, a good requirements analysis is crucial when developing personalized applications, lest one dissipates one's energies in a poor software design with respect to the personalization aspects. When transferring the requirements for developing personalized software to the specific requirements of personalized multimedia applications, one can say that personalization affects all members of the developer team, that is, the domain experts, the media designers, and the computer scientists, and places higher demands on them. The domain expert normally contributes to the development of multimedia applications by providing input for drawing storyboards of the specific application's domain. These storyboards are normally drawn by media designers and are the most important means of communicating the later application's functionality within the developer team. When personalization comes into play, it is difficult to draw such storyboards because of the many possible alternatives and different paths through the application that personalization implies. Consequently, the storyboards change with regard to, for example, the individual user profiles and the end devices that are used. When drawing storyboards for a personalized multimedia application, the points in the storyboard where personalization is required have to be identified and visualized. Storyboards have to be drawn for every typical personalization scenario concerning the concrete application. This drawing task should be supported by interactive graphical tools to create personalized storyboards and to identify reusable parts and modules of the content. It is the task of the media designers in the development of multimedia applications to plan, acquire, and create media elements. With personalization, media designers additionally have to think about the usage of media elements for personalization purposes, that is, the media elements have to be created and prepared for different contexts.
When acquiring media elements, the media designers must consider for which user contexts the media elements are created and which aspects of personalization are to be supported, for example, different styles, colours, and spatial dimensions. Possibly a set of quite similar media assets has to be developed that differ only in certain aspects. For example, an image or video has to be transformed for different end device resolutions, colour depths,
and network connections. Since personalization means to (re)assemble existing media elements into a new multimedia presentation, the media designers will also have to identify reusable media elements. This means, additionally, that the storyboards must already capture the personalization aspects. Not only the content but also the layout of the multimedia application can change depending on the user context. So the media designers have to create different visual layouts for the same application to serve the needs of different user groups. For example, an e-learning system for children would generate colourful multimedia presentations with many auditory elements and a comic-like virtual assistant, whereas the system would present the same content in a much more factual style for adults. This short discussion shows that personalization already affects storyboarding and media acquisition. Creating media elements for personalized multimedia applications requires better and more elaborate planning of the multimedia production. Therefore, a good media production strategy is crucial, due to the high costs involved with the media production process. Consequently, the domain experts and the media designers need to be supported by appropriate tools for planning, acquiring, and editing media elements for personalized multimedia applications. The computer scientists actually have to develop the multimedia personalization functionality of the concrete application. What this personalization functionality is depends heavily on the concrete application domain and is communicated with the domain experts and media designers by using personalized storyboards. With personalization, the design of the application has to be more flexible and more abstract to meet the requirements of changing user profile information and different end device characteristics. This is where the MM4U framework comes into play. It provides the computer scientists with the general architecture of the personalized multimedia application and supports them in designing and implementing the concrete multimedia personalization functionality. When using the MM4U framework, the computer scientists must know how to use and extend it. The framework provides the basis for developing both basic and sophisticated multimedia personalization functionality, for example the Slideshow or the CityMap operator presented in the section on content. To assist the computer scientists methodically, we are currently working on guidelines and checklists on how to develop the personalized multimedia composition operators and how to apply them. Consequently, for the computer scientists the development of personalized multimedia applications with the MM4U framework essentially means the design, development, and deployment of multimedia composition operators for generating personalized content; a minimal sketch of such an operator is given below. The concept behind the multimedia composition operators introduced in the content section, namely that every concrete personalized multimedia application is itself a new composition operator, increases the reuse of existing personalization functionality. Furthermore, the interface design of the sophisticated operators makes it possible to embrace existing approaches that are able to generate multimedia document trees, so that such trees can also be generated with the basic and complex composition functionality of the MM4U framework.
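The following minimal Java sketch illustrates the operator idea under simplified, hypothetical interfaces (they are stand-ins for illustration, not MM4U's actual operator API): an application-level operator is itself just another composition operator that delegates to simpler ones depending on the user profile.

// Simplified, hypothetical stand-ins; MM4U's real operator interfaces may differ.
interface UserProfile {
    String getPreference(String key);
}

interface MultimediaFragment { /* node of the internal multimedia document tree */ }

interface CompositionOperator {
    // Assembles a fragment of the internal document model for the given user context.
    MultimediaFragment compose(UserProfile profile);
}

// A concrete application composed from simpler operators is itself an operator.
class CityTourOperator implements CompositionOperator {
    private final CompositionOperator slideshow;
    private final CompositionOperator cityMap;

    CityTourOperator(CompositionOperator slideshow, CompositionOperator cityMap) {
        this.slideshow = slideshow;
        this.cityMap = cityMap;
    }

    @Override
    public MultimediaFragment compose(UserProfile profile) {
        // Chooses between a map-centred and a slideshow-centred presentation
        // depending on a hypothetical profile entry.
        if ("map".equals(profile.getPreference("presentationStyle"))) {
            return cityMap.compose(profile);
        }
        return slideshow.compose(profile);
    }
}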

Influence of Personalization on Multimedia Content Authoring


Authoring of multimedia content is the process in which the multimedia presentations are actually created. This creation process is typically supported by graphical
authoring tools, for example, Macromedia's Authorware and Director (Macromedia, 2004), Toolbook (Click2learn, 2001-2002), and the environments described by Arndt (1999) and Gaggi and Celentano (2002). For creating the multimedia content, the authoring tools follow different design philosophies and metaphors. These metaphors can be roughly categorized into script-based, card/page-based, icon-based, timeline-based, and object-based authoring (Rabin & Burns, 1996). All these different metaphors have the same goal: to support authors in creating their content. Even though a set of valuable authoring tools has been developed based on these metaphors, the metaphors do not necessarily provide suitable means for authoring personalized content. In the context of our research project Cardio-OP we gained early experience with personalized content authoring for domain experts in the field of cardiac surgery (Klas et al., 1999; Greiner & Rose, 1998; Boll et al., 2001). One of the tools developed by a project partner, the Cardio-OP Authoring Wizard, is a page-based, easy-to-use multimedia authoring environment enabling medical experts to compose a multimedia book on operative techniques in the domain of cardiac surgery for three different target groups: medical doctors, nurses, and students. The Authoring Wizard guides the author through the particular authoring steps and offers dialogues specifically tailored to the needs of each step. Coupled tightly with an underlying media server, the Authoring Wizard allows every precious piece of media data available at the media server to be used in all of the instructional applications at the different educational levels. This promotes reuse of expensively produced content in a variety of different contexts. Personalization of the e-learning content is required here, since the three target groups have different views of and knowledge about the domain of cardiac surgery. Therefore, the target groups require different information from such a multimedia book, presented at an adequate level of difficulty for each group. However, the experience we gained from deploying this tool shows that it is hard to provide the domain authors with an adequately intuitive user interface for the creation of personalized multimedia e-learning content for three educational levels. It was a specific challenge for the computer scientists involved in the project to provide both media creation tools and a multimedia authoring wizard that allow the domain experts to insert knowledge into the system, while at the same time hiding the technical details from them as much as possible. On the basis of the MM4U framework, we are currently developing a smart authoring tool aimed at domain experts who create personalized multimedia e-learning content. The tool we are developing works at the what-you-see-is-what-you-get (WYSIWYG) level and can be seen as a specialized application employing the framework to create personalized content. The LEBONED repositories constitute the content source from which this personalized e-learning content is created. Within the LEBONED project (Oldenettel & Malachinski, 2003), digital libraries are integrated into learning management systems. To use the content managed by the LEBONED system for new e-learning units, multimedia authoring support is needed for assembling existing e-learning modules into new, possibly more complex, units.
In the e-learning context, the learners' background is highly relevant to the content that meets their learning demands; that is, personalized multimedia content can match the users' background knowledge and interests much better than a one-size-fits-all e-learning unit. On the other hand, the creation of an e-learning unit cannot be supported by a merely automatic
process. Rather, the domain experts would like to control the assembly of the content, because they are responsible for the content conveyed. The smart authoring tool guides the domain experts through the composition process and supports them in creating presentations that still provide flexibility with respect to the targeted user context. In the e-learning context we can expect domain experts such as lecturers who want to create a new e-learning unit but do not want to be bothered with the technical details of (multimedia) authoring. We use the MM4U framework to build the multimedia composition and personalization functionality of this smart authoring tool. For this, the Multimedia Composition component supports the creation and processing of arbitrary document structures and templates. The authoring tool exploits this composition functionality to achieve a document structure that is suitable precisely for that content domain and the targeted audience. The Media Data Accessor supports the authoring tool in those parts in which it lets the author choose from only those media elements that are suitable for the intended user contexts and that can be adapted to the user's infrastructure. Using the Presentation Format Generators, the authoring tool finally generates the presentations for the different end devices of the targeted users. Thus the authoring process is guided and specialized with regard to selecting and composing personalized multimedia content; a sketch of how these components might be wired together follows below. For the development of this authoring tool, the framework fulfils the same function in the process of creating personalized multimedia content as it does in a multimedia application, as described in the previous section on the framework. However, the creation of personalized content is not achieved at once but step by step during the authoring process.
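To make the division of labour between the three components more tangible, the following Java sketch shows how an authoring tool might drive them in a single authoring step. All interface and method names are hypothetical assumptions for illustration and do not reproduce the framework's real API.

// Hypothetical facade illustrating how an authoring tool might drive the three
// MM4U components named above; the types shown are not the framework's real API.
interface MediaDataAccessor {
    java.util.List<String> findSuitableMedia(String topic, String deviceClass);
}

interface MultimediaComposition {
    Object composeUnit(java.util.List<String> mediaUris, String templateId);
}

interface PresentationFormatGenerator {
    String generate(Object internalDocument, String format); // e.g., "SMIL" or "SVG"
}

class SmartAuthoringFacade {
    private final MediaDataAccessor accessor;
    private final MultimediaComposition composition;
    private final PresentationFormatGenerator generator;

    SmartAuthoringFacade(MediaDataAccessor a, MultimediaComposition c, PresentationFormatGenerator g) {
        this.accessor = a; this.composition = c; this.generator = g;
    }

    // One authoring step: offer suitable media, compose a unit, and preview it for a device.
    String authorUnit(String topic, String templateId, String deviceClass, String format) {
        java.util.List<String> media = accessor.findSuitableMedia(topic, deviceClass);
        Object document = composition.composeUnit(media, templateId);
        return generator.generate(document, format);
    }
}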

IMPLEMENTATION AND PROTOTYPICAL APPLICATIONS


The framework, with its components, classes, and interfaces, is specified using the Unified Modeling Language (UML) and has been implemented in Java. The development process for the framework is carried out as an iterative software development with stepwise refinement and enhancement of the framework's components. The redesign phases are triggered by the actual experience of implementing the framework, but also by employing the framework in several application scenarios. In addition, we are planning to provide a beta version of the MM4U framework to other developers so that they can test the framework and develop their own personalized multimedia applications with MM4U. Currently, we are implementing several application scenarios to prove the applicability of MM4U in different application domains. These prototypes are the first stress test for the framework. At the same time, the development of the sample applications gives us important feedback about the comprehensiveness and the applicability of the framework. In the following sections, two of our prototypes that are based on the MM4U framework are introduced: in the Sightseeing4U subsection, a prototype of a personalized city guide is presented, and in the Sports4U subsection a prototype of a personalized multimedia sports news ticker is described.

Sightseeing4U: A Generic Personalized City Guide


Our first prototype using the MM4U framework is Sightseeing4U, a generic personalized city guide application (Scherp & Boll, 2004a, 2004b; Boll et al., 2004). It can be used to develop personalized tourist guides for arbitrary cities, both for desktop PCs and for mobile devices such as PDAs (Personal Digital Assistants). The generic Sightseeing4U application uses the MM4U framework and its modules as depicted in Figure 3. The concrete demonstrator we developed for our hometown Oldenburg in Northern Germany covers the pedestrian zone and comprises video and image material of about 50 sights. The demonstrator is developed for desktop PCs as well as PDAs (Scherp & Boll, 2004a). It supports personalization with respect to the user's interests, for example, churches, museums, and theatres, and preferences such as the favourite language. Depending on the specific sightseeing interests, the proper sights are automatically selected for the user. This is realized by category matching of the user's interests with the meta data associated with the sights. Figure 10 and Figure 11 show some screenshots of our city guide application in different output formats and on different end devices. The presentation in Figure 10 is targeted at a user interested in culture, whereas the presentation in Figure 11 is generated for a user who is hungry and searches for a good restaurant in Oldenburg. The different interests of the users result in different spots being presented on the map of Oldenburg. When clicking on a certain spot, the user receives a multimedia presentation with further information about the sight (see the little boxes where the arrows point).

Figure 10. Screenshots of the city guide application for a user interested in culture (presentation generated in SMIL 2.0 and SMIL 2.0 BLP format, respectively)

(a) RealOne Player (RealNetworks, 2003) on a Desktop PC

(b) PocketSMIL Player (INRIA, 2003) on a PDA

Figure 11. Screenshots of the Sightseeing4U prototype for a user searching for a good restaurant (output generated in SVG 1.2 and Mobile SVG format, respectively)

(a) Adobe SVG Plug-In (Adobe Systems, 2001) on a Tablet PC

(b) Pocket eSVG viewer (EXOR, 2001-2004) on a PDA

Thereby, the media elements for the multimedia presentation are automatically selected to best fit the characteristics of the end device. For example, a user sitting at a desktop PC receives a high-quality video about the palace of Oldenburg, as depicted in Figure 10a, while a mobile user gets a smaller video of lower quality, as in Figure 10b. In the same way, the user searching for a good restaurant in Oldenburg receives either a high-quality video when using a Tablet PC, as depicted in Figure 11a, or a smaller one that meets the limitations of the mobile device, as shown in Figure 11b. If no video of a particular sight is available at all, the personalized tourist guide automatically selects images instead and generates a slideshow for the user.
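The following Java sketch illustrates, under hypothetical data structures, the two selection steps described above: filtering sights by matching their categories against the user's interests, and picking the media variant that best fits the end device. It is an illustrative sketch, not the Sightseeing4U implementation.

// Illustrative only: hypothetical data structures, not the Sightseeing4U code.
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class Sight {
    String name;
    Set<String> categories;       // e.g., "church", "museum", "restaurant"
    List<MediaVariant> variants;  // the same content in different qualities
}

class MediaVariant {
    String uri;
    boolean isVideo;
    int widthPx;                  // matched against the device's display width
}

class SightSelector {
    // Keeps only sights whose categories overlap with the user's interests.
    static List<Sight> matchInterests(List<Sight> sights, Set<String> interests) {
        return sights.stream()
                .filter(s -> s.categories.stream().anyMatch(interests::contains))
                .collect(Collectors.toList());
    }

    // Picks the largest variant that still fits the device's display width;
    // falls back to the first variant if none fits (assumes at least one variant exists).
    static MediaVariant pickVariant(Sight sight, int deviceWidthPx) {
        return sight.variants.stream()
                .filter(v -> v.widthPx <= deviceWidthPx)
                .max((a, b) -> Integer.compare(a.widthPx, b.widthPx))
                .orElse(sight.variants.get(0));
    }
}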

Sports4U: A Personalized Multimedia Sports News Ticker


A second prototype that uses our MM4U framework is the personalized multimedia sports news ticker called Sports4U. The Sports4U application exploits the Medither multimedia event space introduced in Boll and Westermann (2003). The Medither is based on a decentralized peer-to-peer infrastructure and allows one to publish, to find, and to be notified about any kind of multimedia events of interest. In the case of Sports4U, the event space forms the media data basis of sports-related multimedia news events. A multimedia sports news event comprises data of different media types such as descriptive text, a title, one or more images, an audio recording, or a video clip. The personalized sports ticker application combines the multimedia data of the selected events, the available meta data, and additional information, for example, from a soccer player database.
Figure 12. Screenshots of the personalized sports news ticker Sports4U

The application uses a sophisticated composition operator that automatically arranges these multimedia sports news items into a coherent presentation. It respects possible constraints such as a running time limit and particular characteristics of the end device, such as the limited display size of a mobile device. The result is a sports news presentation that can be viewed, for example, with an SMIL player over the Web, as shown in Figure 12. With a suitable Media Data Connector, the Medither is connected to the MM4U framework. This connector not only allows querying for media elements, like the URIMediaConnector, but also provides notification of incoming multimedia events to the actual personalized application. Depending on the user context, the Sports4U prototype receives the sports news from the pool of sports events in the Medither that match the user's profile. The Sports4U application relieves the user of the time-consuming task of searching for sports news he or she might be interested in.
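A connector of this kind could, for example, combine a query interface with a publish/subscribe callback. The following Java sketch is a hypothetical illustration of such an interface; apart from the URIMediaConnector mentioned above, none of the names reflect MM4U's actual API.

// Hypothetical sketch of an event-capable connector; MM4U's real connector API may differ.
import java.util.List;

interface MediaElement { /* URI, media type, meta data, ... */ }

interface MediaQuery { /* query terms over the media meta data */ }

interface MediaDataConnector {
    List<MediaElement> query(MediaQuery query);
}

// Callback for multimedia events pushed by the Medither event space.
interface MultimediaEventListener {
    void onEvent(MediaElement newsEvent);
}

// Extends plain querying with notification about incoming sports news events.
interface MeditherConnector extends MediaDataConnector {
    void subscribe(MediaQuery profileDerivedQuery, MultimediaEventListener listener);
    void unsubscribe(MultimediaEventListener listener);
}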

CONCLUSION
In this chapter, we presented an approach for supporting the creation of personalized multimedia content. We motivated the need for technology to handle the flood of multimedia information, technology that allows for a much more targeted, individual management of and access to multimedia content. To give a better understanding of the content creation process, we introduced the general approaches in multimedia data modeling and multimedia authoring as we find them today. We showed how the need for personalization of multimedia content heavily affects the multimedia content creation process and can only result in dynamic, (semi)automatic support for the personalized assembly of multimedia content. We looked into existing related approaches, ranging from personalization in the text-centric Web context through single-media personalization to the personalization of multimedia content. Especially for complex personalization tasks we observe that (additional) programming is needed, and we propose software engineering support with our Multimedia for you framework (MM4U). We presented the MM4U framework concept in general and, in more detail, the single layers of the MM4U framework: access to user profile information, personalized
media selection by meta data, composition of complex multimedia presentations, and generation of different output formats for different end devices. As a central part of the framework, we developed multimedia composition operators which create multimedia content in an internal model and representation for multimedia presentations, integrating the composition capabilities of advanced multimedia composition models. Based on this representation, the framework provides so-called generators to dynamically create different context-aware multimedia presentations in formats such as SMIL and SVG. The usage of the framework and its advantages have been presented in the context of multimedia application developers, but also in the specific case of using the framework's specific features for the development of a high-level authoring tool for domain experts. With the framework developed, we achieved our goals concerning the development of a domain-independent framework that supports the creation of personalized multimedia content independent of the final presentation format. Its design allows developers to simply use the functionality it provides, for example, the access to media data, associated meta data, and user profile information, as well as the generation of the personalized multimedia content in standardized presentation formats. Hence, the framework relieves the developers of personalized multimedia applications from common tasks needed for content personalization, that is, personalized content selection, composition functionality, and presentation generation, and lets them concentrate on their application-specific job. However, the framework is also designed to be extensible with regard to application-specific personalization functionality, for example, by application-specific personalized multimedia composition functionality. With the applications in the fields of tourism and sports news, we illustrated the usage of the framework in different domains and showed how the framework easily allows one to dynamically create personalized multimedia content for different user contexts and devices. The framework has been designed not to become yet another framework, but to build on and integrate previous and existing research in the field. The design has been based on our long-term experience with advanced multimedia composition models and an extensive study of previous and ongoing related approaches. Its interfaces and extensibility explicitly allow not only extending the framework's functionality but also embracing existing solutions of other (research) approaches in the field. The dynamic creation of personalized multimedia content needs semantically richly annotated content in the respective media databases. In turn, the newly created content itself not only provides users with personalized multimedia information but at the same time forms new, semantically even richer multimedia content that can be retrieved and reused. The composition operators provide common multimedia composition functionality but also allow the integration of very specific operators. The explicit decision for presentation independence, achieved by a comprehensive internal composition model, makes the framework independent of any specific presentation format and prepares it for future formats to come.
Even though a software engineering approach towards dynamic creation of personalized multimedia content may not be the obvious one, we are convinced that our framework fills the gap between dynamic personalization support based on abstract data models, constraints or rules, and application-specific programming of personalized multimedia applications. With the MM4U framework, we are contributing to a new but obvious research challenge in the field of multimedia research, that is, the shift from tools for the manual creation of static multimedia content towards techniques for the dynamic
creation of context-aware and personalized multimedia content, which is needed in many application fields. Due to its domain independence, MM4U can be used by arbitrary personalized multimedia applications, each application applying a different configuration of the framework. Consequently, for providers of applications the framework approach supports a cheaper and quicker development process and thereby contributes to more efficient personalized multimedia content engineering.

REFERENCES

3rd Generation Partnership Project (2003a). TS 26.234; Transparent end-to-end packet-switched streaming service; protocols and codecs (Release 5). Retrieved December 19, 2003, from http://www.3gpp.org/ftp/Specs/html-info/26234.htm
3rd Generation Partnership Project (2003b). TS 26.246; Transparent end-to-end packet-switched streaming service: 3GPP SMIL language profile (Release 6). Retrieved December 19, 2003, from http://www.3gpp.org/ftp/Specs/html-info/26246.htm
Ackermann, P. (1996). Developing object oriented multimedia software: Based on MET++ application framework. Heidelberg, Germany: dpunkt.
Adobe Systems, Inc., USA (2001). Adobe SVG Viewer. Retrieved February 25, 2004, from http://www.adobe.com/svg/
Agnihotri, L., Dimitrova, N., Kender, J., & Zimmerman, J. (2003). Music videos miner. In ACM Multimedia.
Allen, J. F. (1983, November). Maintaining knowledge about temporal intervals. In Commun. ACM, 25(11).
Amazon, Inc., USA (1996-2004). Amazon.com. Retrieved February 20, 2004, from http://www.amazon.com/
Andersson, O., Axelsson, H., Armstrong, P., Balcisoy, S., et al. (2004a). Mobile SVG profiles: SVG Tiny and SVG Basic. W3C recommendation 25/03/2004. Retrieved June 10, 2004, from http://www.w3.org/TR/SVGMobile12/
Andersson, O., Axelsson, H., Armstrong, P., Balcisoy, S., et al. (2004b). Scalable vector graphics (SVG) 1.2 specification. W3C working draft 05/10/2004. Retrieved June 10, 2004, from http://www.w3c.org/Graphics/SVG/
André, E. (1996). WIP/PPP: Knowledge-based methods for fully automated multimedia authoring. In Proceedings of EUROMEDIA'96, London.
André, E., & Rist, T. (1995). Generating coherent presentations employing textual and visual material. In Artif. Intell. Rev., 9(2-3). Kluwer Academic Publishers.
André, E., & Rist, T. (1996, August). Coping with temporal constraints in multimedia presentation planning. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, Oregon.
Arndt, T. (1999, June). The evolving role of software engineering in the production of multimedia applications. In IEEE International Conference on Multimedia Computing and Systems Volume 1, Florence, Italy.
Ayars, J., Bulterman, D., Cohen, A., Day, K., Hodge, E., Hoschka, P., et al. (2001). Synchronized multimedia integration language (SMIL 2.0) specification. W3C Recommendation 08/07/2001. Retrieved February 23, 2004, from http://www.w3c.org/AudioVideo/
Beckett, D., & McBride, B. (2003). RDF/XML syntax specification (revised). W3C recommendation 15/12/2003. Retrieved February 23, 2004, from http://www.w3c.org/RDF/
Bohrer, K., & Holland, B. (2004). Customer profile exchange (cpexchange) specification version 1.0, 20/10/2000. Retrieved January 27, 2004, from http://www.cpexchange.org/standard/cpexchangev1_0F.zip
Boll, S. (2003, July). Vienna 4 U - What Web services can do for personalized multimedia applications. In Proceedings of the Seventh Multi-Conference on Systemics Cybernetics and Informatics (SCI 2003), Orlando, Florida, USA.
Boll, S., & Klas, W. (2001). ZYX - A multimedia document model for reuse and adaptation. In IEEE Transactions on Knowledge and Data Engineering, 13(3).
Boll, S., Klas, W., & Wandel, J. (1999, November). A cross-media adaptation strategy for multimedia presentations. In Proc. of the ACM Multimedia Conf. 99, Part 1, Orlando, Florida, USA.
Boll, S., Klas, W., Heinlein, C., & Westermann, U. (2001, August). Cardio-OP - Anatomy of a multimedia repository for cardiac surgery. Technical Report TR-2001301, University of Vienna, Austria.
Boll, S., Klas, W., & Westermann, U. (2000, August). Multimedia Document Formats - Sealed Fate or Setting Out for New Shores? In Multimedia - Tools and Applications, 11(3).
Boll, S., Krösche, J., & Scherp, A. (2004, September). Personalized multimedia meets location-based services. In Proceedings of the Multimedia-Informationssysteme Workshop associated with the 34th annual meeting of the German Society of Computing Science, Ulm, Germany.
Boll, S., Krösche, J., & Wegener, C. (2003, August). Paper chase revisited - A real world game meets hypermedia (short paper). In Proc. of the Intl. Conference on Hypertext (HT03), Nottingham, UK.
Boll, S., & Westermann, U. (2003, November). Medither - An event space for context-aware multimedia experiences. In Proc. of the International ACM SIGMM Workshop on Experiential Telepresence, Berkeley, CA, USA.
Bordegoni, M., Faconti, G., Feiner, S., Maybury, M. T., Rist, T., Ruggieri, S., et al. (1997, December). A standard reference model for intelligent multimedia presentation systems. In ACM Computer Standards & Interfaces, 18(6-7).
Brusilovsky, P. (1996). Methods and techniques of adaptive hypermedia. User Modeling and User Adapted Interaction, 6(2-3).
Bugaj, S., Bulterman, D., Butterfield, B., Chang, W., Fouquet, G., Gran, C., et al. (1998). Synchronized multimedia integration language (SMIL 1.0) specification. W3C Recommendation 06/15/1998. Retrieved June 10, 2004, from http://www.w3.org/TR/REC-smil/
Bulterman, D. C. A., van Rossum, G., & van Liere, R. (1991). A structure for transportable, dynamic multimedia documents. In Proceedings of the Summer 1991 USENIX Conf., Nashville, TN, USA.
Chen, G., & Kotz, D. (2000). A survey of context-aware mobile computing research. Technical Report TR2000-381. Dartmouth University, Department of Computer Science.
Click2learn, Inc, USA (2001-2002). Toolbook - Standards-based content authoring. Retrieved February 6, 2004, from http://home.click2learn.com/en/toolbook/index.asp
CWI (2004). The Cuypers multimedia transformation engine. Amsterdam, The Netherlands. Retrieved February 25, 2004, from http://media.cwi.nl:8080/demo/
De Bra, P., Aerts, A., Berden, B., De Lange, B., Rousseau, B., Santic, T., Smits, D., & Stash, N. (2003, August). AHA! The Adaptive Hypermedia Architecture. In Proceedings of the ACM Hypertext Conference, Nottingham, UK.
De Bra, P., Aerts, A., Houben, G.-J., & Wu, H. (2000). Making general-purpose adaptive hypermedia work. In Proc. of the AACE WebNet Conference, San Antonio, Texas.
De Bra, P., Aerts, A., Smits, D., & Stash, N. (2002a, October). AHA! version 2.0: More adaptation flexibility for authors. In Proc. of the AACE ELearn2002 Conf.
De Bra, P., Brusilovsky, P., & Conejo, R. (2002b, May). Proc. of the Second Intl. Conf. for Adaptive Hypermedia and Adaptive Web-Based Systems, Malaga, Spain, Springer LNCS 2347.
De Bra, P., Brusilovsky, P., & Houben, G.-J. (1999a, December). Adaptive hypermedia: From systems to framework. ACM Computing Surveys, 31(4).
De Bra, P., Houben, G.-J., & Wu, H. (1999b). AHAM: A Dexter-based reference model for adaptive hypermedia. In Proceedings of the 10th ACM Conf. on Hypertext and Hypermedia: Returning to our diverse roots, Darmstadt, Germany.
De Carolis, B., de Rosis, F., Andreoli, C., Cavallo, V., & De Cicco, M. L. (1998). The Dynamic Generation of Hypertext Presentations of Medical Guidelines. The New Review of Hypermedia and Multimedia, 4.
De Carolis, B., de Rosis, F., Berry, D., & Michas, I. (1999). Evaluating plan-based hypermedia generation. In Proc. of the European Workshop on Natural Language Generation, Toulouse, France.
de Rosis, F., De Carolis, B., & Pizzutilo, S. (1999). Software documentation with animated agents. In Proc. of the 5th ERCIM Workshop on User Interfaces For All, Dagstuhl, Germany.
Dublin Core Metadata Initiative (1995-2003). Expressing simple Dublin Core in RDF/XML, 1995-2003. Retrieved February 2, 2004, from http://dublincore.org/documents/2002/07/31/dcmes-xml/
Duda, A., & Keramane, C. (1995). Structured temporal composition of multimedia data. In Proceedings of the IEEE International Workshop on Multimedia Database Management Systems.
Duke, D. J., Herman, I., & Marshall, M. S. (1999). PREMO: A framework for multimedia middleware: Specification, rationale, and Java binding. New York: Springer.
Echiffre, M., Marchisio, C., Marchisio, P., Panicciari, P., & Del Rossi, S. (1998, January-March). MHEG-5 - Aims, concepts, and implementation issues. In IEEE Multimedia.
Egenhofer, M. J., & Franzosa, R. (1991, March). Point-Set Topological Spatial Relations. Int. Journal of Geographic Information Systems, 5(2).
Elhadad, M., Feiner, S., McKeown, K., & Seligmann, D. (1991). Generating customized text and graphics in the COMET explanation testbed. In Proc. of the 23rd Conference on Winter Simulation. IEEE Computer Society, Phoenix, Arizona, USA.
Engels, G., Sauer, S., & Neu, B. (2003, October). Integrating software engineering and user-centred design for multimedia software developments. In Proc. IEEE Symposia on Human-Centric Computing Languages and Environments - Symposium on Visual/Multimedia Software Engineering, Auckland, New Zealand. IEEE Computer Society Press.
Exor International Inc. (2001-2004). eSVG: Embedded SVG. Retrieved February 12, 2004, from http://www.embedding.net/eSVG/english/overview/overview_frame.html
Fink, J., Kobsa, A., & Schreck, J. (1997). Personalized hypermedia information through adaptive and adaptable system features: User modeling, privacy and security issues. In A. Mullery, M. Besson, M. Campolargo, R. Gobbi, & R. Reed (Eds.), Intelligence in services and networks: Technology for cooperative competition. Berlin: Springer.
Foundation for Intelligent Physical Agents (2002). FIPA device ontology specification, 2002. Retrieved January 23, 2004, from http://www.fipa.org/specs/fipa00091/
Gaggi, O., & Celentano, A. (2002). A visual authoring environment for prototyping multimedia presentations. In Proceedings of the IEEE Fourth International Symposium on Multimedia Software Engineering.
Girgensohn, A., Bly, S., Shipman, F., Boreczky, J., & Wilcox, L. (2001). Home video editing made easy - Balancing automation and user control. In Proc. of Human-Computer Interaction, Tokyo, Japan.
Girgensohn, A., Shipman, F., & Wilcox, L. (2003, November). Hyper-Hitchcock: Authoring Interactive Videos and Generating Interactive Summaries. In Proc. ACM Multimedia.
Greiner, C., & Rose, T. (1998, November). A Web based training system for cardiac surgery: The role of knowledge management for interlinking information items. In Proc. of The World Congress on the Internet in Medicine, London.
Hardman, L. (1998, March). Modeling and Authoring Hypermedia Documents. Doctoral dissertation, University of Amsterdam, The Netherlands.
Hardman, L., Bulterman, D. C. A., & van Rossum, G. (1994b, February). The Amsterdam Hypermedia Model: Adding time and context to the Dexter Model. In Comm. of the ACM, 37(2).
Hardman, L., van Rossum, G., Jansen, J., & Mullender, S. (1994a). CMIFed: A transportable hypermedia authoring system. In Proc. of the Second ACM International Conference on Multimedia, San Francisco.
Hirzalla, N., Falchuk, B., & Karmouch, A. (1995). A temporal model for interactive multimedia scenarios. In IEEE Multimedia, 2(3).
Hunter, J. (1999, October). Multimedia metadata schemas. Retrieved June 16, 2004, from http://www2.lib.unb.ca/Imaging_docs/IC/schemas.html
IBM Corporation, USA (2004a). IBM research - Video semantic summarization systems. Retrieved June 15, 2004, from http://www.research.ibm.com/MediaStar/VideoSystem.html#Summarization%20Techniques
IBM Corporation, USA (2004b). QBIC home page. Retrieved June 16, 2004, from http://wwwqbic.almaden.ibm.com/
INRIA (2003). PocketSMIL 2.0. Retrieved February 24, 2004, from http://opera.inrialpes.fr/pocketsmil/
International Organisation for Standardization (1996). ISO 13522-5, Information technology - Coding of multimedia and hypermedia information, Part 5: Support for base-level interactive applications. Geneva, Switzerland: International Organisation for Standardization.
ISO/IEC (1999, July). JTC 1/SC 29/WG 11, MPEG-7: Context, Objectives and Technical Roadmap, V.12. ISO/IEC Document N2861. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001a, November). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 1: Systems. ISO/IEC Final Draft International Standard 15938-1:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001b, September). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 2: Description definition language. ISO/IEC Final Draft Int. Standard 15938-2:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001c, July). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 3: Visual. ISO/IEC Final Draft Int. Standard 15938-3:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001d, June). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 4: Audio. ISO/IEC Final Draft Int. Standard 15938-4:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
ISO/IEC (2001e, October). JTC 1/SC 29/WG 11. Information technology - Multimedia content description interface - Part 5: Multimedia description schemes. ISO/IEC Final Draft Int. Standard 15938-5:2001. Geneva, Switzerland: Int. Organisation for Standardization/Int. Electrotechnical Commission.
Jourdan, M., Layaïda, N., Roisin, C., Sabry-Ismail, L., & Tardif, L. (1998). Madeus, an authoring environment for interactive multimedia documents. ACM Multimedia.
Kim, M., Wood, S., & Cheok, L.-T. (2000, November). Extensible MPEG-4 textual format (XMT). In Proc. of the 8th ACM Multimedia Conf., Los Angeles.
Klas, W., Greiner, C., & Friedl, R. (1999, July). Cardio-OP: Gallery of cardiac surgery. IEEE International Conference on Multimedia Computing and Systems (ICMS 99), Florence.
Klyne, G., Reynolds, F., Woodrow, C., Ohto, H., Hjelm, J., Butler, M. H., & Tran, L. (2003). Composite capability/preference profile (CC/PP): Structure and vocabularies. W3C Working Draft 25/03/2003.
Kopf, S., Haenselmann, T., Farin, D., & Effelsberg, W. (2004). Automatic generation of summaries for the Web. In Proceedings of Electronic Imaging 2004.
Lemlouma, T., & Layaïda, N. (2003, June). Media resources adaptation for limited devices. In Proc. of the Seventh ICCC/IFIP International Conference on Electronic Publishing ELPUB 2003, Universidade do Minho, Portugal.
Lemlouma, T., & Layaïda, N. (2004, January). Context-aware adaptation for mobile devices. IEEE International Conference on Mobile Data Management, Berkeley, California, USA.
Little, T. D. C., & Ghafoor, A. (1993). Interval-based conceptual models for time-dependent multimedia data. In IEEE Transactions on Knowledge and Data Engineering, 5(4).
Macromedia, Inc., USA (2003, January). Using Authorware 7 [Computer manual]. Available from http://www.macromedia.com/software/authorware/
Macromedia, Inc., USA (2004). Macromedia. Retrieved June 15, 2004, from http://www.macromedia.com/
McKeown, K., Robin, J., & Tanenblatt, M. (1993). Tailoring lexical choice to the user's vocabulary in multimedia explanation generation. In Proc. of the 31st Conference of the Association for Computational Linguistics, Columbus, Ohio.
Oldenettel, F., & Malachinski, M. (2003, May). The LEBONED metadata architecture. In Proc. of the 12th International World Wide Web Conference, Budapest, Hungary (pp. 207-216). ACM Press, Special Track on Education.
Open Mobile Alliance (2003). User agent profile (UA Prof), 20/05/2003. Retrieved February 10, 2004, from http://www.openmobilealliance.org/
Oratrix (2004). GRiNS for SMIL homepage. Retrieved February 23, 2004, from http://www.oratrix.com/GRiNS
Papadias, D., & Sellis, T. (1994, October). Qualitative representation of spatial knowledge in two-dimensional space. In VLDB Journal, 3(4).
Papadias, D., Theodoridis, Y., Sellis, T., & Egenhofer, M. J. (1995, March). Topological relations in the world of minimum bounding rectangles: A study with R-Trees. In Proc. of the ACM SIGMOD Conf. on Management of Data, San Jose, California.
Pree, W. (1995). Design patterns for object-oriented software development. Boston: Addison-Wesley.
Rabin, M. D., & Burns, M. J. (1996). Multimedia authoring tools. In Conference Companion on Human Factors in Computing Systems, Vancouver, British Columbia, Canada. ACM Press.
Raggett, D., Le Hors, A., & Jacobs, I. (1998). HyperText markup language (HTML) version 4.0. W3C Recommendation, revised on 04/24/1998. Retrieved February 20, 2004, from http://www.w3c.org/MarkUp/
RealNetworks (2003). RealOne Player. Retrieved February 25, 2004, from http://www.real.com/
Rossi, G., Schwabe, D., & Guimarães, R. (2001, May). Designing personalized Web applications. In Proceedings of the Tenth World Wide Web (WWW) Conference, Hong Kong. ACM.
Rout, T. P., & Sherwood, C. (1999, May). Software engineering standards and the development of multimedia-based systems. In Fourth IEEE International Symposium and Forum on Software Engineering Standards, Curitiba, Brazil.
Scherp, A., & Boll, S. (2004a, March). MobileMM4U - Framework support for dynamic personalized multimedia content on mobile systems. In Multikonferenz Wirtschaftsinformatik 2004, special track on Technologies and Applications for Mobile Commerce.
Scherp, A., & Boll, S. (2004b, October). Generic support for personalized mobile multimedia tourist applications. Technical demonstration at the ACM Multimedia Conference, New York, USA.
Scherp, A., & Boll, S. (2005, January). Paving the last mile for multi-channel multimedia presentation generation. In Proceedings of the 11th International Conference on Multimedia Modeling, Melbourne, Australia.
Schmitz, P., Yu, J., & Santangeli, P. (1998). Timed interactive multimedia extensions for HTML (HTML+TIME). W3C, version 09/18/1998. Retrieved February 20, 2004, from http://www.w3.org/TR/NOTE-HTMLplusTIME
Stash, N., Cristea, A., & De Bra, P. (2004). Authoring of learning styles in adaptive hypermedia. In WWW04 Education Track. New York: ACM.
Stash, N., & De Bra, P. (2003, June). Building Adaptive Presentations with AHA! 2.0. In Proceedings of the PEG Conference, St. Petersburg, Russia.
Sun Microsystems, Inc. (2004). Java media framework API. Retrieved February 15, 2004, from http://java.sun.com/products/java-media/jmf/index.jsp
Szyperski, C., Gruntz, D., & Murer, S. (2002). Component software: Beyond object-oriented programming (2nd ed.). Boston: Addison-Wesley.
van Ossenbruggen, J. R., Cornelissen, F. J., Geurts, J. P. T. M., Rutledge, L. W., & Hardman, H. L. (2000, December). Cuypers: A semiautomatic hypermedia presentation system. Technical Report INS-R0025. CWI, The Netherlands.
van Rossum, G., Jansen, J., Mullender, S., & Bulterman, D. C. A. (1993). CMIFed: A presentation environment for portable hypermedia documents. In Proc. of the First ACM International Conference on Multimedia, Anaheim, California.
Villard, L. (2001, November). Authoring transformations by direct manipulation for adaptable multimedia presentations. In Proceedings of the ACM Symposium on Document Engineering, Atlanta, Georgia.
Wahl, T., & Rothermel, K. (1994, May). Representing time in multimedia systems. In Proc. IEEE Int. Conf. on Multimedia Computing and Systems, Boston.
Wu, H., de Kort, E., & De Bra, P. (2001). Design issues for general-purpose adaptive hypermedia systems. In Proc. of the 12th ACM Conf. on Hypertext and Hypermedia, Århus, Denmark.
Yahoo!, Inc. (2002). MyYahoo!. Retrieved February 17, 2004, from http://my.yahoo.com/



Chapter 12

The Role of Relevance Feedback in Managing Multimedia Semantics:


A Survey
Samar Zutshi, Monash University, Australia
Campbell Wilson, Monash University, Australia
Shonali Krishnaswamy, Monash University, Australia
Bala Srinivasan, Monash University, Australia

ABSTRACT

Relevance feedback is a mature technique that has been used to take user subjectivity into account in multimedia retrieval. It can be seen as an attempt to bridge the semantic gap by keeping a human in the loop. A variety of techniques have been used to implement relevance feedback in existing retrieval systems. An analysis of these techniques is used to develop the requirements of a relevance feedback technique that aims to be capable of managing semantics in multimedia retrieval. It is argued that these requirements suggest a case for a user-centric framework for relevance feedback with low coupling to the retrieval engine.

INTRODUCTION
A key challenge in multimedia retrieval remains the issue often referred to as the semantic gap. Similarity measures computed on low-level features may not correspond well with human perceptions of similarity (Zhang et al., 2003). Human perceptions of similarity of multimedia objects such as images or video clips tend to be semantically based, that is, the perception that two multimedia objects are similar arises from these two objects evoking similar or overlapping concepts in the user's mind. Therefore, different users posing the same query may have very different expectations of what they are looking for. On the other hand, existing retrieval systems tend to return the same results for a given query. In order to cater to user subjectivity, and to allow for the fact that the user's perception of similarity may be different from the system's similarity measure, the user needs to be kept in the loop. Relevance feedback is a mature and widely recognised technique for making retrieval systems better satisfy users' information needs (Rui et al., 1997). Informally, relevance feedback can be interpreted as a technique that should be able to understand the user's perception of semantic similarity and to incorporate it in subsequent iterations. This chapter aims to provide an overview of the rich variety of relevance feedback techniques described in the literature while examining issues related to the semantic implications of these techniques. Section 2 presents a discussion of the existing literature on relevance feedback and highlights certain advantages and disadvantages of the reviewed approaches. This analysis is used to develop the requirements of a relevance feedback technique that would be an aid in managing multimedia semantics (Section 3). A high-level framework for such a technique is outlined in Section 4.

RELEVANCE FEEDBACK IN CONTENT-BASED MULTIMEDIA RETRIEVAL


Broadly speaking, anything the user does or says can be used to interpret something about their view of a computer system; for example, the time spent at a Web page, the motion of their eyes while viewing an electronic document, and so forth. In the context of content-based multimedia retrieval we use the term relevance feedback in the conventional sense, whereby users are allowed to indicate their opinion of results returned by a retrieval system. This is done in a number of ways; for example, the user only selects results that they consider relevant to their query, the user provides positive as well as negative examples, or the user is left to provide some sort of ranking of the images. In general terms, the user classifies the result set into a number of categories. The relevance feedback module should be able to use this classification to improve subsequent retrieval. It is expected that several successive iterations will further refine the result set, thus converging to an acceptable result. What makes a result acceptable is dependent on the user; it may be a single result (the so-called target searching of Cox et al. (1996)). On the other hand, acceptable may mean a sufficient number of relevant results (as when the users are performing a category search). Intuitively, for results to approximate semantic retrieval, the relevance feedback mechanism should understand why users mark the results the way they do. It should ideally be able to identify not only what is common about the results belonging to a
particular category but also take into account subtleties like the difference between an example being marked nonrelevant and being left unmarked. There is a rich body of literature on the subject of relevance feedback, particularly in the context of content-based image retrieval (CBIR). For instance, Zhang et al. (2003) interpret relevance feedback as a machine-learning problem and present a comprehensive review of relevance feedback algorithms in CBIR based on the learning and searching natures of the algorithms. Zhou and Huang (2001b) discuss a selection of relevance feedback variants in the context of multimedia retrieval examined under seven conceptual dimensions. These range from simple dimensions such as "What is the user looking for?" to the highly diverse "The goals and the learning machines" (Zhou & Huang, 2001b). We examine a variety of existing relevance feedback methods, many from CBIR, since this area has received much attention in the literature. We classify them into one of five broad approaches. While there are degrees of overlap, this taxonomy focuses on the main conceptual thrust of the techniques and may be taken as representative rather than exhaustive.

The Classical Approach

In this approach, documents and queries are represented by vectors in an n-dimensional feature space. The feedback is implemented through Query Point Movement (QPM) and/or Query ReWeighting (QRW). QPM aims to estimate an ideal query point by moving it closer to positive example points and away from negative example points. QRW tries to give higher importance to the dimensions that help in retrieving relevant images and to reduce the importance of the ones that do not. (A generic formulation of both is sketched at the end of this subsection.) The classical approach is arguably the most mature approach. Although elements of it can be found in almost all existing techniques, certain techniques are explicitly or very strongly of this type. The MARS system (Rui et al., 1997) uses a reweighting technique based on a refinement of the text retrieval approach. Later work by Rui and Huang (1999) has been adapted by Kang (2003) to use relevance feedback to detect emotional events in video. To overcome certain disadvantages of these approaches, such as the need for ad-hoc constants, the relevance feedback task is formalised as a minimisation problem in Ishikawa et al. (1998). A novel framework is presented in Rui and Huang (1999) based on a two-level image model, with features like colour, texture and shape occupying the higher level. The lower level contains the feature vector for each feature. The overall distance between a training sample and the query is defined in terms of both these levels. The performance of the MARS and MindReader systems is compared against the novel model in terms of the percentage of relevant images returned. While relevance feedback boosts the retrieval performance of all the techniques, the novel framework is able to consistently perform better than MARS and MindReader (Rui & Huang, 1999). The vector space model is not restricted to CBIR. Liu and Wan (2003) have developed two relevance feedback algorithms for use in audio retrieval with time domain features as well as frequency domain features. The first is a standard-deviation-based reweighting, while the second relies on minimizing the weighted distance between the relevant examples and the query. They demonstrate an average precision increase as a result of the use of relevance feedback and claim that "through the relevance feedback
some [...] semantics can be added to the retrieval system", although the claim is not very well supported. The recent techniques in this area are characterised by their robustness, solid mathematical formulation and efficient implementations. However, their semantic underpinnings may be seen as somewhat simplistic due to the underlying assumptions that (a) a single query vector is able to represent the user's query needs and (b) visually similar images are close together in the feature space. There is also often an underlying assumption in methods of this kind that the user's ideal query point remains static throughout a query session. While this assumption may be justifiable when performing automated tests for the computation of performance benchmarks, it clearly does not capture users' information needs at a semantic level.
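As a generic illustration (not the exact formulation of any of the systems cited above), QPM is commonly written as a Rocchio-style update and QRW as an inverse-spread weighting:

q' = \alpha q + \frac{\beta}{|R|} \sum_{d \in R} d \; - \; \frac{\gamma}{|N|} \sum_{d \in N} d, \qquad w_i \propto \frac{1}{\sigma_i},

where q is the current query vector, R and N are the sets of feature vectors of the examples marked relevant and nonrelevant, \alpha, \beta, \gamma are tuning constants, \sigma_i is the standard deviation of the i-th feature over the relevant examples, and the weights w_i are typically normalised so that \sum_i w_i = 1.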

The Probabilistic-Statistical Approach


In this approach, feedback is used to estimate the probability that other documents/images are relevant, often by means of Bayesian reasoning, as in the PicHunter system (Cox et al., 1996). Cox et al. are interested in the case where the user searches for a specific target image. The probability that a given datum is the target is expressed in terms of the probability of the possible user actions (marking). The number of iterations taken to retrieve the target image (the search length) is used as an objective benchmark. At each iteration four images are displayed; the user can select zero or more images as relevant, and unmarked ones are considered not relevant (absolute judgement). A subsequent extension (Cox et al., 1998), the stochastic-comparison search, is based on interpreting the marked images as more relevant than unmarked ones (relative judgement); relevance feedback is addressed as a comparison-searching problem. The stochastic model of Geman and Moquet (1999) tries to improve the PicHunter technique by recognising that the actual interactive process is more random than a noisy response to a known metric. Hence the metric itself is modeled as a random variable in three cases: in the first case with a fixed metric, in the second with a fixed distribution, and in the general case with distributions that vary with the display and target. The more complex and general model is seen to perform better in terms of mean search length than the ones with simplifying assumptions when humans perform the tests. The BIR probabilistic retrieval model (Wilson et al., 2001) relies on inference within Bayesian networks. Evidential inference within the Bayesian networks is employed for the initial retrieval. Following this, diagnostic inference, suppressed in the initial retrieval, is used for relevance feedback subsequent to user selection of images (Wilson & Srinivasan, 2002). The generality and elaborate modeling of user subjectivity is representative of the more sophisticated techniques in this approach, while the computational complexity associated with such modeling can be a challenge during implementation. The semantic interpretation of the techniques in this area is more appealing than that of the classical approach. It seems conceptually more robust to try to estimate the probability of a given datum being the target given the user's actions and the information signatures of the multimedia objects. The intuitive expectation that elaborate and complex modeling of the user response pays off seems to be confirmed by Geman and Moquet (1999), who ascribe this behaviour to the greater allowance for variation and subjectivity in human decisions.
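
A minimal sketch of the Bayesian update underlying such target search may help to fix ideas. The user model assumed below (images visually close to the target are more likely to be marked) is introduced purely for illustration and is far simpler than PicHunter's actual action model; sigma and the feature arrays are likewise illustrative assumptions.

    import numpy as np

    def bayesian_update(prior, displayed, marked, features, sigma=1.0):
        # Posterior probability that item i is the target, given how the
        # currently displayed images were marked: posterior ~ prior * likelihood.
        posterior = np.asarray(prior, dtype=float).copy()
        for i in range(len(posterior)):
            likelihood = 1.0
            for d in displayed:
                # Assumed user model: the closer a displayed image is to the
                # (hypothetical) target i, the more likely the user marks it.
                sim = np.exp(-np.linalg.norm(features[d] - features[i]) / sigma)
                likelihood *= sim if d in marked else (1.0 - 0.5 * sim)
            posterior[i] *= likelihood
        return posterior / posterior.sum()

At each iteration the system displays a new set of candidates and repeats the update until the posterior mass concentrates on the target.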

The Machine Learning Approach


The problem of relevance feedback has also been considered a machine-learning issue: the approach is to learn user preferences over time. Typically, the query image and the images marked relevant are used as positive training samples, and the images marked nonrelevant are used as negative training samples. MacArthur et al. (2000) use decision trees as the basis of their relevance feedback framework, which they call RFDT. The samples are used to partition the feature space and to classify the entire database of feature vectors based on the partitioning. RFDT stores historical information from previous feedback and uses it while inducing the decision tree. The gain in average precision is documented to improve with a higher number of images marked relevant at each iteration during testing on high-resolution computed-tomography (HRCT) images of the human lung. Koskela et al. (2002) demonstrate that the PicSOM image retrieval system (Laaksonen et al., 2000) can be enhanced with relevance feedback. The assumption is that images similar to each other are located near each other on the Self-Organising Map surfaces. The semantic implications of these methods seem to have close parallels with those of the vector-space approach. An interesting subclass of these techniques augments a classical or probabilistic method by learning from past feedback sessions. For example, the application of Market Basket Analysis to data from log files to improve relevance feedback is described in Müller et al. (2004). The idea is to identify pairs of images that have been marked together (either as relevant or nonrelevant) and to use the frequency of their being marked together to construct association rules and calculate their probabilities. Depending on the technique used to combine these probabilities with the existing relevance feedback formula, significant gains in precision and recall can be achieved. An unorthodox variation on the use of machine learning for relevance feedback involves an attempt to minimize user involvement. This is done by adapting a self-training neural network to model the notion of video similarity through automated relevance feedback (Muneesawang & Guan, 2002). The method relies on a specialized indexing paradigm designed to incorporate the temporal aspects of the information in a video clip. The machine learning methods usually place an emphasis on the user's classification of the results, which is a conceptually natural and appealing way to model relevance feedback. However, obtaining an effective training sample may be a challenge. Further, care needs to be taken to ensure that considerations of user subjectivity are not lost while considering historical training data that may have been contributed by multiple users.
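
The decision-tree idea can be sketched as follows. This is a simplified stand-in for RFDT using scikit-learn rather than the authors' own implementation; it assumes that at least one relevant and one nonrelevant example have been accumulated, and the class name, max_depth and top_k values are illustrative.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class DecisionTreeFeedback:
        # Accumulates feedback across iterations (the "historical information")
        # and classifies the whole database at each iteration.
        def __init__(self):
            self.samples, self.labels = [], []

        def add_feedback(self, relevant_vectors, nonrelevant_vectors):
            self.samples += list(relevant_vectors)
            self.labels += [1] * len(relevant_vectors)
            self.samples += list(nonrelevant_vectors)
            self.labels += [0] * len(nonrelevant_vectors)

        def rank(self, database, top_k=20):
            tree = DecisionTreeClassifier(max_depth=5)
            tree.fit(np.array(self.samples), np.array(self.labels))
            relevant_col = list(tree.classes_).index(1)
            scores = tree.predict_proba(np.asarray(database))[:, relevant_col]
            return np.argsort(-scores)[:top_k]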

The Keyword Integration Approach


A method for integrating semantic (keyword) features and low-level features is presented in Lu et al. (2000). The low-level features are dealt with in a vector-space style, and a semantic network is built to link the keywords to the images. An extension to this approach with cross-session memory is presented in Zhang et al. (2003). These techniques show that the computationally efficient vector-space style approach (for low-level features) and the machine-learning approach (for keywords) can be integrated. While these methods incorporate some elements of other approaches, we present them separately since the integration of keyword as well as visual features is their distinguishing characteristic rather than their use of such techniques.
These techniques enable so-called cross-modal queries, which allow users to express queries in terms of keywords and yet have low-level features taken into account, and vice versa. Further, Zhang et al. (2003) are able to utilise a learning process to propagate keywords to unlabelled images based on the user's interaction. The motivation is that in some cases the user's information need is better expressed in terms of keywords (possibly also combined with low-level features) than in terms of low-level features alone. It seems to be taken for granted in these techniques that, in semantic terms, keywords represent a higher level of information about the images than the visual features. We elaborate on these assumptions in the third section.
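
A toy integration of the two feature types is sketched below. The linear combination, the Jaccard-style keyword similarity and the keyword propagation rule are illustrative assumptions, not the actual formulations of Lu et al. (2000) or Zhang et al. (2003).

    import numpy as np

    def combined_score(query_vec, query_keywords, image_vec, image_keywords,
                       w_visual=0.5, w_keyword=0.5):
        # Weighted combination of a low-level (vector-space) similarity and a
        # keyword-overlap similarity.
        visual = 1.0 / (1.0 + np.linalg.norm(np.asarray(query_vec, dtype=float)
                                             - np.asarray(image_vec, dtype=float)))
        union = query_keywords | image_keywords
        keyword = len(query_keywords & image_keywords) / len(union) if union else 0.0
        return w_visual * visual + w_keyword * keyword

    def propagate_keywords(query_keywords, relevant_ids, annotations):
        # Propagate the query keywords to images the user marked relevant,
        # mimicking the keyword-learning idea described above.
        for image_id in relevant_ids:
            annotations.setdefault(image_id, set()).update(query_keywords)
        return annotations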

The Mathematical-Conceptual Approach


There has been a recent emergence of techniques that interpret the relevance feedback problem in a novel manner. Established conceptual and mathematical tools are used to develop elaborate frameworks in order to capture the semantic richness and diversity of the problem. Grootjen and van der Weide (2002) acknowledge that it is difficult to grasp the user's information need since it is an internal mental state. If the user were able to browse through an entire document collection and pick out all the relevant documents, such a selection may be interpreted as a representation of the information need of that user with respect to the given collection. Reasoning in this manner implies that the result set returned is an approximation of the user's information need. To ensure that this approximation is close to the actual need, they postulate a semantic structure of interconnected nodes. Each node represents a so-called concept; this overall structure becomes a concept lattice and can be derived using the theory of Formal Concept Analysis (Wille, 1982). Zhuang et al. (2002) have developed a two-layered graph-theoretic model for relevance feedback in image retrieval that aims to capture the correlations between images. The two layers are the visual layer and the semantic layer. Each layer is an undirected graph with the nodes representing images and the links representing correlations. The nodes in the two layers actually represent the same images, but the correlations are interpreted differently: a correlation between images in the visual layer denotes a similarity based on low-level features, while a correlation in the semantic layer represents a high-level semantic correlation. The visual links are obtained in an offline manner at the time images are added to the database. An algorithm to learn the semantic links based on feedback is outlined in the paper. These semantic links are then exploited in subsequent iterations, and the semantic information becomes more beneficial in the long term as the links are updated. Zutshi et al. (2003) focus strongly on the feedback provided by the user, considering it a classification problem and modeling it using rough set theory (Pawlak, 1982). A rough set decision system is constructed at each iteration of feedback and used to analyse the user response. Two feature reweighting schemes were proposed, and the integration of this relevance feedback technique with a probabilistic retrieval system was demonstrated.
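
The two-layer idea can be caricatured in a few lines of code. The additive update of the semantic links, the link weights and the correlation formula are illustrative placeholders for the learning algorithm of Zhuang et al. (2002), not a reproduction of it.

    from collections import defaultdict
    from itertools import combinations

    class TwoLayerGraph:
        # Visual links are computed offline when images are added; semantic
        # links are strengthened whenever two images are marked relevant in
        # the same feedback iteration.
        def __init__(self, visual_links):
            self.visual = dict(visual_links)        # {(i, j): similarity}
            self.semantic = defaultdict(float)      # learned from feedback

        def learn_from_feedback(self, relevant_ids, increment=1.0):
            for i, j in combinations(sorted(relevant_ids), 2):
                self.semantic[(i, j)] += increment

        def correlation(self, i, j, w_visual=0.4, w_semantic=0.6):
            key = tuple(sorted((i, j)))
            return (w_visual * self.visual.get(key, 0.0)
                    + w_semantic * self.semantic.get(key, 0.0))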

Analysis and Emerging Trends


A summary of the approaches reviewed is presented in Table 1 to establish their advantages and disadvantages with respect to the management of multimedia semantics.

Table 1. Advantages and disadvantages of the various approaches with respect to multimedia semantics
Approach: Classical
  Advantages: Computationally more efficient than other approaches; mature.
  Disadvantages: Simplistic semantic interpretation of the user's information need (ideal query vector).

Approach: Probabilistic-Statistical
  Advantages: Conceptually appealing model of the user response in the more sophisticated methods.
  Disadvantages: Computational costs can become prohibitive with complex modeling.

Approach: Machine Learning
  Advantages: Avoids loss of information gained across sessions.
  Disadvantages: Training sample can take time to accumulate; in some techniques user subjectivity is not explicitly catered for.

Approach: Keyword Integration
  Advantages: Can take advantage of keyword annotations as well as visual features, which may be able to capture the user's information need well.
  Disadvantages: Possible impedance mismatch between the different techniques used for visual and keyword features; difficulty in obtaining keyword annotations.

Approach: Mathematical-Conceptual
  Advantages: Few assumptions regarding the user's information need, hence more user-centric.
  Disadvantages: Computational costs can become prohibitive with complex modeling.

The review of the literature above suggests certain trends. There is a movement towards more general and comprehensive approaches. In the vector-space arena this can be seen in the changes from MARS (Rui et al., 1997) to MindReader (Ishikawa et al., 1998) to the novel framework of Rui and Huang (1999). The vector-space style interpretation of relevance feedback can be seen to be increasingly general, increasingly complex and reliant on fewer artificial parameters in its evolution, if we may term it thus. The spirit of recent generalised methods, such as Rui and Huang (1999) and Geman and Moquet's (1999) stochastic model (an extension of the work by Cox et al., 1996), acknowledges a much higher level of complexity than was previously recognised. The application of machine learning techniques and the augmentation of relevance feedback methods with historical information also present a promising avenue. Extending these approaches to incorporate per-user relevance feedback, as well as the deduction of implicit truths that may apply across a large number of users, will probably continue. The use of keyword features as a means to augment visual features is also a step forward. However, the success of its adoption in applications other than Web-based ones may depend on the feasibility of obtaining keyword annotations for large multimedia collections. Such keyword annotations could perhaps be accumulated manually during user feedback.

CONSIDERATIONS FOR A RELEVANCE FEEDBACK TECHNIQUE TO HANDLE MULTIMEDIA SEMANTICS


While considering existing techniques from the perspective of selecting appropriate strategies for a proposed multimedia system or for developing a new one, several
factors need to be taken into account. Zhou and Huang (2001a) have outlined several such factors while interpreting relevance feedback as a classification problem. Among these is the small sample issue, that is, the fact that in a given session the user provides a small number of training examples. Another key issue is the inherent asymmetry involved: the training is in the form of a classification, while the desired output of retrieval is typically a rank-ordered top-k return. They proceed to present a compilation of critical issues to consider while designing a relevance feedback algorithm (Zhou & Huang, 2001b). While their analysis continues to be relevant and the issues presented remain pertinent, they do not explicitly take into account the added complexities of certain semantic implications. We attempt to outline certain critical considerations that need to be taken into account while selecting an existing relevance feedback technique (or developing a new one) to be used with a semantically focussed multimedia retrieval system. These requirements are based on the analysis and review of the literature in the second section.

Conceptual Robustness and Soft Modeling of the User Response


If semantic issues are a concern in a multimedia retrieval system, the relevance feedback must be based on a robust conceptual model, one which captures at a high level the complexities of relevance feedback and the issues of subjectivity. Since most existing techniques are strongly analytical and often rely on what is essentially a reductionist approach, missing out on important details or making simplifying assumptions too early in the modeling process can lead to deficiencies at the implementation stage. For instance, an elaborate vector-space-style model proposes at the very outset that the user has an "ideal query vector" in mind and that the distance of the sample vectors from this ideal vector is a generalised ellipsoid distance (Ishikawa et al., 1998). From a semantic viewpoint, if we acknowledge that the user's information needs may be complex and highly abstract, claiming that an ideal query vector can represent this may not be conceptually justifiable. The user's information needs in open-ended browsing may evolve based on the results initially presented to them. Indeed, users may not always start out with a fixed, clear query in their mind at all. To allow for these possibilities, the methods listed under the Mathematical-Conceptual approach (see that section) attempt to formulate a much higher-level and more general model of the relevance feedback scenario. This means that simplifying assumptions are only made at a later stage, as and when they become necessary. This avoids the reductionist trap whereby, if something relatively minor is missing at the higher levels, the lower levels may suffer greatly. In this sense, these approaches could be seen as having a conceptual advantage. However, this is not to be taken to mean that we endorse all complex and general models for relevance feedback blindly, nor that we deny that the harder approaches are useful. We merely point out the conceptual appeal of modeling a complex semantic issue in a fitting manner, and that doing so could be used as the basis of a more thorough analysis and cater to a wider variety of circumstances.

An analogous argument can be made for the modeling of the user response. A user's act of indicating that a few results are of some type can be challenging to interpret semantically. There is much room for ambiguity; for example, does marking a given result relevant imply that this result meets the user's need, or is it an indication that the need is not met but
the result marked relevant is a close approximation? Further, in the case when negative examples are employed (i.e., the user is allowed to nominate results as either being relevant or nonrelevant), what does the fact that the user left a result unmarked signify? For instance, Cox et al. (1998) make a distinction between absolute judgement, where images marked by the user are nominated as relevant while unmarked ones are considered nonrelevant, and relative judgement, which interprets the user feedback to mean that the marked images are more relevant than the unmarked ones. Geman and Moquet (1999) observe that their most general model, with the fewest simplifying assumptions, may be the most suitable for modeling the human response, which, however, comes with a penalty in terms of computational complexity. Again, a general model, which incorporates a soft approach, could be seen as more conceptually robust than a harder one.

Another important criterion for conceptual robustness is the compatibility of the tools used while developing a relevance feedback strategy. For instance, in the keyword-integration techniques a semantic web based on keywords is integrated with a vector-space-style representation of low-level features. Care must be taken to ensure that in cases like these, when multiple tools are combined, conceptually compatible approaches are used rather than incompatible tools being fused in an ad hoc manner. Interestingly, Cox et al. (1996) take notice of the fact that PicHunter seems to be able to place together images that are apparently semantically similar. This remains hard to explain. They advance the conjecture that their algorithms produce probability distributions that are very complex functions of the feature sets used. This phenomenon of the feature sets combined with user feedback producing seemingly semantic results does not seem to be explored in great depth by the literature on relevance feedback and possibly merits further investigation.
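
The difference between the two readings of the same marking action can be made explicit in a small sketch; the dictionary-based output format is merely illustrative.

    def interpret_feedback(displayed, marked, mode="absolute"):
        # Two interpretations of one user action (cf. Cox et al., 1998):
        # "absolute" - marked items are relevant, unmarked displayed items are
        # nonrelevant; "relative" - marked items are only *more* relevant than
        # the unmarked ones, so only preference pairs are produced.
        marked = set(marked)
        unmarked = [d for d in displayed if d not in marked]
        if mode == "absolute":
            return {"relevant": sorted(marked), "nonrelevant": unmarked}
        return {"preferences": [(m, u) for m in sorted(marked) for u in unmarked]}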

Computational Complexity and Effective Representation


Computational complexity and effective representation of the relevance feedback problem tend to be conflicting requirements. For the sake of computational efficiency, certain simplifying assumptions often have to be made, particularly in probabilistic techniques. These simplifications are designed to make a technique efficient to implement but almost invariably result in a loss of richness of representation. Such simplifications include the assumption that the probability that a given datum is the target is independent of who the user is, the assumption that the user's action in a particular iteration is independent of the previous iterations (Cox et al., 1996), and the simplified metrics of Geman and Moquet (1999), which offer computational benefits over their most general (and arguably most effective) model. As is often the case, there is a trade-off to be made. If semantic retrieval is critical, users may be willing to allow for longer processing times; for example, when matching mug shots of criminal offenders to the query image of a suspect, high retrieval accuracy may be important enough to warrant longer waiting times. If, on the other hand, users are likely to use a retrieval system for so-called category searches, they may prefer lower processing times with acceptable results rather than very accurate results based on a sophisticated model of their preferences. However, semantic considerations may add another dimension to the computational complexity issue. The challenge is interpreting and representing the relevance feedback problem in a way that captures enough detail while remaining
computationally feasible. Ideally, the test of whether enough detail has been captured could be based on the notion of emergent meaning (Santini et al., 2001). If a relevance feedback technique can facilitate sufficiently complex interaction between the user and the retrieval system for semantic behaviour to emerge as a result of the interaction, the technique has effectively represented all aspects of the problem. The computational issues are related: if the user's interaction is useful in the emergence of semantic behaviour by the system, extended waiting times will influence their state of mind (frustration, disinterest) and possibly impair the semantic performance of the relevance feedback technique.

Keyword Features and Semantics


While acknowledging the breakthrough made in recent approaches such as Lu et al. (2000) and Zhang et al. (2003), whereby relevance feedback models can take advantage of keywords associated with images, it must be noted that simply incorporating keywords into a relevance feedback technique may not be sufficient to address the issue of semantic retrieval. A keyword can be seen as a way to represent a concept, part of a concept or multiple concepts, depending on the keyword itself and its context. In multimedia collections where there are a relatively small number of ground truths and/or the concepts that specific keywords refer to are agreed upon, definite improvements in retrieval can be achieved by incorporating keywords. These can be obtained from textual data associated with a multimedia object, such as existing captions or descriptions in web pages. However, automatically associating keywords with collections of multimedia objects that have not been somehow manually annotated remains difficult. A suitable position for a relevance feedback technique to take would be to allow for the use of keywords so that, if they can be obtained (or are already available), their properties can be exploited. This is indeed done in Zhang et al. (2003). It is worth bearing in mind that, depending on the application domain, visual features and keywords may have varying distinguishing characteristics. In relatively homogeneous collections requiring expert knowledge, keywords would clearly be very useful, for example, in finding images similar to those of a given species of flower (www.unibas.ch/botimage). In this case, it is likely that the user's semantic perception can be reflected in the fact that a domain expert (or a classifier) has assigned a particular label (the name of the species) to a given image. In other contexts, say a collection of images of clothes, the user may be looking for matches such as "blue shirts but not trousers", which could be addressed by a combination of visual features (colour and shape), by a combination of keywords and visual features (colour and textual annotation of garment type) or even completely by keywords (textual annotations of colour and garment type). In such cases, it may not be semantically appropriate to consider keywords as being higher-level features than the visual features, as seems to be done by Zhang et al. (2003). This indicates that constructing a semantic web on top of a multimedia database might not accurately reflect the semantic content of a multimedia collection. A final point regarding the use of keywords as a feature: keywords may be inadequate to capture the semantic content of media due to the richness of the content (Dorai et al., 2002). The role of the relevance feedback mechanism then becomes to try and adapt to the user by capturing the interaction between keywords and visual
information to mimic the user's semantic perception. A future direction along these lines may be to combine the work of Zhang et al. (2003) with the work of Rui and Huang (1999), which incorporates the two-layer model for visual information. It may then be possible to build parallel or even interconnecting networks of visual and textual features in a step towards truly capturing the semantic content of multimedia objects.

Limitations of Precision and Recall


A perusal of the literature on relevance feedback demonstrates that by far the most common method of indicating performance is to present precision-recall measures. This may make it tempting to select a relevance feedback technique that has been documented as producing the best gain in terms of precision and recall. However, this may not necessarily be advisable in the case of a multimedia retrieval system that aims to handle semantic content. From this point of view, two major criticisms can be made of the use of precision and recall. Firstly, the practical utility of these measures from an end-user point of view (as opposed to the point of view of researchers) has been questioned. For example, a study outlined in Su (1994) contained the conclusion that precision was not a good indicator of user perceptions of retrieval quality. In that study, recall was not evaluated as a quality measure, since this would involve advance knowledge of the contents of the image database by the end users; an unrealistic assumption for practical image database systems. Such criticism of recall and precision has in fact been made almost from the time of their conception. For example, it is stated in Cleverdon (1974) that the use of recall and precision in highly controlled experimental settings did not necessarily translate well to operational systems. A key contention common to these negative analyses of recall and precision would appear to be that simple numerical scores of researcher-determined relevance judgments do not capture the semantic intentions of real users. The second criticism to be levelled at recall and precision (and indeed one that can be applied to many attempts at the objective evaluation of content-based multimedia retrieval) concerns the nature of relevance itself. Relevance is directly correlated with the highly subjective notion of similarity and is therefore itself highly subjective. The research in this area indicates that relevance is a difficult concept to define, and many different definitions of relevance are in use in information retrieval (Mizzaro, 1998). In the textual information retrieval arena, concepts such as contextual pertinence (Ruthven & van Rijsbergen, 1996) and situational relevance (Wilson, 1973) are also used to further express the semantic information needs of users. Attempts at the evaluation of multimedia retrieval systems based on numeric relevance measures, particularly those drawing comparisons with other systems, should therefore be made with caution. As Hersh (1994) states, these measures are important, but they must be placed in perspective and particular attention must be paid to their meaning.
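
For reference, the two measures reduce a whole retrieval session to two numbers, as the short sketch below makes plain; the ranked result list and the researcher-determined relevance set are assumed inputs.

    def precision_recall_at_k(ranked_ids, relevant_ids, k):
        # Precision: fraction of the top-k results that are relevant.
        # Recall: fraction of all relevant items that appear in the top-k.
        retrieved = ranked_ids[:k]
        hits = sum(1 for item in retrieved if item in relevant_ids)
        precision = hits / k if k else 0.0
        recall = hits / len(relevant_ids) if relevant_ids else 0.0
        return precision, recall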

USER-CENTRIC MODELING OF RELEVANCE FEEDBACK


From the review of various feedback techniques in the second section and the requirements of relevance feedback outlined in the third section, it can be seen that much
of the current literature suggests a strong focus on the underlying retrieval engine. Indeed, the existing literature often tends to be constrained by the value set of the features, for example, the separate handling of numerical and keyword features (Lu et al., 2000; Zhang et al., 2003). In this section we make a case for, and present, an alternate perspective. Nastar et al. (1998) point out that image databases can be classified into two types. The first type is the homogeneous collection, where there is a ground truth regarding perceptual similarity. This may be implicitly obvious to human users, such as in photographs of people, where two images are either of the same person or not. Otherwise, experts may largely agree upon similarity; for example, two images either represent the same species of flower, or they do not. However, in the second type, the heterogeneous collection, no ground truth may be available. A characteristic example of this type is a stock collection of photographs. Systems designed to perform retrieval on this second category of collection should therefore be as flexible as possible, adapting to and learning from each user in order to satisfy their goal (Nastar et al., 1998).

This classification has significance in the context of databases of other multimedia objects such as video and audio clips. In homogeneous collections, retrieval strategies may perform adequately even if they are not augmented by relevance feedback. This is because the feature set and similarity measure can be chosen in order to best reflect the accepted ground truths. This would not be true of heterogeneous collections. Relevance feedback becomes increasingly necessary with heterogeneity in large multimedia collections. When there is no immediately obvious ground truth, the semantic considerations can become extremely complex. We try to illustrate this complexity in Figure 1, which uses notation similar to that of ER Diagrams, while eschewing the use of cardinality. The user's information need may be specific to any of several relevance entities: the users themselves (subjectivity); the context of their situation (e.g., while considering fashion, the keyword "model" would map to a different semantic concept than if cars were being considered); or the semantic concept(s) associated with their information need (which may or may not be expressible in terms of keywords and visual features). There would then be zero or more multimedia objects that would be associated with the user's information need. The key point is that the information need is potentially specific to all four entities in the diagram.

Figure 1. Complex information need in a heterogeneous collection

(The figure shows the Information Need connected to four relevance entities: User, Multimedia Object, Semantic Concept, and Context.)

It can be seen that while considering whether a multimedia object satisfies a user's information need, all the factors that are to be taken into account are based on the user. This clearly highlights the importance of relevance feedback for semantic multimedia retrieval: feedback provided by the user can be used to ensure that the concept symbolising their information need is not neglected. It follows that the underlying model of relevance feedback must be explicitly user-centric and semantically based to better meet the user's information needs. In order to identify a way to address this challenge, the overall multimedia retrieval process can be reviewed. A high-level overview of the information flow in the relevance feedback and multimedia retrieval process is presented in Figure 2. In existing approaches, the retrieval mechanism and the relevance feedback module have often been formulated in very close integration with each other, as in the classical approach. In the extreme case, the initial query can be interpreted as the initial iteration of user feedback. For example, in PicHunter (Cox et al., 1996; Cox et al., 1998) and MindReader (Ishikawa et al., 1998), there is no real distinction between the relevance feedback and the retrieval. As an alternative approach, it is possible to interpret the task of relevance feedback as being distinct from retrieval, by identifying and separating their roles. Retrieval tries to answer the question "Which objects are similar to the query specification?" Feedback deals with the questions "What does the fact that objects were marked in this fashion tell us about what the user is interested in?" and "How can this information be best conveyed to the retrieval engine?" As reflected in Figure 2, the task of relevance feedback can then be considered a module in the overall multimedia retrieval process. If the relevance feedback technique is modeled in a general fashion, it can focus on interpreting user input and be loosely coupled to the retrieval engine and the feature set. By reducing the dependency of the relevance feedback module on the feature set and its value set, the incorporation of novel visual and semantic features as they become available would be possible.

Figure 2. A high-level overview of the multimedia retrieval process


(Components: User Interface, Feature Extractor, Object Collection, Retrieval Engine, Relevance Feedback Module and Log; the information flows include the Query, Query Specification, Feature Descriptions, Result Set, Classified Results and Query Modification Specification.)

A highly general relevance feedback model would also support user feedback in terms of an arbitrary number of classes, that is, support the classification of the result set into the categories relevant and unmarked; relevant, nonrelevant and unmarked; or even, say, results given an integral ranking between one and five. The interpretations of each case would, of course, have to be consistent to maintain semantic integrity. An interesting effect of modeling relevance feedback in this way is the possibility of using multiple relevance feedback modules with the same retrieval engine. A relevance feedback controller could then potentially be developed to apply the most suitable relevance feedback algorithm depending on specific users' requirements in the context of the collection. This would mean that the controller would have the task of identifying which relevance feedback technique would best approximate the user's information need at the semantic level.
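
One possible shape for such a loosely coupled module and controller is sketched below; the interface, the class names and the trivial selection policy are our own assumptions, intended only to illustrate the decoupling argued for above.

    from abc import ABC, abstractmethod

    class RelevanceFeedbackModule(ABC):
        # Interprets user feedback and produces a query modification
        # specification; it is deliberately unaware of the retrieval engine's
        # internals and of the value types of the feature set.
        @abstractmethod
        def interpret(self, query_spec, classified_results):
            # classified_results maps each returned object to an arbitrary
            # class label ("relevant", "nonrelevant", a 1-5 rating, ...).
            ...

    class FeedbackController:
        # Chooses, per user and collection context, which feedback module to
        # apply - the "relevance feedback controller" suggested above.
        def __init__(self, modules):
            self.modules = modules   # {name: RelevanceFeedbackModule}

        def run(self, user_profile, query_spec, classified_results):
            module = self.select(user_profile)
            return module.interpret(query_spec, classified_results)

        def select(self, user_profile):
            # The selection policy is an open research question; this is a
            # trivial placeholder based on a stored user preference.
            preferred = user_profile.get("preferred")
            return self.modules.get(preferred, next(iter(self.modules.values())))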

CONCLUSION
There is a rich and diverse collection of relevance feedback techniques described in the literature. Any multimedia retrieval system aiming to cater to users' needs at a semantic or close-to-semantic level of performance should certainly incorporate a relevance feedback strategy as a measure towards narrowing the semantic gap. An understanding of the generic types of techniques that are available, and of what bearing they have on the specific goals of the multimedia retrieval system in question, is essential. The situation is complicated by the fact that in the literature such techniques are often supported by precision and recall figures and other such objective measures. While these figures can reveal certain characteristics about the relevance feedback and its enhancement of the retrieval process, it is safe to say that such figures alone are not sufficient to compare alternatives, especially in semantic terms. It would appear that at the moment there are no definitive comparison criteria. However, a subjective consideration of available techniques can be made in accordance with the guidelines outlined and the conceptual alignment of a relevance feedback technique with the goals of the given retrieval system. Such an approach has the benefit of being able to eliminate techniques should they not meet the requirements envisioned for the retrieval system. Should existing trends continue and relevance feedback systems continue to become more complex and general, it may be possible to implement a relevance feedback controller, or a meta-relevance feedback module, that can use information gathered from the user's feedback to actually select and refine the system's relevance feedback strategy to better meet the user's specific information needs at a semantic level.

REFERENCES

Cleverdon, C. W. (1974). User evaluation of information retrieval systems. Journal of Documentation, 30, 170-180.
Cox, I., Miller, M., Omohundro, S., & Yianilos, P. (1996). PicHunter: Bayesian relevance feedback for image retrieval. In Proceedings of the 13th International Conference on Pattern Recognition (Vol. 3, pp. 361-369).
Cox, I. J., Miller, M. L., Minka, T. P., & Yianilos, P. N. (1998). An optimized interaction strategy for Bayesian relevance feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, California (pp. 553-558).
Dorai, C., Mauthe, A., Nack, F., Rutledge, L., Sikora, T., & Zettl, H. (2002). Media semantics: Who needs it and why? In Proceedings of the 10th ACM International Conference on Multimedia (pp. 580-583).
Geman, D., & Moquet, R. (1999). A stochastic feedback model for image retrieval. Technical report, Ecole Polytechnique, 91128 Palaiseau Cedex, France.
Grootjen, F., & van der Weide, T. P. (2002). Conceptual relevance feedback. In IEEE International Conference on Systems, Man and Cybernetics (Vol. 2, pp. 471-476).
Hersh, W. (1994). Relevance and retrieval evaluation: Perspectives from medicine. Journal of the American Society for Information Science, 45(3), 201-206.
Ishikawa, Y., Subramanya, R., & Faloutsos, C. (1998). MindReader: Querying databases through multiple examples. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB) (pp. 218-227).
Kang, H.-B. (2003). Emotional event detection using relevance feedback. In Proceedings of the International Conference on Image Processing (Vol. 1, pp. 721-724).
Koskela, M., Laaksonen, J., & Oja, E. (2002). Implementing relevance feedback as convolutions of local neighborhoods on self-organizing maps. In Proceedings of the International Conference on Artificial Neural Networks (pp. 981-986), Madrid, Spain.
Laaksonen, J., Koskela, M., Laakso, S. P., & Oja, E. (2000). PicSOM: Content-based image retrieval with self-organizing maps. Pattern Recognition Letters, 21, 1199-1207.
Liu, M., & Wan, C. (2003). Weight updating for relevance feedback in audio retrieval. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03) (Vol. 5, pp. 644-647).
Lu, Y., Hu, C., Zhu, X., Zhang, H., & Yang, Q. (2000). A unified framework for semantics and feature based relevance feedback in image retrieval systems. In Proceedings of the Eighth ACM International Conference on Multimedia, Marina del Rey, California (pp. 31-37).
MacArthur, S., Brodley, C., & Shyu, C.-R. (2000). Relevance feedback decision trees in content-based image retrieval. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (pp. 68-72).
Mizzaro, S. (1998). How many relevances in information retrieval? Interacting with Computers, 10(3), 305-322.
Müller, H., Squire, D., & Pun, T. (2004). Learning from user behaviour in image retrieval: Application of market basket analysis. International Journal of Computer Vision, 56(1-2), 65-77.
Muneesawang, P., & Guan, L. (2002). Video retrieval using an adaptive video indexing technique and automatic relevance feedback. In Proceedings of the IEEE Workshop on Multimedia Signal Processing (pp. 220-223).
Nastar, C., Mitschke, M., & Meilhac, C. (1998). Efficient query refinement for image retrieval. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, California (pp. 547-552).
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341-356.

Rui, Y., & Huang, T. S. (1999). A novel relevance feedback technique in image retrieval. In Proceedings of the Seventh ACM International Conference on Multimedia (Part 2), Orlando, Florida (pp. 67-70).
Rui, Y., Huang, T., & Mehrotra, S. (1997). Content-based image retrieval with relevance feedback in MARS. In Proceedings of the IEEE International Conference on Image Processing (pp. 815-818).
Ruthven, I., & van Rijsbergen, C. J. (1996). Context generation in information retrieval. In Proceedings of the Florida Artificial Intelligence Research Symposium.
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Su, L. T. (1994). The relevance of recall and precision in user evaluation. Journal of the American Society for Information Science, 45(3), 207-217.
Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445-470). D. Reidel Publishing Company.
Wilson, C., & Srinivasan, B. (2002). Multiple feature relevance feedback in content-based image retrieval using probabilistic inference networks. In Proceedings of the First International Conference on Fuzzy Systems and Knowledge Discovery (FSKD '02), Singapore (pp. 651-655).
Wilson, C., Srinivasan, B., & Indrawan, M. (2001). BIR: The Bayesian network image retrieval system. In Proceedings of the IEEE International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP 2001), Hong Kong SAR, China (pp. 304-307).
Wilson, P. (1973). Situational relevance. Information Storage and Retrieval, 9, 457-471.
Zhang, H., Chen, Z., Li, M., & Su, Z. (2003). Relevance feedback and learning in content-based image search. World Wide Web: Internet and Web Information Systems, 6, 131-155.
Zhou, X. S., & Huang, T. S. (2001a). Comparing discriminating transformations and SVM for learning during multimedia retrieval. In Proceedings of the Ninth ACM International Conference on Multimedia, Ottawa, Canada (pp. 137-146).
Zhou, X. S., & Huang, T. S. (2001b). Exploring the nature and variants of relevance feedback. In Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL 2001) (pp. 94-101).
Zhuang, Y., Yang, J., Li, Q., & Pan, Y. (2002). A graphic-theoretic model for incremental relevance feedback in image retrieval. In Proceedings of the International Conference on Image Processing (Vol. 1, pp. 413-416).
Zutshi, S., Wilson, C., Krishnaswamy, S., & Srinivasan, B. (2003). Modelling relevance feedback using rough sets. In Proceedings of the Fifth International Conference on Advances in Pattern Recognition (ICAPR 2003) (pp. 495-500).

SECTION 4: MANAGING DISTRIBUTED MULTIMEDIA

Chapter 13

EMMO: Tradeable Units of Knowledge-Enriched Multimedia Content


Utz Westermann, University of Vienna, Austria
Sonja Zillner, University of Vienna, Austria
Karin Schellner, ARC Research Studio Digital Memory Engineering, Vienna, Austria
Wolfgang Klas, University of Vienna and ARC Research Studio Digital Memory Engineering, Vienna, Austria

ABSTRACT

Current semantic approaches to multimedia content modeling treat the content's media, the semantic description of the content, and the functionality performed on the content, such as rendering, as separate entities, usually kept on separate servers in separate files or databases and typically under the control of different authorities. This separation of content from its description and functionality hinders the exchange and sharing of content in collaborative multimedia application scenarios. In this chapter, we propose Enhanced Multimedia MetaObjects (Emmos) as a new content modeling formalism that combines multimedia content with its description and functionality. Emmos can be serialized and exchanged in their entirety, covering media, description, and functionality, and are versionable, thereby establishing a suitable foundation for collaborative multimedia applications. We outline a distributed infrastructure for Emmo management and illustrate the benefits and usefulness of Emmos and this infrastructure by means of two practical applications.

INTRODUCTION
Today's multimedia content formats such as HTML (Raggett et al., 1999), SMIL (Ayars et al., 2001), or SVG (Ferraiolo et al., 2003) primarily encode the presentation of content but not the information the content conveys. This presentation-oriented modeling only permits the hard-wired presentation of multimedia content exactly in the way specified; for advanced operations like retrieval and reuse, automatic composition, recommendation, and adaptation of content according to user interests, information needs, and technical infrastructure, valuable information about the semantics of content is lacking. In parallel to research on the Semantic Web (Berners-Lee et al., 2001; Fensel, 2001), one can therefore observe a shift in paradigm towards a semantic modeling of multimedia content. The basic media of which multimedia content consists are supplemented with metadata describing these media and their semantic interrelationships. These media and descriptions are processed by stylesheets, search engines, or user agents providing advanced functionality on the content that can exceed mere hard-wired playback.

Current semantic multimedia modeling approaches, however, largely treat the content's basic media, the semantic description, and the functionality offered on the content as separate entities: the basic media of which multimedia content consists are typically stored on web or media servers; the semantic descriptions of these media are usually stored in databases or in dedicated files on web servers using formats like RDF (Lassila & Swick, 1999) or Topic Maps (ISO/IEC JTC 1/SC 34/WG 3, 2000); the functionality on the content is normally realized as servlets or stylesheets running in application servers or as dedicated software running at the clients, such as user agents. This inherent separation of media, semantic description, and functionality in semantic multimedia content modeling, however, hinders the realization of multimedia content sharing as well as collaborative applications, which are gaining more and more importance, such as the sharing of MP3 music files (Gnutella, n.d.) or learning materials (Nejdl et al., 2002) or the collaborative authoring and annotation of multimedia patient records (Grimson et al., 2001). The problem is that exchanging content today in such applications simply means exchanging single media files. An analogous exchange of semantically modeled multimedia content would have to include content descriptions and associated functionality, which are only coupled loosely to the media and usually exist on different kinds of servers potentially under the control of different authorities, and which are thus not easily moveable.

In this chapter, we give an illustrated introduction to Enhanced Multimedia MetaObjects (Emmo), a semantic multimedia content modeling approach developed with collaborative and content sharing applications in mind. Essentially, an Emmo constitutes a self-contained piece of multimedia content that merges three of the content's aspects into a single object: the media aspect, that is, the media which make up the multimedia content; the semantic aspect, which describes the content; and the functional aspect, by which an Emmo can offer meaningful operations on the content and its description that can be invoked and shared by applications. 
Emmos in their entirety, including media, content description, and functionality, can be serialized into bundles and are versionable: essential characteristics that enable their exchangeability in content sharing applications as well as the distributed construction and modification of Emmos in collaborative scenarios.

Furthermore, this chapter illustrates how we employed Emmos for two concrete collaborative and content sharing applications in the domains of cultural heritage and digital music archives. The chapter is organized as follows: we begin with an overview of Emmos and show how they differ from existing approaches to multimedia content modeling. We then introduce the conceptual model behind Emmos and outline a distributed Emmo container infrastructure for the storage, exchange, and collaborative construction of Emmos. We then apply Emmos to the representation of multimedia content in two application scenarios. We conclude the chapter with a summary and an outlook on our current and future work.

BACKGROUND
In this section, we provide a basic understanding of the Emmo idea by means of an illustrative example. We then show the uniqueness of this idea by relating Emmos to other approaches to multimedia content modeling in the field.

The Emmo Idea


The motivation for the development of the Emmo model is the desire to realize multimedia content sharing and collaborative applications on the basis of semantically modeled content, but to avoid the limitations and difficulties of current semantic modeling approaches implied by their isolated treatment of media, semantic description, and content functionality. Following an abstract vision originally formulated by Reich et al. (2000), the essential idea behind Emmos is to keep semantic description and functionality tied to the pieces of content to which they belong, thereby creating self-contained units of semantically modeled multimedia content that are easier to exchange in content sharing and collaborative application scenarios. An Emmo coalesces the basic media of which a piece of multimedia content consists (i.e., the content's media aspect), the semantic description of these media (i.e., the content's semantic aspect), and functionality on the content (i.e., the content's functional aspect) into a single serializable and versionable object. Figure 1 depicts a sketch of a simplified Emmo, which models a small multimedia photo album of a holiday trip of an imaginary couple, Paul and Mary, and their friend Peter. The bottom of the figure illustrates how Emmos address the media aspect of multimedia content. An Emmo acts as a container of the basic media of which the content represented by the Emmo consists. In our example, the Emmo contains four JPEG images which constitute the different photographs of the album, along with corresponding thumbnail images. Media can be contained either by inclusion, that is, raw media data is directly embedded within an Emmo, or by reference via a URI if embedding raw media data is not feasible because of the size of the media data or because the media is a live stream. An Emmo further carries a semantic description of the basic media it contains and the associations between them. This semantic aspect, illustrated to the upper left of Figure 1, makes an Emmo a unit of knowledge about the multimedia content it represents.

Figure 1. Aspects of an Emmo


(The figure sketches the three aspects of the example Emmo. Media: picture001.jpg to picture004.jpg with the corresponding thumbnails thumbnail001.jpg to thumbnail004.jpg, subsumed under the logical media nodes Picture 1 to Picture 4, each an instance of Photograph and annotated with a shooting date between 07/21/2003 and 08/1/2003. Semantics: the pictures depict Peter, Paul, and Mary, instances of Person, with Paul and Mary as Family Members and Peter as a Friend; the depicted locations Paris, Vienna, and Salzburg are parts of France and Austria, respectively. Functionality: the operations renderAsSlideShow(persons, locations, dates) and renderAsMap(persons, location, dates).)

For content description, Emmos apply an expressive concept graph-like data model similar to RDF and Topic Maps. In this graph model, the description of the content represented by an Emmo is not performed directly on the media that are contained in the Emmo; instead, the model abstracts from physical media, making it possible to subsume several media objects which constitute only different physical manifestations of logically one and the same medium under a single media node. This is a convenient way to capture alternative media. In Figure 1, for example, each of the media nodes Picture 1 to Picture 4 subsumes not only a photo but also its corresponding thumbnail image. Apart from media, nodes can also represent abstract concepts. By associating an Emmo's media objects with such concepts, it is possible to create semantically rich descriptions of the multimedia content the Emmo represents. In Figure 1, for instance, it is expressed that the logical media nodes Picture 1 to Picture 4 constitute photos taken in Paris, Vienna, and Salzburg showing Peter and Paul, Paul and Mary, and Mary, respectively. The figure further indicates that nodes can be augmented with primitive attribute values for closer description: the pictures of the photo album are furnished with the dates at which they have been shot. By associating concepts with each other, it is also possible to express domain knowledge within an Emmo. It is stated in our example that Peter, Paul, and Mary are Persons, that Paul and Mary are family members, that Peter is a friend, that Paris is located in France, and that Vienna and Salzburg are parts of Austria. The Emmo model does not predefine the concepts, association types, and primitive attributes available for media description; these can be taken from arbitrary, domain-specific ontologies. While Emmos thus constitute a very generic, flexible, and expressive approach to multimedia content modeling, they are not a ready-to-use formalism but require an agreed common ontology before they can be employed in an application.

Figure 2. Emmo functionality

Finally, Emmos also address the functional aspect of content. An Emmo can offer operations that can be invoked by applications in order to work with the content the Emmo represents in a meaningful manner. As shown to the top right of Figure 1, our example Emmo provides two operations supporting two different rendition options for the photo album, which are illustrated by the screenshots of Figure 2. As indicated by the left screenshot, the operation renderAsSlideshow() might know how, given a set of persons, locations, and time periods of interest, to render the photo album as a classic slideshow on the basis of the contained pictures and their semantic description by generating an appropriate SMIL presentation. As indicated by the right screenshot, the operation renderAsMap() might also know how, given the same data, to render the photo album as a map with thumbnails pointing to the locations where the photographs have been taken by constructing an SVG graph. One may think of many further uses of operations; for example, operations could also be offered for rights clearance, displaying terms of usage, and so forth. Emmos have further properties: an Emmo can be serialized and shared in its entirety in a distributed content sharing scenario, including its contained media, the semantic description of these media, and its operations. In our example, this means that Paul can pass the photo album Emmo as a whole to Peter, for instance via email or a file-sharing peer-to-peer infrastructure, and Peter can do anything with the Emmo that Paul can also do, including invoking its operations. Emmos also support versioning. Every constituent of an Emmo is versionable, an essential prerequisite for applications requiring the distributed and collaborative authoring of multimedia content. This means that Peter, having received the Emmo from Paul, can add his own pictures to the photo album while Paul can still modify his local copy. Thereby, two concurrent versions of the Emmo are created. As the Emmo model is able to distinguish both versions, Paul can merge them into a final one when he receives Peter's changes.
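
As a rough illustration only, the sketch below shows how the three aspects might coexist in one serializable object. It is our own simplification, not the actual Emmo data model or API; the class and attribute names are invented, and real Emmo operations would need mobile code rather than bare Python callables.

    import pickle
    import uuid

    class Emmo:
        # A toy Emmo: media (by value or URI), a small concept-graph
        # description, named operations, and a version number in one object.
        def __init__(self, name):
            self.oid = uuid.uuid4()
            self.name = name
            self.media = {}          # logical media node -> list of manifestations
            self.statements = []     # (subject, association type, object) triples
            self.operations = {}     # operation name -> callable
            self.version = 1

        def add_media(self, node, manifestation):
            self.media.setdefault(node, []).append(manifestation)

        def assert_statement(self, subject, association, obj):
            self.statements.append((subject, association, obj))

        def invoke(self, operation, *args, **kwargs):
            # Functional aspect: applications call operations by name.
            return self.operations[operation](self, *args, **kwargs)

        def serialize(self):
            # Bundle the whole object for exchange; pickling only works for
            # module-level callables, hence the caveat above.
            return pickle.dumps(self)

For the photo album, calls such as album.add_media("Picture 1", "picture001.jpg") and album.assert_statement("Picture 1", "depicts", "Peter") would populate the media and semantic aspects, while album.invoke("renderAsSlideShow", ...) would exercise the functional one.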

Related Approaches
The fundamental idea underlying the concept of Emmos presented beforehand is that an Emmo constitutes an object unifying three different aspects of multimedia


content, namely the media aspect, the semantic aspect, and the functional aspect. In the following, we fortify our claim that this idea is unique. Interrelating basic media like single images and videos to form multimedia content is the task of multimedia document models. Recently, several standards for multimedia document models have emerged (Boll et al., 2000), such as HTML (Ragett et al., 1999), XHTML+SMIL (Newmann et al., 2002), HyTime (ISO/IEC JTC 1/SC 34/WG 3, 1997), MHEG-5 (ISO/IEC JTC 1/SC 29, 1997), MPEG-4 BIFS and XMT (Pereira & Ebrahimi, 2002), SMIL (Ayars et al., 2001), and SVG (Ferraiolo et al., 2003). Multimedia document models can be regarded as composite media formats that model the presentation of multimedia content by arranging basic media according to temporal, spatial, and interaction relationships. They thus mainly address the media aspect of multimedia content. Compared to Emmos, however, multimedia document models neither interrelate multimedia content according to semantic aspects nor do they allow providing functionality on the content. They rely on external applications like presentation engines for content processing. As a result of research concerning the Semantic Web, a variety of standards have appeared that can be used to model multimedia content by describing the information it conveys on a semantic level, such as RDF (Lassila & Swick, 1999; Brickley & Guha, 2002), Topic Maps (ISO/IEC JTC 1/SC 34/WG 3, 2000), MPEG-7 (especially MPEG-7s graph tools for the description of content semantics (ISO/IEC JTC 1/SC 29/WG 11, 2001)), and Conceptual Graphs (ISO/JTC1/SC 32/WG 2, 2001). These standards clearly cover the semantic aspect of multimedia content. As they also offer means to address media within a description, they undoubtedly refer to the media aspect of multimedia content as well. Compared to Emmos, however, these approaches do not provide functionality on multimedia content. They rely on external software like database and knowledge base technology, search engines, user agents, and so forth, for the processing of content descriptions. Furthermore, media descriptions and the media described are separate entities potentially scattered around different places on the Internet, created and maintained by different and unrelated authorities not necessarily aware of each other and not necessarily synchronized whereas Emmos combine media and their semantic relationships into a single indivisible unit. There exist several approaches that represent multimedia content by means of objects. Enterprise Media Beans (EMBs) (Baumeister, 2002) extend the Enterprise Java Beans (EJBs) architecture (Matena & Hapner, 1998) with predefined entity beans for the representation of basic media within enterprise applications. These come with rudimental access functionality but can be extended with arbitrary functionality using the inheritance mechanisms available to all EJBs. Though addressing the media and functional aspects of content, EMBs in comparison to Emmo are mainly concerned with single media content and not with multimedia content. Furthermore, EMBs do not offer any dedicated support for the semantic aspect of content. Adlets (Chang & Znati, 2001) are objects that represent individual (not necessarily multimedia) documents. Adlets support a fixed set of predefined functionality which enables them to advertise themselves to other Adlets. They are thus content representations that address the media as well as the functional aspect. 
Different from Emmos, however, the functionality supported by Adlets is limited to advertisement and there is no explicit modeling of the semantic aspect. Tele-Action Objects (TAOs) (Chang et al., 1995) are object representations of multimedia content that encapsulate the basic media of which the content consists and

interlink them with associations. Though TAOs thus address the media aspect of multimedia content in a way similar to Emmos, they do not adequately cover the semantic aspect of multimedia content: only a fixed set of five association types is supported mainly concerned with temporal and spatial relationships for presentation purposes. TAOs can further be augmented with functionality. Such functionality is, in contrast to the functionality of Emmos, automatically invoked as the result of system events and not explicitly invoked by applications. Distributed Active Relationships (Daniel et al., 1998) define an object model based on the Warwick Framework (Lagoze et al., 1996). In the model, Digital Objects (DOs), which are interlinked with each other by semantic relationships, act as containers of metadata describing multimedia content. DOs thus do not address the media aspect of multimedia content but focus on the semantic aspect. The links between containers can be supplemented with arbitrary functionality. As a consequence, DOs take account of the functional aspect as well. Different from Emmos, however, the functionality is not explicitly invoked by applications but implicitly whenever an application traverses a link between two DOs.

ENHANCED MULTIMEDIA META OBJECTS


Having motivated and illustrated the basic idea behind them, this section semiformally introduces the conceptual model underlying Emmos using UML class diagrams. A formal definition of this model can be found in (Schellner et al., 2003). The discussion is oriented along the three aspects of multimedia content encompassed by Emmos: the media aspect, the semantic aspect, and the functional aspect.

Figure 3. Management of basic media in an Emmo


(The diagram shows the class Connector linking a MediaProfile, which carries low-level attributes such as contentType, fileFormat, fileSize, bitRate, frameRate, resolution, width, and height, to its 1..* MediaInstance objects with inlineMedia, locationDescription, and mediaURL; a Connector may also refer to a MediaSelector, whose subclasses are FullSelector, TemporalSelector, SpatialSelector, TextualSelector, and CompositeSelector.)


Media Aspect
Addressing the media aspect of multimedia content, an Emmo encapsulates the basic media of which the content it represents is composed. Figure 3 presents the excerpt of the conceptual model that is responsible for this. Closely following the MPEG-7 standard and its multimedia description tools (ISO/IEC JTC 1/SC 29/WG 11, 2001), basic media are modeled by media profiles (represented by the class MediaProfile in Figure 3) along with associated media instances (represented by the class MediaInstance). Media profiles hold low-level metadata describing physical characteristics of the media, such as the storage format, file size, and so forth; the media data itself is represented by media instances, each of which may directly embed the data in the form of a byte array or, if that is not possible or feasible, address its storage location by means of a URI. Moreover, if a digital representation is not available, a textual location description can be specified, for example the location of analog tapes in some tape archive. Figure 3 further shows that a media profile can have more than one media instance. In this way, an Emmo can be provided with information about alternative storage locations of media. Basic media represented by media profiles and media instances are attached to an Emmo by means of a connector (see class Connector in Figure 3). A connector does not just address a basic medium via a media profile; it may also refer to a media selector (see base class MediaSelector) to address only a part of the medium. As indicated by the various subclasses of MediaSelector, it is possible to select media parts according to spatial, temporal, and textual criteria, as well as an arbitrary combination of these criteria (see class CompositeSelector). It is thus possible to address, within an Emmo, the upper-right part of a scene in a digital video starting from second 10 and lasting until second 30, without having to extract that scene and put it into a separate media file using a video editing tool.
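To make the selector mechanism more concrete, the following minimal Java sketch composes the example just given: the upper-right region of a video scene from second 10 to second 30. The class and field names loosely follow Figure 3, but the definitions themselves are simplified assumptions made for illustration, not the actual Emmo implementation.

import java.util.List;

// Hypothetical, simplified stand-ins for the selector classes of Figure 3.
interface MediaSelector {}

class TemporalSelector implements MediaSelector {
    final int beginMs, durationMs;
    TemporalSelector(int beginMs, int durationMs) {
        this.beginMs = beginMs;
        this.durationMs = durationMs;
    }
}

class SpatialSelector implements MediaSelector {
    final int startX, startY, endX, endY;
    SpatialSelector(int startX, int startY, int endX, int endY) {
        this.startX = startX; this.startY = startY;
        this.endX = endX; this.endY = endY;
    }
}

// A composite selector combines several criteria into one selection.
class CompositeSelector implements MediaSelector {
    final List<MediaSelector> parts;
    CompositeSelector(List<MediaSelector> parts) { this.parts = parts; }
}

public class SelectorExample {
    public static void main(String[] args) {
        // Seconds 10 to 30 of the video ...
        MediaSelector time = new TemporalSelector(10_000, 20_000);
        // ... restricted to the upper-right quadrant (assuming a 640x480 frame).
        MediaSelector region = new SpatialSelector(320, 0, 640, 240);
        CompositeSelector scenePart = new CompositeSelector(List.of(time, region));
        System.out.println("Selector combines " + scenePart.parts.size() + " criteria");
    }
}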

Semantic Aspect
Out of the basic media which it contains, an Emmo forges a piece of semantically modeled multimedia content by describing these media and their semantic interrelationships. The class diagram of Figure 4 gives an overview over the part of the Emmo model that provides these semantic descriptions. As one can see, the basic building blocks of the semantic descriptions, the so-called entities, are subsumed under the common base

Figure 4. Semantic aspect of Emmos


(The diagram shows the abstract class Entity with its subclasses LogicalMediaPart, Emmo, Association, and OntologyObject.)


Figure 5. Entity details


(The diagram shows the class Entity with the attributes OID, name, description, creationDate, modifiedDate, and creator; 0..* predecessor and successor versions; 0..* types given by OntologyObject; and 0..* AttributeValue pairs whose attribute is an OntologyObject and whose value is an arbitrary object.)

class Entity. The Emmo model distinguishes four kinds of entities: namely, logical media parts, associations, ontology objects, and Emmos themselves, represented by assigned subclasses. These four kinds of entities have a common nature but each extends the abstract notion of an entity with additional characteristic features. Figure 5 depicts the characteristics that are common to all kinds of entities. Each entity is globally and uniquely identified by its OID, realized by means of a universal unique identifier (UUID) (Leach, 1998) which can be easily created even in distributed scenarios. To enhance human readability and usability, each entity is further augmented with additional attributes like a name and a textual description. Moreover, each entity holds information about its creator and its creation and modification date. Figure 5 further expresses that entities may receive an arbitrary number of types. A type is a concept taken from an ontology and represented by an ontology object in the model. Types thus constitute entities themselves. By attaching types, an entity gets meaning and is classified in an application-dependent ontology. As mentioned before, the Emmo model does not come with a predefined set of ontology objects but instead relies on applications to agree on common ontology before the Emmo model can be used. In the example of Figure 6, the entity Picture 3 of kind logical media part (depicted as a rectangle), which represents the third picture of our example photo album of the

Figure 6. Entity with its types



Figure 7. Entity with an attribute value



holiday trip introduced in the previous section, is an instantiation of the concepts photograph and digital image, represented by the ontology objects photograph and digital image (each pictured by an octagon), respectively. The type relationships are indicated by dashed arrows. For further description, the Emmo model also allows attaching arbitrary attribute values to entities (expressed by the class of the same name in the class diagram of Figure 5). Attribute values are simple attribute-value pairs, with the attributes being a concept of an application-dependent ontology represented by an ontology object entity, and the value being an arbitrary object suiting the type of the value. The rationale behind representing attributes by concepts of an ontology and not just by mere string identifiers is that this allows expressing constraints on the usage of attributes within the ontology; for example, to which entity types attributes are applicable. Figure 7 gives an example of attribute values. In the figure, it is stated that the third picture of the photo album was taken July 28, 2003, by attaching an attribute value date=07/28/2003 to the entity Picture 3 representing that picture. The attribute date is modeled by the ontology object date and the value 07/28/2003 is captured by an object of a suitable date class (represented using the UML object notation). As an essential prerequisite for the realization of distributed, collaborative multimedia applications in which multimedia content is simultaneously authored and annotated by different persons at different locations, the Emmo model provides intrinsic support for versioning. The class diagram of Figure 5 states that every entity is versionable and can have an arbitrary number of predecessor and successor versions, all of which have to be entities of the same kind as the original entity. Treating an entitys versions as entities on their own has several benefits: on the one hand, entities constituting versions of other entities have their own globally unique OID. Hence, different versions concurrently derived from one and the same entity at different sites can easily be distinguished without synchronization effort. On the other hand, different versions of an entity can be interrelated just like any other entities allowing one to establish comparative relationships between entity versions. Figure 8 exemplifies a possible versioning of our example entity Picture 3. The original version of this logical part is depicted to the left of the figure. As expressed by the special arrows indicating the predecessor (pred) and the successor (succ) relationship between different versions of the same entity, two different successor versions of this original version were created, possibly by two different people at two different


Figure 8. Versioning of an entity



locations. One version augments the logical media part with a date attribute value to denote the creation date of the picture whereas the other provides an attribute value describing the aperture with which the picture was taken. Finally, as shown by the logical media part at the right side of the figure, these two versions were merged again into a fourth that now holds both attribute values. Having explained the common characteristics shared by all entities, we are now able to introduce the peculiarities of the four concrete kinds of entities: logical media parts, ontology objects, associations, and Emmos.
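The following self-contained Java sketch mimics these common entity characteristics, with a UUID-based OID, attribute values, and predecessor/successor links, and replays the versioning scenario of Figure 8. The Entity class and its methods are assumptions made for illustration only, and the aperture value shown is likewise invented.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical, much-simplified entity with the common characteristics of Figure 5.
class Entity {
    final String oid = UUID.randomUUID().toString();          // globally unique OID
    final String name;
    final Map<String, Object> attributeValues = new HashMap<>();
    final Set<Entity> predecessors = new HashSet<>();
    final Set<Entity> successors = new HashSet<>();

    Entity(String name) { this.name = name; }

    // Derive a new version that starts out with this entity's attribute values.
    Entity newVersion() {
        Entity v = new Entity(name);
        v.attributeValues.putAll(attributeValues);
        v.predecessors.add(this);
        this.successors.add(v);
        return v;
    }

    // Merge two concurrent versions into a third one holding both sets of attributes.
    static Entity merge(Entity a, Entity b) {
        Entity m = a.newVersion();
        m.predecessors.add(b);
        b.successors.add(m);
        m.attributeValues.putAll(b.attributeValues);
        return m;
    }
}

public class VersioningExample {
    public static void main(String[] args) {
        Entity picture3 = new Entity("Picture 3");
        Entity withDate = picture3.newVersion();               // one author adds the date
        withDate.attributeValues.put("date", "07/28/2003");
        Entity withAperture = picture3.newVersion();           // another adds the aperture
        withAperture.attributeValues.put("aperture", "f/5.6"); // invented example value
        Entity merged = Entity.merge(withDate, withAperture);  // fourth version holds both
        System.out.println(merged.attributeValues);            // prints both attribute values
    }
}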

Logical Media Parts


Logical media parts are entities that form the bridge between the semantic aspect and the media aspect of an Emmo. A logical media part represents a basic medium of which multimedia content consists on a logical level for semantic description, thereby providing an abstraction from the physical manifestation of the medium. According to the class diagram of Figure 9, logical media parts can refer to an arbitrary number of connectors which we already know from our description of the media aspect of Emmo permitting one to logically subsume alternative media profiles and instances representing different media files in possibly different formats in possibly different storage locations under a common logical media part. The ID of the default profile to use is identified via the attribute masterProfileID. Since logical media parts do not need to have connectors associated with them, it is also possible to refer to media within Emmos which do not have a physical manifestation.

Ontology Objects
Ontology objects are entities that represent concepts of an ontology. We have already described how ontology objects are used to define entity types and to augment entities with attribute values. By relating entities such as logical media parts to ontology objects, they can be given a meaning. As it can be seen from the class diagram of Figure 10, the Emmo model distinguishes two kinds of ontology objects represented by two subclasses of OntologyObject: Concept and ConceptRef. Whereas an instance of Concept serves to represent a concept of an ontology that is fully captured within the Emmo model, ConceptRef allows one to reference concepts of ontologies specified in external ontology languages such as RDF Schema (Brickley & Guha, 2002). The latter is a pragmatic tribute to the fact that we have not developed an ontology language for

Figure 9. Logical media parts


(The diagram shows LogicalMediaPart with the attribute masterProfileID : String, referring to 0..* Connector.)

Figure 10. Ontology objects


(The diagram shows OntologyObject with the attribute ontType : int and its subclasses Concept and ConceptRef, the latter with objOID : String and ontStandard : String.)

Figure 11. Association


(The diagram shows that an Association has exactly one source Entity and one target Entity.)

Emmos yet and therefore rely on external languages for this purpose. References to concepts of external ontologies additionally need a special ID (objOID) uniquely identifying the external concept referenced and a label indicating the format of the ontology (ontStandard); for example, RDF Schema.
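A hedged Java sketch of the two kinds of ontology objects might look as follows. The field names objOID and ontStandard follow Figure 10, while the constructors, the example concept names, and the example URI are assumptions for illustration.

// Hypothetical sketch of the two kinds of ontology objects shown in Figure 10.
abstract class OntologyObject {
    final String name;
    OntologyObject(String name) { this.name = name; }
}

// A concept fully captured within the Emmo model itself.
class Concept extends OntologyObject {
    Concept(String name) { super(name); }
}

// A reference to a concept defined in an external ontology language.
class ConceptRef extends OntologyObject {
    final String objOID;        // ID of the referenced external concept
    final String ontStandard;   // format of the external ontology, e.g. "RDF Schema"
    ConceptRef(String name, String objOID, String ontStandard) {
        super(name);
        this.objOID = objOID;
        this.ontStandard = ontStandard;
    }
}

public class OntologyObjectExample {
    public static void main(String[] args) {
        OntologyObject photograph = new Concept("photograph");
        OntologyObject person =
                new ConceptRef("Person", "http://example.org/ontology#Person", "RDF Schema");
        System.out.println(photograph.name + ", " + person.name);
    }
}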

Associations
Associations are entities that establish binary directed relationships between entities, allowing the creation of complex and detailed descriptions of the multimedia content represented by the Emmo. As one can see from Figure 11, each association has exactly one source entity and one target entity. The kind of semantic relationship represented by an association is defined by the associations type which is like the types of other entities an ontology object representing the concept that captures the type in an ontology. Different from other entities, however, an association is only permitted to have one type as it can express only a single kind of relationship. Since associations are first-class entities, they can take part as sources or targets in other associations like any other entities. This feature permits the creation of very complex content descriptions, as it facilitates the reification of statements (statements about statements) within the Emmo model.

Figure 12. Reification



Figure 12 demonstrates how reification can be expressed. In the figure, associations are symbolized by a diamond shape, with solid arrows indicating the source and target of an association and dashed arrows indicating the association type. The example in this figure expresses that Peter believes that Paul thinks that Mary fancies Picture 3. The statement "Mary fancies Picture 3" is represented at the bottom of the figure by an association of type fancies that connects the ontology object Mary with the logical media part Picture 3. Moreover, this association acts as the target of another association having the type thinks and the source entity Paul, thereby making a statement about the statement "Mary fancies Picture 3". This reification is then further enhanced by attaching a third association of type believes with source Peter, yielding the desired message.
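Because associations are first-class entities, such reification can be built up mechanically. The following Java sketch reconstructs the example of Figure 12 with deliberately simplified stand-in classes; in the real model, association types would be ontology objects rather than plain strings.

// Simplified stand-ins: associations are entities themselves and can therefore
// act as source or target of further associations (reification).
class Entity {
    final String name;
    Entity(String name) { this.name = name; }
}

class Association extends Entity {
    final Entity source, target;
    final String type;   // in the real model, the type would be an ontology object
    Association(Entity source, String type, Entity target) {
        super(source.name + " " + type + " " + target.name);
        this.source = source;
        this.type = type;
        this.target = target;
    }
}

public class ReificationExample {
    public static void main(String[] args) {
        Entity mary = new Entity("Mary");
        Entity paul = new Entity("Paul");
        Entity peter = new Entity("Peter");
        Entity picture3 = new Entity("Picture 3");

        Association fancies = new Association(mary, "fancies", picture3);
        // A statement about a statement: the fancies association is the target.
        Association thinks = new Association(paul, "thinks", fancies);
        Association believes = new Association(peter, "believes", thinks);

        // Prints: Peter believes Paul thinks Mary fancies Picture 3
        System.out.println(believes.name);
    }
}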

Emmos
Emmos themselves, finally, constitute the fourth kind of entities. An Emmo is basically a container that encapsulates arbitrary entities to form a semantically modeled piece of multimedia content (see the aggregation between the classes Emmo and Entity in the introductory outline of the model in Figure 4). As one and the same entity can be contained in more than one Emmo, it is possible to encapsulate different, context-dependent, and even contradicting views onto the same content within different Emmos; as Emmos are first-class entities, they can be contained within other Emmos and take part in associations therein, allowing one to build arbitrarily nested Emmo structures for the logical organization of multimedia content. These are important characteristics, especially useful for the authoring process, as they facilitate reuse of existing Emmos and the content they represent. Figure 13 shows an example where a particular Emmo encapsulates another. In the figure, Emmos are graphically shown as ellipses. The example depicts an Emmo modeling a private photo gallery that so far holds only a single photo album (again

Figure 13. Nested Emmo



modeled by an Emmo): namely, the photo album of the journey to Europe we used as a motivating example in the section illustrating the Emmo idea. Via an association, this album is classified as vacation within the photo gallery. In the course of time, the photo gallery might become filled with additional Emmos representing further photo albums; for example, one that keeps the photos of a summer vacation in Spain. These Emmos can be related to each other. For example, an association might express that the journey to Europe took place before the summer vacation in Spain.

Functional Aspect
Emmos also address the functional aspect of multimedia content. Emmos may offer operations that realize arbitrary content-specific functionality which makes use of the media and descriptions provided with the media and semantic aspects of an Emmo and which can be invoked by applications working with content. The class diagram of Figure 14 shows how this is realized in the model. As expressed in the diagram, an Emmo may aggregate an arbitrary number of operations represented by the class of the same name. Each operation has a designator, that is, a name that describes its functionality, which is represented by an ontology object. Similar to attributes, the motivation behind using concepts of an ontology as operation designators instead of simple string identifiers is that this allows one to express restrictions on the usage of operations within an ontology; for example, the types of Emmo for which an operation is available, the types of the expected input parameters, and so forth. The functionality of an operation is provided by a dedicated implementation class whose name is captured by an operations implClassName attribute to permit the dynamic instantiation of the implementation class at runtime. There are not many restrictions for such an implementation class: the Emmo model merely demands that an implementation class realizes the OperationImpl interface. OperationImpl enforces the implementation of a single method only: namely, the method execute() which expects

Figure 14. Emmos functionality


(The diagram shows that an Emmo aggregates 0..* Operation objects, each with an implClassName : String attribute and an OntologyObject designator; an Operation instantiates an implementation of the interface OperationImpl, which declares execute(in e : Emmo, in args : Object[]) : Object.)

Figure 15. Example of Emmo operations



the Emmo on which an operation is executed as its first parameter, followed by a vector of arbitrary operation-dependent parameter objects. The method execute() performs the desired functionality and, as a result, may return an arbitrary object. Figure 15 once more depicts the Emmo modeling the photo album of the journey to Europe that we already know from Figure 13, but this time enriched with the two operations already envisioned in the second section: one that traverses the semantic description of the album and returns an SMIL presentation rendering the album as a slide show, and another that returns an SVG presentation rendering the same album as a map. For both operations, implementation classes are provided that are attached to the Emmo and differentiated via their designators renderAsSlideShow and renderAsMap.
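In Java, the interface implied by Figure 14 is small. The sketch below states it together with a toy slide-show operation; the Emmo placeholder class and the SMIL string it returns are assumptions, but the execute() signature follows the class diagram.

// Placeholder for the Emmo model class; the real class is far richer.
class Emmo {
    final String name;
    Emmo(String name) { this.name = name; }
}

// Interface for operation implementations, following the signature in Figure 14.
interface OperationImpl {
    Object execute(Emmo e, Object[] args);
}

// Toy slide-show rendition; a real implementation would traverse the Emmo's
// semantic description and generate a complete SMIL presentation.
class RenderAsSlideShow implements OperationImpl {
    @Override
    public Object execute(Emmo e, Object[] args) {
        return "<smil><body><!-- slides generated from " + e.name + " --></body></smil>";
    }
}

public class OperationExample {
    public static void main(String[] args) {
        Emmo album = new Emmo("Journey to Europe");
        OperationImpl operation = new RenderAsSlideShow();
        System.out.println(operation.execute(album, new Object[0]));
    }
}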

THE EMMO CONTAINER INFRASTRUCTURE


As an elementary foundation for the sharing and collaborative authoring of pieces of semantically modeled multimedia content on the basis of the Emmo model, we have implemented a distributed Emmo container infrastructure. Figure 16 provides an overview of this infrastructure, which we are going to describe in more detail in the following section. Basically, an Emmo container provides a space where Emmos live. Its main purpose is the management and persistent storage of Emmos. An Emmo container provides application programming interfaces that permit applications to fine-grainedly access, manipulate, traverse, and query the Emmos it stores. This includes the media aspect of an Emmo with its media profiles and instances, the semantic aspect with all its descriptional entities such as logical media parts, ontology objects, other Emmos, and associations, as well as the versioning relationships between those entities. Moreover, an Emmo container offers an interface to invoke and execute an Emmo's operations, giving access to the functional aspect of an Emmo. Emmo containers are not intended as a centralized infrastructure with a single Emmo container running at a server (although this is possible). Instead, it is intended to establish a decentralized infrastructure with Emmo containers of different scales and sizes running at each site that works with Emmos. Such a decentralized Emmo management naturally reflects the nature of content sharing and collaborative multimedia applications. The decentralized approach has two implications. The first implication is that platform independence and scalability are important in order to support Emmo containers at potentially very heterogeneous sites ranging from home users to large multimedia content publishers with different operating systems, capabilities, and requirements. For these reasons, we have implemented the Emmo containers in Java, employing the object-oriented DBMS ObjectStore for persistent storage. By Java, we obtain platform independence; by ObjectStore, we obtain scalability, as there does not just exist a full-fledged database server implementation suitable for larger content providers, but also a code-compatible file-based in-process variant named PSEPro better suiting the limited needs of home users. It would have been possible to use a similarly scalable

Figure 16. Emmo container infrastructure
(The diagram shows two Emmo containers, each offering persistent storage of media and semantic relationships, interfaces to access, manipulate, traverse, and query Emmos, and an interface to invoke and execute operations; Emmos can be imported and exported between the containers.)


relational DBMS for persistent storage as well; we opted for an object-oriented DBMS, however, because of these systems suitability for handling complex graph structures like Emmos. The second implication of a decentralized infrastructure is that Emmos must be transferable between the different Emmo containers operated by users that want to share or collaboratively work on content. This requires Emmo containers to be able to completely export Emmos into bundles encompassing their media, semantic, and functional aspects, and to import Emmos from such bundles, which is explained in more detail in the following two subsections. In the current state of implementation, Emmo containers are rather isolated components, requiring applications to explicitly initiate the import and export of Emmos and to manually transport Emmo bundles between different Emmo containers themselves. We are building a peer-to-peer infrastructure around Emmo containers that permits the transparent search for and transfer of Emmos across different containers.

Exporting Emmos
An Emmo container can export an Emmo into a bundle whose overall structure is illustrated by Figure 17. The bundle is basically a ZIP archive which captures all three aspects of an Emmo: the media aspect is captured by the bundles media folder. The basic media files of which the multimedia content modeled by the Emmo consists are stored in this folder.

Figure 17. Structure of an Emmo bundle



Figure 18. Emmo XML representation


<?xml version="1.0" encoding="UTF-16"?>
<!-- Document created by org.cultos.storage.mdwb.exporter.MdwbXMLExporter -->
<emmo xmlns="http://www.cultos.org/emmos"
      xmlns:mpeg7="http://www.mpeg7.org/2001/MPEG-7_Schema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.cultos.org/emmos http://www.cultos.org/emmos/XML/emmo.xsd">
  <components>
    <entities>
      <entity xsi:type="LogicalMediaPart" mode="strong">
        <oid>E1a8d252-f8f2e098bb-3c7cb04afdbd1144ba4d1ea866d93db2</oid>
        <name>Beethoven's 5th Symphony</name>
        <creationDate>19 November 2003 08:24:54 CET</creationDate>
        <modifiedDate>19 November 2003 08:24:55 CET</modifiedDate>
      </entity>
      <entity xsi:type="Concept" mode="weak">
        <oid>E1a8d252-f8f2d7a8a5-3c7cb04afdbd1144ba4d1ea866d93db2</oid>
        <name>Classical music</name>
        <creationDate>19 November 2003 08:15:09 CET</creationDate>
        <modifiedDate>19 November 2003 08:15:09 CET</modifiedDate>
        <ontologyType>0</ontologyType>
      </entity>
      .....
    </entities>
    <mediaProfiles/>
  </components>
  <links>
    <types>
      <typeLink entity="E1a8d252-f8f2e095ed-3c7cb04afdbd1144ba4d1ea866d93db2"
                type="E1a8d252-f8f2e0980f-3c7cb04afdbd1144ba4d1ea866d93db2"/>
      <typeLink entity="E1a8d252-f8f2e095fc-3c7cb04afdbd1144ba4d1ea866d93db2"
                type="E1a8d252-f8f2e098ef-3c7cb04afdbd1144ba4d1ea866d93db2"/>
      <typeLink entity="E1a8d252-f8f2e098bb-3c7cb04afdbd1144ba4d1ea866d93db2"
                type="E1a8d252-f8f2d7a8a5-3c7cb04afdbd1144ba4d1ea866d93db2"/>
    </types>
    <attributeValues/>
    <associations>
      <assoLink association="E1a8d252-f8f2e0981a-3c7cb04afdbd1144ba4d1ea866d93db2"
                sourceEntity="E1a8d252-f8f2e095fc-3c7cb04afdbd1144ba4d1ea866d93db2"
                targetEntity="E1a8d252-f8f2e098bb-3c7cb04afdbd1144ba4d1ea866d93db2"/>
    </associations>
    <connectors/>
    <predVersions/>
    <succVersions/>
    <encapsulations/>
  </links>
</emmo>

The semantic aspect is captured by a central XML file whose name is given by the OID of the bundled Emmo. This XML file captures the semantic structure of the Emmo, thus describing all of the Emmo's entities, the associations between them, the versioning relationships, and so forth. Figure 18 shows a fragment of such an XML file. It is divided into a <components> section declaring all entities and media profiles relevant for the current Emmo and a <links> section capturing all kinds of relationships between these entities and media profiles, such as types, associations, and so forth. The functional aspect of an Emmo is captured by the bundle's operations folder in which the binary code of the Emmo's operations is stored. Here, our choice of Java as the implementation language for Emmo containers comes in handy again, as it allows

us to transfer operations in the form of JAR files with platform-independent bytecode even between heterogeneous platforms. The export functionality can react to different application needs by offering several export variants: an Emmo can be exported with or without media included in the bundle, one can choose whether to also include media that are only referenced by URIs, the predecessor and successor versions of the contained entities can either be added to the bundle or omitted, and it can be decided whether to recursively export Emmos contained within an exported Emmo. The particular export variants chosen are recorded in the bundle's manifest file. In order to implement these different export variants, an Emmo container distinguishes three different modes in which entities can be placed in a bundle:

- The strong mode is the normal mode for an entity. The bundle holds all information about an entity, including its types, attribute values, immediate predecessor and successor versions, media profiles (in case of a logical media part), contained entities (in case of an Emmo), and so forth.
- The hollow mode is applicable to Emmos only. It indicates that the bundle holds all information about an Emmo except the entities it contains. The hollow mode appears in bundles where it was chosen not to recursively export encapsulated Emmos; in this case, encapsulated Emmos receive the hollow mode, and the entities encapsulated by those Emmos are excluded from the export.
- The weak mode indicates that the bundle contains only basic information about an entity, such as its OID, name, and description, but no types, attribute values, and so forth. Weak mode entities appear in bundles that have been exported without versioning information; in this case, the immediate predecessor and successor versions of exported entities are placed into the bundle in weak mode, and indirect predecessor and successor versions are excluded from the export.

The particular mode of an entity within a bundle is marked with the mode attribute in the entity's declaration in the bundle's XML file (see again Figure 18).
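As an illustration of the bundle layout, the following sketch writes a minimal bundle with Java's standard java.util.zip classes: a media folder, the central XML file named after the Emmo's OID, and an operations folder for JAR files. All entry names, the shortened OID, and the stub contents are placeholders; the real exporter is considerably more involved.

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Writes a minimal Emmo bundle: media files, the central XML description named
// after the Emmo's OID, and operation code under operations/.
public class BundleExportSketch {
    public static void main(String[] args) throws IOException {
        String oid = "E1a8d252-f8f2e098bb";   // shortened placeholder OID
        try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(oid + ".zip"))) {
            // Media aspect: basic media files go into the media folder.
            addEntry(zip, "media/picture1.jpg", new byte[0]);          // image bytes omitted
            // Semantic aspect: the XML file named after the Emmo's OID.
            addEntry(zip, oid + ".xml",
                    "<emmo><components/><links/></emmo>".getBytes(StandardCharsets.UTF_8));
            // Functional aspect: operation bytecode packaged as JAR files.
            addEntry(zip, "operations/rendering.jar", new byte[0]);    // jar bytes omitted
        }
    }

    private static void addEntry(ZipOutputStream zip, String name, byte[] data)
            throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(data);
        zip.closeEntry();
    }
}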

Importing Emmos
When importing an Emmo bundle exported in the way described in the previous subsection, an Emmo container essentially inserts all media files, entities, and operations included in the bundle into its local database. In order to avoid duplicates, the container checks whether an entity with the same OID or whether a media file or JAR file already exists in the local database before insertion. If a file already exists, the basic strategy of the importing container is that the local copy prevails. However, the different export variants for Emmos and the different modes in which entities might occur in a bundle as well as the fact that in a collaborative scenario Emmos might have been concurrently modified without creating new versions of entities demand a more sophisticated handling of duplicate entities on the basis of a timestamp protocol. Depending on the modes of two entities with the same OID in the bundle, and the local database and the timestamps of both entities, essentially the following treatment is applied: A greater mode (weak < hollow < strong) in combination with a more recent timestamp always wins. Thus, if the local entity has a greater mode and a newer

timestamp, it prevails, and the entity in the bundle is ignored. Similarly, if the local entity has a lesser mode and an older timestamp, the entity in the bundle completely replaces the local entity in the database. If the local entity has a more recent timestamp but a lesser mode, additional data available for the entity in the bundle (entity types, attribute values, predecessor or successor versions, encapsulated entities in case of Emmos, or media profiles in case of logical media parts) complements the data of the local entity, thereby raising its mode. In case of same modes but a more recent timestamp of the entity in the bundle, the entity in the bundle completely replaces the local entity in the database. In case of same modes but a more recent timestamp of the entity in the local database, the entity in the database prevails and the entity in the bundle is ignored.
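The duplicate-handling rules lend themselves to a small decision function. The sketch below encodes them directly from the description above; the enum and method names are assumptions, and the case of a local entity with a greater mode but an older timestamp, which the text does not spell out, is resolved here in favor of the local entity.

// Encodes the duplicate-handling rules described above as a decision function.
public class ImportResolutionSketch {

    enum Mode { WEAK, HOLLOW, STRONG }          // weak < hollow < strong
    enum Decision { KEEP_LOCAL, REPLACE_LOCAL, COMPLEMENT_LOCAL }

    static Decision resolve(Mode localMode, long localTime, Mode bundleMode, long bundleTime) {
        int modeCmp = localMode.compareTo(bundleMode);
        boolean localNewer = localTime > bundleTime;

        if (modeCmp > 0 && localNewer) return Decision.KEEP_LOCAL;       // greater mode, newer: local wins
        if (modeCmp < 0 && !localNewer) return Decision.REPLACE_LOCAL;   // lesser mode, older: bundle wins
        if (modeCmp < 0 && localNewer) return Decision.COMPLEMENT_LOCAL; // newer but weaker: merge extra data
        if (modeCmp == 0) return localNewer ? Decision.KEEP_LOCAL : Decision.REPLACE_LOCAL;
        return Decision.KEEP_LOCAL;   // greater mode but older: not spelled out in the text
    }

    public static void main(String[] args) {
        System.out.println(resolve(Mode.WEAK, 100, Mode.STRONG, 200));   // REPLACE_LOCAL
        System.out.println(resolve(Mode.STRONG, 200, Mode.WEAK, 100));   // KEEP_LOCAL
        System.out.println(resolve(Mode.WEAK, 200, Mode.STRONG, 100));   // COMPLEMENT_LOCAL
    }
}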

APPLICATIONS
Having introduced and described the Emmo approach to semantic multimedia content modeling and the Emmo container infrastructure, this section illustrates how these concepts have been practically applied in two concrete multimedia content sharing and collaborative applications. The first application named CULTOS is in the domain of cultural heritage and the second application introduces a semantic jukebox.

CULTOS
CULTOS is an European Union (EU)-funded project carried out from 2001 to 2003 with 11 partners from EU-countries and Israel1. It has been the task of CULTOS to develop a multimedia collaboration platform for authoring, managing, retrieving, and exchanging Intertextual Threads (ITTs) (Benari et al., 2002; Schellner et al., 2003) knowledge structures that semantically interrelate and compare cultural artifacts such as literature, movies, artworks, and so forth. This platform enables the community of intertextual studies to create and exchange multimedia-enriched pieces of cultural knowledge that incorporate the communitys different cultural backgrounds an important contribution to the preservation of European cultural heritage. ITTs are basically graph structures that describe semantic relationships between cultural artifacts. They can take a variety of forms, ranging from spiders over centipedes to associative maps, like the one shown in Figure 19. The example ITT depicted in the figure highlights several relationships of the poem The Fall by Tuvia Ribner to other works of art. It states that the poem makes reference to the 3rd book of Ovids Metamorphoses and that the poem is an ekphrasis of the painting Icarus Fall by the famous Dutch painter Breugel. The graphical representation of an ITT bears strong resemblance to well-known techniques for knowledge representation such as concept graphs or semantic nets, although it lacks their formal rigidity. ITTs nevertheless get very complex, as they commonly make use of constructs such as encapsulation and reification of statements that are challenging from the perspective of knowledge representation.


Figure 19. Simple intertextual thread


(The diagram shows that the poem "The Fall" by Ribner (Text) references "Metamorphoses" by Ovid, Book 3 (Text) and is an ekphrasis of "Icarus' Fall" by Breugel (Painting).)

Encapsulation is intrinsic to ITTs because intertextual studies are not exact sciences. Certainly, the cultural and personal context of a researcher affects the kind of relationships between pieces of literature he discovers and are of value to him. As such different views onto a single subject are highly interesting to intertextual studies, ITTs themselves can be relevant subjects of discourse and thus be contained as first-class artifacts within other ITTs. Figure 20 illustrates this point with a more complex ITT that interrelates two ITTs manifesting two different views on Ribners poem as opposed representations. Reification of statements is also frequently occurring within ITTs. Since experts in intertextual studies extensively base their position on the position of other researchers, statements about statements are common practice within ITTs. In the ITT of Figure 20, for instance, it is expressed by reification that the statement describing the two depicted ITTs as opposed representation is only the opinion of a certain researcher B. Zoa. Given these characteristics of ITTs, we have found that Emmos are very well suited for their representation in the multimedia collaboration platform for intertextual studies that is envisioned by CULTOS. Firstly, the semantic aspect of Emmo offers sufficient

Figure 20. Complex intertextual thread



Figure 21. Emmo representing an ITT


expressiveness to capture ITTs. Figure 21 shows how the complex ITT of Figure 20 could be represented using Emmos. Due to the fact that associations as well as Emmos themselves are first-class entities, it is even possible to cope with reification of statements as well as with encapsulation of ITTs. Secondly, the media aspect of Emmos allows researchers to enrich ITTs that so far expressed interrelationships between cultural artefacts on an abstract level with digital media about these artefacts, such as a JPEG image showing Breugels painting, Icarus Fall. The ability to consume these media while browsing an ITT certainly enhances the comprehension of the ITT and the relationships described therein. Thirdly, with the functional aspect of Emmos, functionality can be attached to ITTs. For instance, an Emmo representing an ITT in CULTOS offers operations to render itself in an HTML-based hypermedia view. Additionally, our Emmo container infrastructure outlined in the previous section provides a suitable foundation for the realization of the CULTOS platform. Their ability to persistently store Emmos as well as their interfaces which enable applications to finegrainedly traverse and manipulate the stored Emmos and invoke their operations make Emmo containers an ideal ground for the authoring and browsing applications for ITTs that had to be implemented in the CULTOS project. Figure 22 gives a screenshot of the authoring tool for ITTs that has been developed in the CULTOS project which runs on top of an Emmo container. Moreover, their decentralized approach allows the setup of independent Emmo containers at the sites of different researchers; their ability to import and export Emmos with all the aspects they cover facilitates the exchange of ITTs, including the media by


Figure 22. CULTOS authoring tool for ITTs

which they are enriched as well as the functionality they offer. This enables researchers to share and collaboratively work on ITTs in order to discover and establish new links between artworks as well as different personal and cultural viewpoints, thereby paving the way to novel insights to a subject. The profound versioning within the Emmo model further enhance this kind of collaboration, allowing researchers to concurrently create different versions of an ITT at different sites, to merge these versions, and to highlight differences between these versions.

Semantic Jukebox
One of the most prominent (albeit legally disputed) multimedia content sharing applications is the sharing of MP3 music files. Using peer-to-peer file sharing infrastructures such as Gnutella, many users gather large song libraries on their home PCs which


they typically manage with one of the many jukebox programs available, such as Apples iTunes (Apple Computer, n.d.). The increasing use of ID3 tags (ID3v2, n.d.) optional free text attributes capturing metadata like the interpreter, title, and the genre of a song - within MP3 files for song description alleviates the management of such libraries. Nevertheless, ID3-based song management quickly reaches its limitations. While ID3 tags enable jukeboxes to offer reasonably effective search functionality for songs (provided the authors of ID3 descriptions spell the names of interprets, albums, and genres consistently), more advanced access paths to song libraries are difficult to realize. Apart from other songs of the same band or genre, for instance, it is difficult to find songs similar to the one that is currently playing. In this regard, it would also be interesting to be able to navigate to other bands in which artists of the current band played as well or with which the current band appeared on stage together. But such background knowledge cannot be captured with ID3 tags. Using Emmos and the Emmo container infrastructure, we have implemented a prototype of a semantic jukebox that considers background knowledge about music. The experience we have gained from this prototype shows that the Emmo model is well-suited to represent knowledge-enriched pieces of music in a music sharing scenario. Figure 23 gives a sketch of such a music Emmo which holds some knowledge about the song Round Midnight.

Figure 23. Knowledge about the song Round Midnight represented by an Emmo

(The diagram shows the logical media part "Round Midnight", with an MP3 media profile at http://.../roundmid.mp3, typed as a Composition composed by the Artist Thelonious Monk; the Performance played by Miles Davis is assigned to the Record "Round about Midnight" with date of issue 10/26/1955; an operation with designator Rendering is implemented by RenderAsTimelineInSVG.)


Its media aspect enables the depicted Emmo to act as a container of MP3 music files. In our example, this is a single MP3 file with the song Round Midnight that is connected as a media profile to the logical media part Round Midnight in the center of the figure. The Emmos semantic aspect allows us to express rich background knowledge about music files. For this purpose, we have developed a basic ontology for the music domain featuring concepts such as Artist, Performance, Composition, and Record that all appear as ontology objects in the figure. The ontology also features various association types which allow us to express that Round Midnight was composed by Thelonious Monk and the particular performance by Miles Davis can be found on the record Round about Midnight. The ontology also defines attributes for expressing temporal information like the issue date of a record. The functional aspect, finally, enables the Emmo to support different renditions of the knowledge it contains. To demonstrate this, we have realized an operation that, being passed a time interval as its parameter, produces an SVG timeline rendition (see screenshot of Figure 24) arranging important events like the foundation of bands, the birthdays and days of death of artists, and so forth, around a timeline. More detailed information for each event can be gained by clicking on the particular icons on the timeline. Further operations could be imagined; for example, operations that provide rights clearance functionality for the music files contained in the Emmo, which is a crucial issue in music sharing scenarios.

Figure 24. Timeline rendition of a music Emmo


Our Emmo container infrastructure provides a capable storage foundation for semantic jukeboxes. Their ability to fine-grainedly manage Emmos, together with their scalability, allows them to be deployed both in small-scale file-based and in large-scale database server configurations. Thus, Emmo containers constitute suitable hosts for knowledge-enriched music libraries of private users as well as libraries of professional institutions such as radio stations. Capable of exporting and importing Emmos to and from bundles, Emmo containers also facilitate the sharing of music between different jukeboxes. Their versioning support even makes it possible to move from mere content sharing scenarios to collaborative scenarios where different users cooperate to enrich and edit Emmos with their knowledge about music.

CONCLUSION
Current approaches to semantic multimedia content modeling typically regard the basic media which the content comprises, the description of these media, and the functionality of the content as conceptually separate entities. This leads to difficulties with multimedia content sharing and collaborative applications. In reply to these difficulties, we have proposed Enhanced Multimedia Meta Objects (Emmos) as a novel approach to semantic multimedia content modeling. Emmos coalesce the media of which multimedia content consists, their semantic descriptions, as well as functionality of the content into single indivisible objects. Emmos in their entirety are serializable and versionable, making them a suitable foundation for multimedia content sharing and collaborative applications. We have outlined a distributed container infrastructure for the persistent storage and exchange of Emmos. We have illustrated how Emmos and the container infrastructure were successfully applied for the sharing and collaborative authoring of multimedia-enhanced intertextual threads in the CULTOS project and for the realization of a semantic jukebox. We strive to extend the technological basis of Emmos. We are currently developing a query algebra, which permits declarative querying of all the aspects of multimedia content captured by Emmos, and integrating this algebra within our Emmo container implementation. Furthermore, we are wrapping the Emmo containers as services in a peerto-peer network in order to provide seamless search for and exchange of Emmos in a distributed scenario. We also plan to develop a language for the definition of ontologies that is adequate for use with Emmos. Finally, we are exploring the handling of copyright and security within the Emmo model. This is certainly necessary as Emmos might not just contain copyrighted media material but also carry executable code with them.

REFERENCES

Apple Computer (n.d.). iTunes. Retrieved 2004 from http://www.apple.com Ayars, J., Bulterman, D., Cohen, A., et al. (2001). Synchronized multimedia integration language (SMIL 2.0). W3C Recommendation, World Wide Web Consortium (W3C). Baumeister, S. (2002). Enterprise media beans TM specification. Public Draft Version 1.0, IBM Corporation.


Benari, M., Ben-Porat, Z., Behrendt, W., Reich, S., Schellner, K., & Stoye, S. (2002). Organizing the knowledge of arts and experts for hypermedia presentation. Proceedings of the Conference of Electronic Imaging and the Visual Arts, Florence, Italy. Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American. Boll, S., Klas, W., & Westermann, U. (2000). Multimedia document formats - sealed fate or setting out for new shores? Multimedia - Tools and Applications, 11(3). Brickley, D., & Guha, R.V. (2002). Resource description framework (RDF) vocabulary description language 1.0: RDF Schema. W3C Working Draft, World Wide Web Consortium (W3C). Chang, H., Hou, T., Hsu, A., & Chang, S. (1995). Tele-Action objects for an active multimedia system. Proceedings of the International Conference on Multimedia Computing and Systems (ICMCS 1995), Ottawa, Canada. Chang, S., & Znati, T. (2001). Adlet: An active document abstraction for multimedia information fusion. IEEE Transactions on Knowledge and Data Engineering, 13(1). Daniel, R., Lagoze, D., & Payette, S. (1998). A metadata architecture for digital libraries. Proceedings of the Advances in Digital Libraries Conference, Santa Barbara, California. Fensel, D. (2001). Ontologies: A silver bullet for knowledge management and electronic commerce. Heidelberg: Springer. Ferraiolo, J., Jun, F., & Jackson, D. (2003). Scalable vector graphics (SVG) 1.1. W3C Recommendation, World Wide Web Consortium (W3C). Gnutella (n.d.). Retrieved 2003 from http://www.gnutella.com Grimson, J., Stephens, G., Jung, B., et al. (2001). Sharing health-care records over the internet. IEEE Internet Computing, 5(3). ID3v2 (n.d.). [Computer software]. Retrieved 2004 from http://www.id3.org ISO/IEC JTC 1/SC 29 (1997). Information technology - Coding of hypermedia information - part 5: support for base-level interactive applications. ISO/IEC International Standard 13522-5:1997, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC). ISO/IEC JTC 1/SC 29/WG 11 (2001). Information technology - Multimedia content description interface - part 5: Multimedia description schemes. ISO/IEC Final Draft International Standard 15938-5:2001, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC). ISO/IEC JTC 1/SC 34/WG 3 (1997). Information technology - Hypermedia/time-based structuring language (HyTime). ISO/IEC International Standard 15938-5:2001, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC). ISO/IEC JTC 1/SC 34/WG 3 (2000). Information technology - SGML applications - topic maps. ISO/IEC International Standard 13250:2000, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC). ISO/JTC1/SC 32/WG 2 (2001). Conceptual graphs. ISO/IEC International Standard, International Organization for Standardization/International Electrotechnical Commission (ISO/IEC).


Lagoze, C., Lynch, C., & Daniel, R. (1996). The warwick framework: A container architecture for aggregating sets of metadata. Technical Report TR 96-1593, Cornell University, Ithaca, New York. Lassila, O., & Swick, R.R. (1999). Resource description framework (RDF) model and syntax specification. W3C Recommendation, World Wide Web Consortium (W3C). Leach, P. J. (1998, February). UUIDs and GUIDs. Network Working Group Internet-Draft, The Internet Engineering Task Force (IETF). Matena, V., & Hapner, M. (1998). Enterprise Java Beans TM. Specification Version 1.0, Sun Microsystems Inc. Nejdl, W., Wolf, B., Qu, C., et al. (2002). EDUTELLA: A P2P networking infrastructure based on RDF. Proceedings of the Eleventh International World Wide Web Conference (WWW 2002), Honolulu, Hawaii. Newmann, D., Patterson, A., & Schmitz, P. (2002). XHTML+SMIL profile. W3C Note, World Wide Web Consortium (W3C). Pereira, F., & Ebrahimi T., (Eds.) (2002). The MPEG-4 book. CA: Pearson Education Reich, S., Behrendt, W., & Eichinger, C. (2000). Document models for navigating digital libraries. Proceedings of the Kyoto International Conference on Digital Libraries, Orlando, Kyoto, Japan. Raggett, D., Le Hors, A., & Jacobs, I. (1999). HTML 4.01 specification. W3C Recommendation, World Wide Web Consortium (W3C).

ENDNOTE
1. See http://www.cultos.org for more details on the project.


Chapter 14

Semantically Driven Multimedia Querying and Presentation


Isabel F. Cruz, University of Illinois, Chicago, USA
Olga Sayenko, University of Illinois, Chicago, USA

ABSTRACT

Semantics can play an important role in multimedia content retrieval and presentation. Although a complete semantic description of a multimedia object may be difficult to generate, we show that even a limited description can be explored so as to provide significant added functionality in the retrieval and presentation of multimedia. In this chapter we describe DelaunayView, a system that supports distributed and heterogeneous multimedia sources and proposes a flexible, semantically driven approach to the selection and display of multimedia content.

INTRODUCTION
The goal of a semantically driven multimedia retrieval and presentation system is to explore the semantics of the data so as to provide the user with rich selection criteria and an expressive set of relationships among the data, which will enable the meaningful extraction and display of the multimedia objects. The major obstacle in developing such a system is the lack of an accurate and simple way of extracting the semantic content that is encapsulated in multimedia objects and in their inter-relationships. However, metadata that reflect multimedia semantics may be associated with multimedia content. While
metadata may not be equivalent to an ideal semantic description, we explore and demonstrate its possibilities in our proposed framework. DelaunayView is envisioned as a system that allows users to retrieve multimedia content and interactively specify its presentation using a semantically driven approach. DelaunayView incorporates several ideas from the earlier systems Delaunay (Cruz & Leveille, 2000) and DelaunayMM (Cruz & James, 1999). In the DelaunayView framework, multimedia content is stored in autonomous and heterogeneous sources annotated with metadata descriptions in resource description framework (RDF) format (Klyne & Carroll, 2004). One such source could be a database storing scientific aerial photographs and descriptions of where and when the photographs were taken. The framework provides tools for specifying connections between multimedia items that allow users to create an integrated virtual multimedia source that can be queried using RQL (Karvounarakis et al., 2002) and keyword searches. For example, one could specify how a location attribute from the aerial photo database maps to another location attribute of an infrared satellite image database so that a user can retrieve images of the same location from both databases. In DelaunayView, customizable multimedia presentation is enabled by a set of graphical interfaces that allow users to bind the retrieved content to presentation templates (such as slide sorters or bipartite graphs), to specify content layout on the screen, and to describe how the dynamic visual interaction among multimedia objects can reflect the semantic relationships among them. For example, a user can specify that aerial photos will be displayed in a slide sorter on the left of the workspace, satellite images in another slide sorter on the bottom of the workspace, and that when a user selects a satellite image, the aerial photos will be reordered so that the photos related to the selected image appear first in the sorter. In this paper we describe our approach to multimedia querying and presentation and focus on how multimedia semantics can be used in these activities. In Background we discuss work in multimedia presentation, retrieval, and description; we also introduce concepts relating to metadata modeling and storage. In A Pragmatic Approach to Multimedia Presentation, we present a case study that illustrates the use of our system and describe the system architecture. In Future Work we describe future research directions and summarize our findings in Conclusions.
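To give a flavour of the kind of retrieval this architecture supports, the following RQL-style sketch pairs aerial photographs and satellite images of the same location once their location attributes have been mapped to each other. The class and property names (aerialPhoto, satelliteImage, location) are hypothetical and are not taken from an actual schema in this chapter:

select P, S
from aerialPhoto{P}, satelliteImage{S}, {P}location{L1}, {S}location{L2}
where L1 = L2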

BACKGROUND
A multimedia presentation system relies on a number of technologies for describing, retrieving and presenting multimedia content. XML (Bray et al., 2000) is a widely accepted standard for interoperable information exchange. MPEG-7 (Martinez, 2003; Chang et al., 2001) makes use of XML to create rich and flexible descriptions of multimedia content. DelaunayView relies on multimedia content descriptions for the retrieval and presentation of content, but it uses RDF (Klyne & Carroll, 2004) rather than XML. We chose RDF over XML because of its richer modeling capabilities, whereas in other components of the Delaunay View system we have used XML (Cruz & Huang, 2004). XML specifies a way to create structured documents that can be easily exchanged over the Web. An XML document contains elements that encapsulate data. Attributes may be used to describe certain properties of the elements. Elements participate in

hierarchical relationships that determine the document structure. XML Schema (Fallside, 2001) provides tools for defining elements, attributes, and document structure. One can define typed elements that act as building blocks for a particular schema. XML Schema also supports inheritance, namespaces, and uniqueness. MPEG-7 (Martínez, 2003) defines a set of tools for creating rich descriptions of multimedia content. These tools include Descriptors, Description Schemes (DS) (Salembier & Smith, 2001), and the Description Definition Language (DDL) (Hunter, 2001). MPEG-7 descriptions can be expressed in XML or in binary format. Descriptors represent low-level features such as texture and color that can be extracted automatically. Description Schemes are composed of multiple Descriptors and Description Schemes to create more complex descriptions of the content. For example, the MediaLocator DS describes the location of a multimedia item. The MediaLocator is composed of the MediaURL descriptor and an optional MediaTime DS: the former contains the URL that points to the multimedia item, while the latter is meaningful in the case where the MediaLocator describes an audio or a video segment. Figure 1 shows an example of a MediaLocator DS and its descriptors RelTime and Duration that, respectively, describe the start time of a segment relative to the beginning of the entire piece and the segment duration. The Resource Description Framework (RDF) offers an alternative approach to describing multimedia content. An RDF description consists of statements about resources and their properties. An RDF resource is any entity identifiable by a URI. An RDF statement is a triple consisting of subject, predicate, and object. The subject is the resource about which the statement is being made. The predicate is the property being described. The object is the value of this property. RDF Schema (RDFS) (Brickley & Guha, 2001) provides mechanisms for defining resource classes and their properties. If an RDF document conforms to an RDF schema (expressed in RDFS), resources in the document belong to classes defined in the schema. A class definition includes the class name and a list of class properties. A property definition includes the domain (the subject of the corresponding RDF triple) and the range (the object of the triple). The RDF Query Language (RQL) (Karvounarakis et al., 2002) is a query language for RDF and RDFS documents. It supports a select-from-where structure, basic queries and iterators that combine the basic queries into nested and aggregate queries, and generalized path expressions. MPEG-7 Description Schemes and DelaunayView differ in their approach to multimedia semantics. In MPEG-7, semantics are represented as a distinct description scheme that is narrowly aimed at narrative media. The Semantic DS includes description schemes

Figure 1. MediaLocator description scheme

for places, objects, events, and agents people, groups, or other active entities that operate within a narrative world associated with a multimedia item. In DelaunayView any description of multimedia content can be considered a semantic description for the purposes of multimedia retrieval. DelaunayView recognizes that, depending on the application, almost any description may be semantically valuable. For example, an aerial photo of the arctic ice sheet depicts some areas of intact ice sheet and others of open water. Traditional image processing techniques can be applied to the photo to extract light and dark regions that represent ice and open water respectively. In the climate research domain, the size, shape, and locations of these regions constitute the semantic description of the image. In MPEG-7 however, this information will be described with the StillRegion DS, which does not carry semantic significance. Beyond their diverse perspectives on the nature of the semantics that they incorporate, MPEG-7 and DelaunayView use different approaches to representing semantic descriptions: MPEG-7 uses XML, while DelaunayView uses RDF. An XML document is structured according to the tree paradigm: each element is a node and its children are the nodes that represent its subelements. An RDF document is structured according to the directed graph paradigm: each resource is a node and each property is a labeled directed edge from the subject to the object of the RDF statement. Unlike XML, where schema and documents are separate trees, an RDF document and its schema can be thought of as a single connected graph. This property of RDF enables straightforward implementation of more powerful keyword searches as a means of selecting multimedia for presentation. Thus using RDF as an underlying description format gives users more flexibility in selecting content for presentation. Another distinctive feature between MPEG-7 and DelaunayView is the focus of the latter on multimedia presentation. A reference model for intelligent multimedia presentation systems encompasses an architecture consisting of control, content, design, realization, and presentation display layers (Bordegoni et al., 1997). The user interacts with the control layer to direct the process of generating the presentation. The content layer includes the content selection component that retrieves the content, the media allocation component that determines in what form content will be presented, and ordering components. The design layer produces the presentation layout and further defines how individual multimedia objects will be displayed. The realization layer produces the presentation from the layout information provided by the design layer. The presentation display layer displays the presentation. Individual layers interact with a knowledge server that maintains information about customization. LayLab demonstrates an approach to multimedia presentation that makes use of constraint solving (Graf, 1995). This approach is based on primitive graphical constraints such as under or beside that can be aggregated into complex visual techniques (e.g., alignment, ordering, grouping, and balance). Constraint hierarchies can be defined to specify design alternatives and to resolve overconstrained states. Geometrical placement heuristics are constructs that combine constraints with control knowledge. Additional work in multimedia presentation and information visualization can be found in Baral et al. (1998), Bes et al. 
(2001), Cruz and Lucas (1997), Pattison and Phillips (2001), Ram et al. (1999), Roth et al. (1996), Shih and Davis (1997), and Weitzman and Wittenburg (1994).
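Returning to the two description formats compared above, the fragments below sketch how a single video segment might be described in each. They are illustrative only: the XML element names follow the chapter's prose account of the MediaLocator DS (MediaURL, MediaTime, RelTime, Duration) rather than the normative MPEG-7 schema, and the RDF resource and property names are invented.

An MPEG-7-style XML description is tree structured:

<MediaLocator>
  <MediaURL>http://example.org/media/interview.mpg</MediaURL>
  <MediaTime>
    <RelTime>PT0H5M30S</RelTime>
    <Duration>PT0H1M15S</Duration>
  </MediaTime>
</MediaLocator>

An equivalent RDF description is a set of (subject, predicate, object) triples that together form a graph:

(ex:interviewSegment1, ex:mediaURL, "http://example.org/media/interview.mpg")
(ex:interviewSegment1, ex:relTime, "PT0H5M30S")
(ex:interviewSegment1, ex:duration, "PT0H1M15S")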


A PRAGMATIC APPROACH TO MULTIMEDIA PRESENTATION


In our approach to the design of a multimedia presentation system, we address the following challenges. Multimedia content is resident in distributed, heterogeneous, and autonomous sources. However, it is often necessary to access content from multiple sources. The data models and the design of the sources vary widely and are decided upon autonomously by the various entities that maintain them. Our approach accommodates this diversity by using RDFS to describe the multimedia sources in a simple and flexible way. The schemata are integrated into a single global schema that enables users to access the distributed and autonomous multimedia sources as if they were a single source. Another challenge is that the large volume of multimedia objects presented to the user makes it difficult to perceive and understand the relationships among them. Our system gives users the ability to construct customized layouts, thus making the semantic relationships among multimedia objects more obvious.

Case Study
This case study illustrates how multimedia can be retrieved and presented in an integrated view workspace, using as an example a bill of materials for the aircraft industry. A bill of materials is a list of parts or components required to build a product. In Figure 2, the manufacturing of commercial airplanes is being planned using a coordinated visualization composed of three views: a bipartite graph, a bar chart, and a slide sorter.

Figure 2. A coordinated integrated visualization

The bipartite graph illustrates the part-subpart relationship between commercial aircraft and their engines, the bar chart displays the number of engines currently available in the
inventory of a plant or plants, and the slide sorter shows the maps associated with the manufacturing plants. First, the user constructs a keyword query using the Search Workspace to obtain a data set. This process may be repeated several times to get data sets related to airplanes, engines, and plants. The user can preview the data retrieved from the query, further refine the query, and name the data set for future use. Then, relationships are selected (if previously defined) or defined among the data sets, using metadata, a query, or user annotations. In the first two cases, the user selects a relationship that was provided by the integration layer. An example of such a relationship would be the connection that is established between the attribute engine of the airplane data set (containing one engine used in that airplane) and the engine data set. Other more complex relationships can be established using an RQL query. Yet another type of relationship can be a connection that is established by the user. This interface is shown in Figure 3. In this figure and those that follow, the left panel contains the overall navigation mechanism associated with the interface, allowing for any other step of the querying or visualization process to be undertaken. Note that we chose the bipartite component to provide visual feedback when defining binary relationships. This is the same component that is used for the display of bipartite graphs. The next step involves creating the views, which are built using templates. A data set can be applied to different templates to form different views. The interface of Figure 4 illustrates a slide sorter of the maps where the manufacturers of aircraft engines are located. In this process, data attributes of the data set are bound to visual attributes of the visual template. For example, the passenger capacity of a plane can be applied to the height of a bar chart. The users also can further change the view to conform to their preferences, for example, by changing the orientation of a bar chart from vertical to horizontal. The sorter allows the thumbnails to be sorted by the values of any of the attributes of the objects that are depicted by the thumbnails. Individual views can be laid out anywhere on the panel as shown in Figure 5. The user selects the kind of dynamic interaction between every pair of views by using a simple customization panel. Figure 3. Relation workspace


Figure 4. Construction of a view

Figure 5. View layout

In the integrated view, the coordination between individual views has been established. By selecting a manufacturing plant in the slide sorter, the bar displays the inventory situation of the selected plant; for example, the availability of each type of airplane engine. By selecting more plants in the sorter, the bar chart can display the aggregate number of available engines over several plants for each type of airplane engine. There are two ways of displaying relationships: they can be either represented within the same visualization (as in the bipartite graph of Figure 2) or as a dynamic relationship between two different views, as in the interaction between the bar chart and sorter views. Other interactions are possible in our case study. For example, the bipartite graph can also react to the user selections on the sorter. As more selections of plants
are performed on the sorter, different types of engines produced by the selected manufacturer(s) appear highlighted. Moreover, the bipartite graph view can be refreshed to display only the relationship between the corresponding selected items in the two data sets.
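The case study notes that more complex relationships can be established with an RQL query. As an indication of what such a relationship might look like, the following sketch pairs each airplane with its engine by matching the airplane's power-plant value against the engine's name; the property names anticipate Example 2 later in the chapter, but the query itself is illustrative rather than taken from the system:

select A, E
from {A}power-plant{C}, {E}name{N}
where C = N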

System Architecture

The DelaunayView system is composed of the presentation, integration, and data layers. The data layer consists of a number of autonomous and heterogeneous multimedia data sources that contain images and metadata. The integration layer connects the individual sources into a single integrated virtual source that makes multimedia from the distributed sources available to the presentation layer. The presentation layer includes user interface components that allow for users to query the multimedia sources and to specify how the images returned by the queries should be displayed and what should be the interaction among those images.

Data Layer

The data layer is comprised of a number of autonomous multimedia sources that contain images annotated with metadata. An image has a number of attributes associated with it. First, there are the low-level features that can be extracted automatically. In addition, there are the application-dependent attributes that may include timestamps, provenance, text annotations, and any number of other relevant characteristics. All of these attributes determine the semantics of the image in the application context. Image attributes are described by an RDF schema. When an image is added to a multimedia source, it is given a unique identifier and stored in a binary string format. A document fragment containing image metadata is created and stored in the database.

Example 1: Let us consider a multimedia source that contains aerial photos of the Arctic ice sheet used in climate research. The relevant metadata attributes include the date and time a photo was taken and the latitude and longitude of the location where it was taken. A photo taken on 08/22/2003 14:07:23 at 81°48'N, 140°E is represented in the following way: the image file is stored in table arctic (Figure 6) with identifier QZ297492. The RDF document fragment containing the metadata and referencing the image is shown in Figure 7.

Figure 6. Table arctic

Figure 7. Arctic aerial photo metadata document

In Figure 7, Line 3 declares namespace aerial, which contains the source schema. Lines 5-8 contain the RDF fragment that describes the object-relational schema of table
arctic; arctic.imageId contains a unique identifier and arctic.imageValue contains the image itself. Note that only the attributes that are a part of the reference to the image are described; that is, arctic.source is omitted. Lines 9-14 describe the metadata attributes of an aerial photo. They include timestamp, longitude, and latitude. The property reference does not describe a metadata attribute, but rather acts as a reference to the image object. The source schema is shown in Figure 8. Lines 4-12 define class imageLocation with properties key and value. Lines 13-29 define class image with properties timestamp, longitude, latitude and reference. Every time an image is added to the source an RDF fragment conforming to this schema is created and stored in the database. Although each fragment will contain a description of table arctic, this description will be stored only once. We use the RDFSuite (Alexaki et al., 2000) to store RDF and RDFS data. The RDFSuite provides both persistent storage for the RDF and RDFS data and the implementation of the RQL language. The RDFSuite translates RDF data and schemata into object-relational format and stores them in a PostgreSQL database. The RQL interpreter generates SQL queries over the object-relational representation of RDF data and schema and processes RQL path expressions.
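As an illustration of the structures described above, a metadata fragment like the one in Figure 7 might be serialized along the following lines; the namespace URI, the nesting of the reference property, and the line numbering are assumptions and will not match the original figure:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aerial="http://example.org/schemas/aerial#">
  <aerial:image rdf:about="http://example.org/arctic/QZ297492">
    <aerial:timestamp>08/22/2003 14:07:23</aerial:timestamp>
    <aerial:latitude>81°48'N</aerial:latitude>
    <aerial:longitude>140°E</aerial:longitude>
    <aerial:reference>
      <aerial:imageLocation>
        <aerial:key>QZ297492</aerial:key>
        <aerial:value>arctic.imageValue</aerial:value>
      </aerial:imageLocation>
    </aerial:reference>
  </aerial:image>
</rdf:RDF>

The schema of Figure 8 might declare the classes imageLocation and image and their properties in RDFS along these lines (namespace declarations omitted, again as a sketch only):

<rdfs:Class rdf:ID="imageLocation"/>
<rdfs:Class rdf:ID="image"/>
<rdf:Property rdf:ID="timestamp">
  <rdfs:domain rdf:resource="#image"/>
</rdf:Property>
<rdf:Property rdf:ID="reference">
  <rdfs:domain rdf:resource="#image"/>
  <rdfs:range rdf:resource="#imageLocation"/>
</rdf:Property>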

Figure 8. RDFS schema describing arctic photo metadata

Integration Layer

The integration layer combines all multimedia sources into a single integrated virtual source. In the context of this layer, a multimedia source is a local source and its source schema is a local schema. The integrated virtual source is described by the global schema, which is obtained as a result of the integration of the sources. DelaunayView uses foreign key relationships to connect individual sources into the integrated virtual source. Implicit foreign key relationships exist between local sources, but they only become apparent when all local sources are considered as a whole. The global schema is built by explicitly defining foreign key relationships. A sequence of foreign key definitions
yields a graph where the local schemata are the subgraphs and the foreign key relationships are the edges that connect them. The foreign key relationships are defined with the help of the graphical integration tool of Figure 9. This tool provides a simple graphical representation of the schemata that are present in the system and enables the user to specify foreign key relationships between them. When the user imports a source into the system, its schema is represented on the left-hand side panel as a box. Individual schemata are displayed on the right-hand side pane as trees. The user defines a foreign key by selecting a node in each schema that participates in the relationship, and connecting them by an edge. Figure 9 shows how a foreign key relationship is defined between airplane and engine schemata. The edge between engine and name represents that relationship. The graphical integration tool generates an RDF document that describes all the foreign key relationships defined by the user. The integration layer contains the mediator engine and the schema repository. The mediator engine receives queries from the presentation layer, issues queries to the data

Figure 9. Graphical integration tool

sources, and passes results back to the presentation layer. The schema repository contains the description of the global schema and the mappings from global to local schemata. The mediator engine receives queries in terms of the global schema, global queries, and translates them into queries in terms of the local schemata of the individual sources, local queries, using the information available from the schema repository. We demonstrate how local queries are obtained by the following example. Example 2: The engine database and the airplane database are two local sources and engine name connects the local schemata. In the airplane schema (Figure 10), engine name is a foreign key and is represented by the property power-plant and in the engine schema (Figure 11) it is the key and is represented by the property name. Mappings from the global schema (Figure 12) to the local schema have the form ([global name], ([local name], [local schema])). We say that a class or a property in the global schema, x, maps to a local schema S when (x, (y, S)) is in the set of the mappings. For this example, this set is: (airplane, (airplane, S1)), (type, (type, S1)), (power-plant, (power-plant, S1)), (power-plant, (name, S2)), (engine, (engine, S2)), (thrust, (thrust, S2)), (name, (name, S2)) All the mappings are one-to-one, except for the power-plant property that connects the two schemata; power-plant belongs to a set of foreign key constraints maintained

Figure 10. Airplane schema S1

Figure 11. Engine schema S2

Figure 12. Global schema

by the schema repository. These constraints are used to connect the set of results from the local queries. The global query QG returns the types of airplanes that have engines with a thrust of 115,000 lb:

select B
from {A}type{B}, {A}power-plant{C}, {C}thrust{D}
where D = 115000 lbs

The mediator engine translates QG into QL1, which is a query over the local schema S1, and QL2, which is a query over the local schema S2. The from clause of QG contains three path expressions: {A}type{B}, which contains property type that maps to S1; {A}power-plant{C}, which contains property power-plant that maps both to S1 and to S2; and {C}thrust{D}, which contains property thrust that maps to S2. To obtain the from clause of a local query, the mediator engine selects those path expressions that contain classes or properties that map to the local schema. The from clause of QL1 is: {A}type{B}, {A}power-plant{C}. Similarly, the where clause of a local query contains only those variables of the global where clause that appear in the local from clause. D, which is the only variable in the global where clause, does not appear in the from clause of QL1, so the where clause of QL1 is absent. The select clause of a local query includes variables that appear in the global select clause and in the local from clause. B is a part of the global select clause and it appears in the from clause of QL1, so it will appear in the select clause as well. In addition to the variables from the global select clause, a local select clause contains variables that are necessary to perform a join of the local results in order to obtain the global result. These are the variables in the local from clause that refer to elements of the foreign key constraint set. C refers to the value of power-plant, which is the only foreign key constraint, so C is included in the local select clause. Therefore, QL1 is as follows:

select B, C
from {A}type{B}, {A}power-plant{C}

The from clause of QL2 should include {A}power-plant{C} and {C}thrust{D}; power-plant maps to name and thrust maps to thrust in S2. Since D in the global where clause maps to S2, the local where clause contains D and the associated constraint: D = 115000 lbs. The global select clause does not contain any variables that map to S2, so the local select clause contains only the foreign key constraint variable C. The intermediate version of QL2 is:

select C
from {C}thrust{D}, {A}power-plant{C}
where D = 115000 lbs

The intermediate version of QL2 contains {A}power-plant{C} because power-plant maps to name in S2. However, {A}power-plant{C} is a special case because it is a foreign key constraint: we must check whether variables A and C refer to resources that map to S2. A refers to airplane, therefore it does not map to S2 and {A}power-plant{C} should be removed from QL2. The final version of QL2 is:

select C
from {C}thrust{D}
where D = 115000 lbs

In summary, the integration layer connects the local sources into the integrated virtual source and makes it available to the presentation layer. The interface between the integration and presentation layers includes the global schema provided by the integration layer, the queries issued by the presentation layer, and the results returned by the integration layer.
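To make the join step concrete, consider hypothetical local results; the airplane types and engine names below are invented purely for illustration and do not come from the chapter. QL1, evaluated against S1, returns (B, C) pairs; QL2, evaluated against S2, returns the engine names C that satisfy the thrust constraint; the mediator then joins the two result sets on the shared foreign key variable C and projects B:

QL1 over S1: (B = Model-100, C = EngineX), (B = Model-200, C = EngineY)
QL2 over S2: (C = EngineY)
Join on C and project B: (B = Model-200)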

Presentation Layer

The presentation layer enables the user to query the distributed multimedia sources and to create complex multicomponent coordinated layouts to display the query results. The presentation layer sends user queries to the integration layer and receives the data sets, which are the query results. A view is created when a data set is attached to a presentation template that determines how the images in the data set are to be displayed. The user specifies the position and the orientation of the view and the dynamic interaction properties of views in the integrated layout. Images and metadata are retrieved from the multimedia sources by means of RQL queries to the RDF multimedia annotations stored at the local sources. In addition to RQL

Figure 13. Translation of a keyword search to an RQL query

queries, the user may issue keyword searches. A keyword search has three components: the keyword, the criteria, and the source. Any of the components is optional. A keyword will match the class or property names in the schema. The criteria match the values of properties in the metadata RDF document. The source restricts the results of the query to that multimedia source. The data sets returned by the integration layer are encapsulated in the data descriptors that associate the query, layout, and view coordination information with the data set. The following example illustrates how a keyword query gets translated into an RQL query: Example 3: The keyword search where keyword = airplane, criteria = Boeing, and source = aircraftDataSource returns resources that are of class airplane or have a property airplane, have the property value Boeing, and are located in the source aircraftDataSource. This search is translated into the RQL query of Figure 13 and sent to source aircraftDataSource by the integration layer. DelaunayView includes predefined presentation templates that allow the user to build customized views. The user chooses attributes of the data set that correspond to template visual attributes. For example, a view can be defined by attaching the arctic photo dataset (see Example 1) to the slide sorter template, setting the order-by property of the view to the timestamp attribute of the data set, and setting the image source property of the view to the reference attribute of the data set. When a tuple in the data set is to be displayed, image references embedded in it are resolved and images are retrieved from multimedia sources. The user may further customize views by specifying their orientation, position relative to each other, and coordination behavior. Views are coordinated by specifying a relationship between the initiating view and the destination view. The initiating view notifies the destination view of initiating events. An initiating event is the change of view state caused by a user action; selecting an image in a slide sorter, for example. The destination view responds to initiation events by changing its own state according to the reaction model selected by the user. Each template defines a set of initiating events and reaction models. In summary, semantics play a central role in DelaunayView architecture. The data layer makes semantics available as the metadata descriptions and the local schemata. The integration layer enables the user to define the global schema that adds to the semantics provided by the data layer. The presentation layer uses semantics provided by the data and integration layers for source querying, view definition, and view coordination.
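As a rough indication of the translation illustrated in Figure 13, the keyword search of Example 3 might be rendered as an RQL-style query of the following shape, in which @P is a property variable ranging over all properties; the sketch covers only the case where the keyword matches a class name, and the exact syntax generated by the system may differ:

select X
from airplane{X}, {X}@P{Y}
where Y = "Boeing"

The integration layer would then route this query only to the source aircraftDataSource.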


FUTURE WORK
Our future work will further address the decentralized nature of the data layer. DelaunayView can be viewed as a single node in a network of multimedia sources. This network can be considered from two different points of view. From a centralized perspective, the goal is to create a single consistent global schema with which queries can be issued to the entire network as if it were formed by a single database. From a decentralized data acquisition point of view, the goal is to answer a query submitted at one of the nodes. The network becomes relevant when data are required that are not present at the local node. In the centralized approach, knowledge of the entire global schema is required to answer a query while in the decentralized approach only the knowledge of paths to the required information is necessary. In the centralized approach, the global schema is static. Local sources are connected to each other one by one, resulting in the global schema that must be modified when a local schema is changed. Under the decentralized approach, the integration process can be performed at the time the query is created (automatically in an ideal system) by discovering the data available at the other nodes. A centralized global schema must resolve inconsistencies in schema and data in a globally optimal manner. Under the decentralized approach inconsistencies have to be resolved only at the level of that node. The goal of our future work will be to extend DelaunayView to a decentralized peerto-peer network. Under this architecture, the schema repository will connect to its neighbors to provide schema information to the mediator engine. Conceptually, a request for schema information will be recursively transmitted throughout the network to retrieve the current state of the distributed global schema, but our implementation will adapt optimization techniques from the peer-to-peer community to make schema retrieval efficient. The implementation of the mediator engine and the graphical integration tool will be modified to accommodate the new architecture. Another goal is to incorporate MPEG-7 Feature Extraction Tools into the framework. Feature extraction can be incorporated into the implementation of the graphical integration tool to perform automatic feature extraction on the content of the new sources as they are added to the system. This capability will add another layer of metadata information that will enable users to search for content by specifying low-level features.

CONCLUSIONS
We have discussed our approach to multimedia presentation and querying from a semantic point of view, as implemented by our DelaunayView system. Our paper describes how multimedia semantics can be used to enable access to distributed multimedia sources and to facilitate construction of coordinated views. Semantics are derived from the metadata descriptions of multimedia objects in the data layer. In the integration layer, schemata that describe the metadata are integrated into a single global schema that enables users to view a set of distributed multimedia sources as a single unified source. In the presentation layer, the system provides a framework for creating customizable integrated layouts that highlight semantic relationships between the multimedia objects. The user can retrieve multimedia data sets by issuing RQL queries or keyword searches.


The datasets thus obtained are mapped to presentation templates to create views. The position, the orientation, and the dynamic interaction of views can be interactively specified by the user. The view definition process involves the mapping of metadata attributes to the graphical attributes of a template. The view coordination process involves the association of metadata attributes from two datasets and the specification of how the corresponding views interact. By using the metadata attributes, both the view definition and the view coordination processes take advantage of the multimedia semantics.

ACKNOWLEDGMENTS
This research was supported in part by the National Science Foundation under Awards ITR-0326284 and EIA-0091489. We are grateful to Yuan Feng Huang and to Vinay Bhat for their help in implementing the system, and to Sofia Alexaki, Vassilis Christophides, and Gregory Karvounarakis from the University of Crete for providing timely technical support of the RDFSuite.

REFERENCES

Alexaki, S., Christophides, V., Karvounarakis, G., Plexousakis, D., & Tolle, K. (2000). The RDFSuite: Managing voluminous RDF description bases. Technical report, Institute of Computer Science, FORTH, Heraklion, Greece. Online at http://www.ics.forth.gr/proj/isst/RDF/RSSDB/rdfsuite.pdf

Baral, C., Gonzalez, G., & Son, T. C. (1998). Design and implementation of display specifications for multimedia answers. In Proceedings of the 14th International Conference on Data Engineering (pp. 558-565). IEEE Computer Society.

Bes, F., Jourdan, M., & Khantache, F. A. (2001). Generic architecture for automated construction of multimedia presentations. In the Eighth International Conference on Multimedia Modeling.

Bordegoni, M., Faconti, G., Feiner, S., Maybury, M., Rist, T., Ruggieri, S., et al. (1997). A standard reference model for intelligent multimedia presentation systems. Computer Standards and Interfaces, 18(6-7), 477-496.

Bray, T., Paoli, J., Sperberg-McQueen, C., & Maler, E. (2000). Extensible markup language (XML) 1.0 (second edition). W3C Recommendation, 6 October 2000. Online at http://www.w3.org/TR/2000/REC-xml-20001006

Brickley, D., & Guha, R. (2001). RDF vocabulary description language 1.0: RDF schema. W3C Recommendation, 10 February 2004. Online at http://www.w3.org/TR/2004/REC-rdf-schema-20040210

Cruz, I. F., & Huang, Y. F. (2004). A layered architecture for the exploration of heterogeneous information using coordinated views. In Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing (to appear).

Cruz, I. F., & James, K. M. (1999). User interface for distributed multimedia database querying with mediator supported refinement. In International Database Engineering and Application Symposium (pp. 433-441).


Cruz, I. F., & Leveille, P. S. (2000). Implementation of a constraint-based visualization system. In IEEE Symposium on Visual Languages (pp. 13-20).

Cruz, I. F., & Lucas, W. T. (1997). A visual approach to multimedia querying and presentation. In Proceedings of the Fifth ACM International Conference on Multimedia (pp. 109-120).

Fallside, D. (2001). XML schema part 0: Primer. W3C Recommendation, 2 May 2001. Online at http://www.w3.org/TR/2001/REC-xmlschema-0-20010502

Graf, W. H. (1995). The constraint-based layout framework LayLab and its applications. In Proceedings of the ACM Workshop on Effective Abstractions in Multimedia, Layout and Interaction, San Francisco.

Hunter, J. (2001). An overview of the MPEG-7 description definition language (DDL). IEEE Transactions on Circuits and Systems for Video Technology, 11(6), 765-772.

Karvounarakis, G., Alexaki, S., Christophides, V., Plexousakis, D., & Scholl, M. (2002). RQL: A declarative query language for RDF. In the 11th International World Wide Web Conference (WWW2002).

Klyne, G., & Carroll, J. (2004). Resource description framework (RDF): Concepts and abstract syntax. W3C Recommendation, 10 February 2004. Online at http://www.w3.org/TR/2004/REC-rdf-concepts-20040210

Martínez, J. M. (Ed.) (2003). MPEG-7 overview. ISO/IEC JTC1/SC29/WG11 N5525.

Pattison, T., & Phillips, M. (2001). View coordination architecture for information visualisation. In Australian Symposium on Information Visualisation, 9, 165-169.

Ram, A., Catrambone, R., Guzdial, M. J., Kehoe, C. M., McCrickard, D. S., & Stasko, J. T. (1999). PML: Adding flexibility to multimedia presentations. IEEE Multimedia, 6(2), 40-52.

Roth, S. F., Lucas, P., Senn, J. A., Gomberg, C. C., Burks, M. B., Stroffolino, P. J., et al. (1996). Visage: A user interface environment for exploring information. In Information Visualization, 3-12.

Salembier, P., & Smith, J. R. (2001). MPEG-7 multimedia description schemes. IEEE Transactions on Circuits and Systems for Video Technology, 11(6), 748-759.

Shih, T. K., & Davis, R. E. (1997). IMMPS: A multimedia presentation design system. IEEE Multimedia, 4(2), 67-78.

Weitzman, L., & Wittenburg, K. (1994). Automatic presentation of multimedia documents using relational grammars. In Proceedings of the Second ACM International Conference on Multimedia (pp. 443-451). ACM Press.


Section 5: Emergent Semantics


Chapter 15

Emergent Semantics: An Overview
Viranga Ratnaike, Monash University, Australia
Bala Srinivasan, Monash University, Australia
Surya Nepal, CSIRO ICT Centre, Australia

ABSTRACT

The semantic gap is recognized as one of the major problems in managing multimedia semantics. It is the gap between sensory data and semantic models. Often the sensory data and associated context compose situations which have not been anticipated by system architects. Emergence is a phenomenon that can be employed to deal with such unanticipated situations. In the past, researchers and practitioners paid little attention to applying the concepts of emergence to multimedia information retrieval. Recently, there have been attempts to use emergent semantics as a way of dealing with the semantic gap. This chapter aims to provide an overview of the field as it applies to multimedia. We begin with the concepts behind emergence, cover the requirements of emergent systems, and survey the existing body of research.

INTRODUCTION
Managing media semantics should not necessarily involve semantic descriptions or classifications of media objects for future use. Information needs, for a user, can be task dependent, with the task itself evolving and not known beforehand. In such situations, the semantics and structure will also evolve, as the user interacts with the content, based on an abstract notion of the information required for the task. That is, users can interpret multimedia content, in context, at the time of information need. One way to achieve this is through a field of study known as emergent semantics.

Emergence is the phenomenon of complex structures arising from interactions between simple units. Properties or features appear that were not previously observed as functional characteristics of the units. Though constraints on a system can influence the formation of the emergent structure, they do not directly describe it. While emergence is a new concept in multimedia, it has been used in fields such as biology, physics and economics, as well as having a rich philosophical history. To the best of our knowledge, commercial emergent systems do not currently exist. However, there is research into the various technologies that would be required. This chapter aims to outline some characteristics of emergent systems and relevant tools and techniques. The foundation for computational emergence is found in the Constrained Generating Procedures (CGP) of John Holland (Holland, 2000). Initially there are only simple units and mechanisms. These mechanisms interact to form complex mechanisms, which in turn interact to form very complex mechanisms. This interaction results in self-organization through synthesis. If we relate the concept of CGP to multimedia, the simple units are sensory data, extracted features or even multimedia objects. Participating units can also come from other sources such as knowledge bases. Semantic emergence occurs when meaningful behaviour (phenotype) or complex semantic representation (genotype) arises from the interaction of these units. This includes user interaction, the influence of context, and relationships between media. Context helps to deal with the problem of subjectivity, which occurs when there are multiple interpretations of a multimedia instance. World knowledge and context help to select one interpretation from the many. Ideally, we want to form semantic structures that can be understood by third parties who do not have access to the multimedia instance. This is not the same as relevance to the user. A system might want to determine what is of interest to one user, and have that understood by another. We note that a multimedia scene is not reality; it is merely a reference to a referent in reality. Similarly, the output from emergence is a reference, hopefully useful to the user of the information. We use the linguistic terms reference and referent to indicate existence in the modeled world and the real world, respectively. There is a danger in confusing the two (Minsky, 1988). The referenced meaning is embedded in our experience. This is similar to attribute binding using Dublin Core metadata (Hillmann, 2003), where the standard attribute name is associated with the commonly understood semantic. The principal benefit of emergence is dealing with unanticipated situations. Units in unanticipated configurations or situations will still interact with each other in simple ways. Emergent systems, ideally, take care of themselves, without needing intervention or anticipation on the part of the system architect (Staab 2002). However, the main advantage of emergent semantics is also its greatest flaw. As well as dealing with unanticipated situations, it can also produce unanticipated results. We cannot control the outcomes. They might be useful, trivial or useless, or in the worst case misleading. However, we can constrain the scope of output by constraining the inputs and the ground truths. We can also ask for multiple interpretations. Sometimes, a structure is better understood if one can appreciate the other forms it can take. 
In the next section, we state the requirements of emergent semantics. This will be followed by a description of existing research. In the last section, we identify gaps in the research, and suggest future directions.

Copyright 2005, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

Emergent Semantics: An Overview 353

EMERGENT SYSTEMS
Both complete order (regularity) and complete chaos (randomness) are very simple. Complexity occurs between the two, at a place known as the edge of chaos (Langton, 1990). Emergence results in complex systems, forming spontaneously from the interactions of many simple units. In nature, emergence is typically expressed in self-assembly, such as (micro-level) crystal formation and (macro-level) weather systems. These systems form naturally without centralized control. Similarly, emergence is useful in computer systems, when centralized control is impractical. The resources needed in these systems are primarily simple building blocks capable of interacting with each other and their environment (Holland, 2000). However, we are not interested in all possible complex systems that may form. We are interested in systems that might form useful semantic structures. We need to set up environments where the emergence is likely to result in complex semantic representation or expression (Whitesides & Grzybowski, 2003; Crutchfield, 1993; Potgeiter & Bishop, 2002). It is therefore necessary to understand the characteristics and issues involved in emergent information systems. We lead our discussion through the example of an ant colony. An ant colony is comprised, primarily, of many small units known as ants. Each ant can only do simple tasks; for example, walk, carry, lay a pheromone trail, follow a trail, and so forth. However, the colony is sophisticated enough to thoroughly explore and manage its environment. Several characteristics of emergent systems are demonstrated in the ant colony metaphor: interaction, synthesis and self-organization. The main emergent phenomenon is self-organization, expressed in specialized ants being where the colony needs them, when appropriate. These ants and others, the ant interactions, the synthesis and self-organization, compose the ant colony. See Bonabeau and Theraulaz (2000) for more details. This section describes the characteristics and practical issues of emergent systems. They constitute our requirements. These include, but are not limited to, interaction, synthesis, self-organization, knowledge representation, context and evaluation.

Characteristics of Emergent Systems


The requirements of emergent semantic systems (Figure 1) can be seen from three main perspectives: information, mechanism and formation. Information is what we ultimately want (the result of the emergence). It can also be what we need (a source of units for interaction). Some of this information is implicit in the multimedia and context. Other information is either implicit or explicit in stored knowledge. It might seem that nothing particularly useful happens at the scope of two units interacting. However, widening our field of view to take in the interaction of many units, we should see synthesis of complex semantic units, and eventually self-organization of the unit population into semantic structures. These semantic structures can then be used to address the information need and augment the knowledge used by initial interaction. Context influences the emergent structures. We need mechanisms which enable different interactions to occur depending on the context. These mechanisms also need to implicitly select, from the data, the salient units for each situation; either that or cause the interaction of those units, to have a greater effect.


Figure 1. Emergent semantic systems

Information
If humans are to evaluate the emergence, they must either observe system behaviour (phenotype) or a knowledge representation (genotype). The system representation must be translatable to terms a human can understand, or to an intermediate representation that can provide interaction. Typically, for this to be possible, the domain needs to be well known. Unanticipated events might not be translated well. Though we deal with the unanticipated, we must communicate in terms of the familiar. Emergence must be in terms of the system being interpreted. Otherwise we run the risk of infinite regression (Crutchfield, 1993). The environment, context and user should be included as part of the system. We need semantic structures, which contain the result of emergence, to be part of the system. Context will either determine which of the many interpretations are appropriate or constrain the interpretation formation. Context is taken mainly from the user or from the application domain. Spatial and temporal positioning of features can also provide context, depending on the domain. The significance of specialized information, such as

geographical position or time point, would be part of application domains such as fire fighting or astronomy. It is known in film theory as the Kuleshov effect. Reordering shots in a scene affects interpretation (Davis, Dorai, & Nack, 2003). Context supplies the system with constraints on relationships between entities. It can also affect the granularity and form of semantic output: classification, labelled multimedia objects, metadata, semantic networks, natural language description, or system behaviour. Different people will want to know different things.

Mechanism
The defining characteristic of useful emergent systems is that simple units can interact to provide complex and useful structures. Interaction1 is the notion that units2 in the system will interact to form a combined entity, which has properties that no unit has separately. The interaction is significant. Examining the units in isolation will not completely explain the properties of the whole. For two or more units to interact, some mechanism must exist which enables them to interact. The mere presence of two salient units doesn't mean that they are able to interact. Before we can reap the benefits of units interacting, we need units. These units might be implied by the data. Explicit selection of units by a central controller would not be part of an emergent process. Emergence involves implicit selection of the right units to interact. The environment should make it likely for salient units to interact. Possibly all units interact, with the salient units interacting more. In different contexts, different units will be the salient units. The context should change which units are more likely to interact, or the significance of their interaction.

Formation
Bridge laws, linking micro and macro properties, are emergent laws if they are not semantically implied by initial micro conditions and micro laws (McLaughlin, 2001). Synthesis involves a group of units composing a recognisable whole. Most systems do analysis, which involves top-down reduction and a control structure performing analysis. Synthesis is essentially the interaction mechanism seen at another level or from a different perspective. A benefit of emergence is that the system designer is freed from having to anticipate everything. Synthesis involves bottom-up emergence, which results in a complex structure. The unanticipated interaction of simple units might carry out an unanticipated and complex task. We lessen the need for a high-level control structure that tries to anticipate all possible future scenarios. Boids (Reynolds, 1987) synthesizes flocking behaviour in a population of simple units. Each unit in the flock follows simple laws, knowing only how to interact with its closest neighbours. The knowledge of forming a flock isn't stored in any unit in the flock. Unanticipated obstacles are avoided by the whole flock, which reforms if split up. Self-Organization involves a population of units which appear to determine their own collective form and processes. Self-assembly is the autonomous organization of components into patterns or structures without human intervention (Whitaker, 2003; Whitesides & Grzybowski, 2003). Similar though less complex, self-organization occurs in artificial life (Waldrop, 1992). It attempts to mimic biological systems by capturing

an abstract model of evolution. Organisms have genes, which specify simple attributes or behaviours. Populations of organisms interact to produce complex systems.
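The flocking example can be made concrete with a short simulation. The following Python sketch implements the three classic Boids rules (separation, alignment, cohesion) for a small population; the weights, radii and Manhattan distance measure are illustrative assumptions rather than Reynolds' (1987) original values, but the point carries over: no unit stores the knowledge of forming a flock, yet clustering behaviour emerges from the local interactions.

# A minimal Boids sketch: each unit follows three local rules
# (separation, alignment, cohesion); flocking is not coded anywhere
# explicitly -- it emerges from the interactions.
import random

NEIGHBOUR_RADIUS = 10.0                     # illustrative values only
SEPARATION, ALIGNMENT, COHESION = 0.05, 0.05, 0.01

class Boid:
    def __init__(self):
        self.x, self.y = random.uniform(0, 100), random.uniform(0, 100)
        self.vx, self.vy = random.uniform(-1, 1), random.uniform(-1, 1)

    def step(self, flock):
        # neighbours within a (Manhattan) radius -- purely local knowledge
        near = [b for b in flock if b is not self
                and abs(b.x - self.x) + abs(b.y - self.y) < NEIGHBOUR_RADIUS]
        if near:
            # cohesion: steer toward the local centre of mass
            cx = sum(b.x for b in near) / len(near)
            cy = sum(b.y for b in near) / len(near)
            self.vx += (cx - self.x) * COHESION
            self.vy += (cy - self.y) * COHESION
            # alignment: match the average heading of neighbours
            self.vx += (sum(b.vx for b in near) / len(near) - self.vx) * ALIGNMENT
            self.vy += (sum(b.vy for b in near) / len(near) - self.vy) * ALIGNMENT
            # separation: move away from boids that are too close
            for b in near:
                if abs(b.x - self.x) + abs(b.y - self.y) < 2.0:
                    self.vx += (self.x - b.x) * SEPARATION
                    self.vy += (self.y - b.y) * SEPARATION
        self.x += self.vx
        self.y += self.vy

flock = [Boid() for _ in range(30)]
for _ in range(100):                        # run the simulation; clusters emerge
    for b in flock:
        b.step(flock)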

ISSUES
Evaluation
The Semantic Gap is industry jargon for the gap between sensory information and the complex model in a human's mind. The same sensory information provides some of the units which participate in computational emergence. The semantic structures which are formed are the system's complex model. Since emergence is not something controlled, we cannot make sure that the system's complex model will be the same as the human's complex model. The ant colony is not controlled, though we consider it successful. If the ant colony self-organized in a different way, we might consider that structure successful as well. There may be many acceptable emergent semantic structures. We need to know whether the semantic emergence is appropriate to either the user or the task. Therefore, we need to evaluate the emergence, either through direct communication of the semantic structure or through system behaviour.

Scalability
This notion of scale is slightly different from traditional notions. We can scale with respect to domain and richness of data. Most approaches to semantics constrain the domain of knowledge such that the constraints themselves provide ground truths. If a system tries to cater for more domains, it loses some of those ground truths. Richness of data refers to the number of units and types of units available. If the amount of data is too small, we might not have enough interaction to create a meaningful structure. A larger amount of data increases the number of units and types available. The number of possible unit pairings grows quadratically with the number of units, and the number of possible larger groupings grows exponentially. A system where all units try to interact with all other units might stress the processing power of the system, while a system without all possible interactions might miss the salient ones. It is also uncertain whether increasing data richness will lead to finer granularity of semantic structure or to a lesser ability to settle on a stable structure.

Augmentation
Especially for iterative processes, it might be useful to incrementally add knowledge back to the system. The danger here is that, in order to reapply what has been learned, the system will have to recognise situations which have occurred before but with different sensory characteristics; for example, two pictures of the same situation taken from different angles.

CURRENT RESEARCH
Having described the requirements in the abstract, we now describe the tools and techniques that address them.
Information
Knowledge representation for emergence includes ontologies, metadata and genetic algorithm strings. Templates and grammars, which can communicate semantics in terms of the media, are not emergent techniques, as their semantic structure is predefined and the multimedia content anticipated. Metadata (data about data) can be used as an alternative semantic description of multimedia content. MPEG-7 has a description stream which is associated with the multimedia stream by using temporal operators; the description resides with the data. However, it is difficult to provide metadata in advance for every possible future interpretation of an event. The metadata can instead be derived from emergent semantic structures. If descriptions are needed, natural language can be derived from predicates associated with modeled concepts (Kojima, Tamura, & Fukunaga, 2002). Classically, ontology is the study of being. The computer industry uses the term to refer to fact bases, repositories of properties and relations between objects, and semantic networks (such as Princeton's WordNet). Some ontologies are used for reference, with multimedia objects or direct sensory inputs being used to index the ontology (Hoogs, 2001; Kuipers, 2000). Other ontologies attempt to capture how humans communicate their own cognitive structures. The Semantic Web attempts to use ontologies to access the semantics implicit in human communication (Maedche, 2002). Semantic networks consist of a skeleton of low-level data which can be augmented by adding semantic annotation nodes (Nack, 2002). The low-level data consists of the multimedia or ground truths, which can act as units in an emergent system. The annotation nodes can contain the results of emergence, and they are not permanent. This has the advantage of providing metadata-like properties which can also be changed for different contexts. In genetic algorithms, the knowledge representation (the genotype) lies in evolving strings. The strings can contain units and the operators that act on them (Gero & Ding, 1997). The genotypes evolve over several generations, with the successful3 genes selected to generate the next generation. Knowledge and context are acquired across generations. Context is taken from the domain, the data instances, or the user. A user's context is mainly taken from their interaction history; their personal history and current mental state are harder to measure. The user's role during context gathering can be active (direct manipulation) or passive (observation).
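As a minimal sketch of the semantic-network idea described above, the following Python fragment (the class, methods and labels are hypothetical, not taken from Nack, 2002) keeps a permanent skeleton of low-level units and attaches non-permanent annotation nodes per context, so the same unit can carry different emerged semantics in different contexts.

# Sketch: a permanent low-level skeleton whose nodes can carry
# context-dependent annotation nodes (the emerged semantics).
class SkeletonNode:
    def __init__(self, unit_id, features):
        self.unit_id = unit_id          # e.g. a shot, region, or audio segment
        self.features = features        # low-level ground truths
        self.annotations = {}           # context -> list of annotation labels

    def annotate(self, context, label):
        self.annotations.setdefault(context, []).append(label)

    def semantics(self, context):
        # Annotations are not permanent: a different context simply
        # reads (or rebuilds) a different set of labels.
        return self.annotations.get(context, [])

shot = SkeletonNode("shot_17", {"dominant_colour": "red", "motion": "high"})
shot.annotate("sports_browsing", "goal celebration")
shot.annotate("surveillance", "crowd disturbance")

print(shot.semantics("sports_browsing"))   # ['goal celebration']
print(shot.semantics("surveillance"))      # ['crowd disturbance']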

Direct Manipulation
The user can actively communicate context to the system. Santini, Gupta, and Jain (2001) ask their users to organize images in a database. They use the example of a portrait: if the portrait is in a cluster of paintings, then the semantic is "painting"; if it is in a cluster of people, the semantic is "people" or "faces". The same image can be a reference to different referents, which can be intangible ideas as well as tangible objects. CollageMachine (Kerne, 2002) is a Web browsing tool which tries to predict possible lines of user inquiry and selects multimedia components of those to display. Reorganization of those components by the user is used by the system to adjust its model.

Observation
The context of a multimedia instance is taken from past and future subjects of user attention. The entire path taken, or group formed, by a user provides an interpretation for an individual node: the whole provides the context for the part. Emergence depends on what the user thinks the data are, though the user does not need to know how they draw conclusions from observing the data. Semantics can be made to emerge by observing human and machine agent interaction (Staab, 2002). Context, at each point in the user's path, is supplied by their navigation (Grosky, Sreenath, & Fotouhi, 2002). The user's interpretation can be different from the author's intentions. The Web is considered a directed graph (nodes: Web pages; edges: links). Adjacent nodes are considered likely to have similar semantics, though attempts are made to detect points of interest change. The meaning of a Web page (and of multimedia instances in general) emerges through use and observation.
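A toy illustration of context supplied by navigation, assuming each visited node carries a bag of descriptor terms; the windowing scheme and the example terms are assumptions made for the sketch, not part of the cited systems.

# Toy sketch: interpret the current node from the user's recent path.
# Each visited node contributes its descriptor terms; terms shared by
# more of the surrounding nodes dominate the emerging interpretation.
from collections import Counter

def path_context(path, window=3):
    """path: list of (node_id, set_of_terms) in the order visited."""
    recent = path[-window:]
    counts = Counter(term for _, terms in recent for term in terms)
    return counts.most_common()

path = [
    ("page_a", {"tennis", "grand slam"}),
    ("page_b", {"tennis", "injury"}),
    ("img_042", {"court", "tennis"}),      # a multimedia instance
]
print(path_context(path))
# 'tennis' (count 3) dominates the interpretation of img_042,
# even though the image itself is just "court, tennis".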

Grouping
In order to model what humans can observe, it is often helpful to model human vision. In computer vision, grouping algorithms (based on human vision) are used to form higher-level structures from units within an image (Engbers & Smeulders, 2003). An algorithm can be an emergent technique if it can adapt dynamically to context. Context can come from sources other than users. Multiple media can be associated with data events to help in disambiguating semantics (Nakamura & Kanade, 1997). Context in genetic algorithms is sensed over many generations, if one interprets better performance in the environment as a response to context. The domain, in schemata agreement, is partly defined by the parties involved.

Mechanism
Mechanisms of automatic, implicit unit selection and interaction are yet to be developed for semantic emergence; this is a gap in the literature that will need to be filled. Current mechanisms involve the user as a unit. The semantics emerge through the interaction of the user's own context with multimedia components (Santini & Jain, 1999; Kerne, 2002). The user decides which things interact, either actively or passively. In genetic algorithms, fitness functions decide how gene strings evolve (Gero & Ding, 1997). Genetic algorithms can be used to lessen the implicit selection problem by reducing the search spaces of how units interact, which units interact, and which things are considered units, as sketched after this paragraph. A similar situation arises with the evaluation of emerged semantics: the current thinking is that humans are needed to evaluate accuracy or reasonableness. In genetic algorithms, the representation (genotype) can be evaluated indirectly by testing the phenotype (its expression). Simply having all the necessary sensory (and other) information present will not necessarily result in interaction occurring. Information from an ontology could be used in decision making, or in suggesting other units for interaction. Explicitly identifying units for interaction might be a practical, nonemergent step. Units can be feature patterns rather than individual features (Fan, Gao, Luo, & Hacid, 2003). Templates can be used to search for units suggested by the ontology. Well-known video structures can be used to locate salient units within video sequences (Russell, 2000; Dorai & Venkatesh, 2001; Venkatesh & Dorai, 2001). Data can provide context by affecting the perception or emotions of the
observer. Emotions can be referents; low-level units in the multimedia instance, such as tempo and colour, act as symbols which reference them.
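A minimal sketch of the genetic-algorithm idea mentioned above: the genotype is a binary mask over candidate units, and the fitness function stands in for a user- or task-specific evaluation of the resulting structure. The "salient" set, rates and population sizes are invented for illustration.

# Sketch: a genetic algorithm evolving which candidate units are
# allowed to interact.  The genotype is a binary selection mask; the
# fitness function is a stand-in for evaluating the phenotype.
import random

N_UNITS, POP, GENERATIONS = 12, 20, 50
TARGET = {1, 3, 4, 8}            # assume these are the truly salient units

def fitness(mask):
    chosen = {i for i, bit in enumerate(mask) if bit}
    # reward overlap with the salient set, penalise clutter
    return len(chosen & TARGET) - 0.3 * len(chosen - TARGET)

def crossover(a, b):
    cut = random.randrange(1, N_UNITS)
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.05):
    return [bit ^ (random.random() < rate) for bit in mask]

population = [[random.randint(0, 1) for _ in range(N_UNITS)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[:POP // 2]          # select the fittest genotypes
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
print([i for i, bit in enumerate(best) if bit])   # should approximate the salient set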

Formation
In genetic algorithms, synthesis occurs between generations; the genotype is self-organizing. With direct manipulation and user observation, synthesis and organization come in the form of users putting things together. In schemata agreement, region synthesis leads to self-organization. It is designed to be adaptable to unfamiliar schemata, and agreement can be used to capture relationships later. Emergence occurs as pairs of nodes in decentralised P2P (peer-to-peer) systems attempt to form global semantic agreements by mapping their respective schemata (Aberer, Cudre-Mauroux, & Hauswirth, 2003). Regions of similar property emerge as nodes in the network are connected pairwise, and as other nodes link to them (Langley, 2001). Unfortunately, this work does not deal with multimedia. There has been recent interest in combining the areas of multimedia, data mining and knowledge discovery; however, the semantics here are not emergent. There is also data mining research into multimedia using Self-Organizing Maps (SOM), but this is not concerned with semantics (Petrushin, Kao, & Khan, 2003; Simoff & Zaiane, 2000).

CONCLUSION
A major difficulty in researching the concept of emergent semantics in multimedia is that there are no complete systems integrating the various techniques. While there is work in knowledge representation, with respect to both semantics and multimedia, to the best of our knowledge there's very little on interaction, synthesis and self-organization. There is the work on schemata agreement (nonmultimedia) and some work on Self-Organizing Maps (nonsemantic), but nothing combining them. The little that has been done involves users to provide context and genetic algorithms to reduce problem spaces. One of the gaps to be filled is developing interaction mechanisms which enable possibly unanticipated data to interact with each other and with their environment. Even if we can trust the process, we are still dependent on its inputs: the simple units that interact. The set of units needs to be sufficiently rich to enable acceptable emergence. Ideally, salient features (even patterns) should naturally select themselves during emergence, though this may require participation of all units, placing a high computational load on the system. Part of the problem for emergence techniques is that the simple interactions must occur in parallel, and in numbers great enough to realise self-organization. The future will probably bring more miniaturized systems, capable of true parallelism in quantum computers. A cubic millimetre of the brain holds the equivalent of 4 km of axonal wiring (Koch, 2001). Perhaps greater parallelism will permit interaction of all available units.

There is motivation for research into nonverbal computing, where the users are illiterate (Jain, 2003). Without the ability to issue and access abstract concepts, the concepts must be inferred. Experiential computing (Jain, 2003; Sridharan, Sundaram, & Rikasis, 2003) allows users to interact with the system environment without having to build a mental model of the environment. They seek a symbiosis formed from human and machine, taking advantage of their respective strengths. These systems are insight facilitators: they help us make sense of our own context by engaging our senses directly, as opposed to being confronted by an abstract description. Experiential computing, while in its infancy now, might in the future enable implicit relevance feedback. The user's interactions with the system could cause both emergence and verification of semantics.

REFERENCES

Aberer, K., Cudre-Mauroux, P., & Hauswirth, M. (2003). The chatty Web: Emergent semantics through gossiping. Paper presented at WWW2003, Budapest, Hungary.
Bonabeau, E., & Theraulaz, G. (2000). Swarm smarts. Scientific American, 282(3), 54-61.
Crutchfield, J. P. (1993). The calculi of emergence. Paper presented at Complex Systems: From Complex Dynamics to Artificial Reality, Numazu, Japan.
Davis, M., Dorai, C., & Nack, F. (2003). Understanding media semantics. Berkeley, CA: ACM Multimedia 2003 Tutorial.
Dorai, C., & Venkatesh, S. (2001, September 10-12). Bridging the semantic gap in content management systems: Computational media aesthetics. Paper presented at COSIGN 2001: Computational Semiotics (pp. 94-99), CWI, Amsterdam.
Engbers, E. A., & Smeulders, A. W. M. (2003). Design considerations for generic grouping in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(4), 445-457.
Fan, J., Gao, Y., Luo, H., & Hacid, M.-S. (2003). A novel framework for semantic image classification and benchmark. Paper presented at ACM SIGKDD, Washington, DC.
Gero, J. S., & Ding, L. (1997). Learning emergent style using an evolutionary approach. In B. Varma & X. Yao (Eds.), paper presented at ICCIMA (pp. 171-175), Gold Coast, Australia.
Grosky, W. I., Sreenath, D. V., & Fotouhi, F. (2002). Emergent semantics and the multimedia semantic Web. SIGMOD Record, 31(4), 54-58.
Hillmann, D. (2003). Using Dublin Core. Retrieved February 16, 2004, from http://dublincore.org/documents/usageguide/
Holland, J. H. (2000). Emergence: From chaos to order (1st ed.). Oxford: Oxford University Press.
Hoogs. (2001, October 10-12). Multi-modal fusion for video understanding. Paper presented at the 30th Applied Imagery Pattern Recognition Workshop (pp. 103-108), Washington, DC.
Jain, R. (2003). Folk computing. Communications of the ACM, 46(3), 27-29.
Kerne, A. (2002). Concept-context-design: A creative model for the development of interactivity. Paper presented at Creativity and Cognition 4 (pp. 92-122), Loughborough, UK.
Koch, C. (2001). Computing in single neurons. In R. A. Wilson & F. C. Keil (Eds.), The MIT Encyclopedia of the Cognitive Sciences (pp. 174-176). Cambridge, MA: MIT Press.

Kojima, A., Tamura, T., & Fukunaga, K. (2002). Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 50(2), 171-184.
Kuipers, B. J. (2000). The spatial semantic hierarchy. Artificial Intelligence, 119, 191-233.
Langley, A. (2001). Freenet. In A. Oram (Ed.), Peer-to-peer: Harnessing the benefits of a disruptive technology (pp. 123-132). Sebastopol, CA: O'Reilly.
Langton, C. (1990). Computation at the edge of chaos: Phase transitions and emergent computation. Physica D, 42(1-3), 12-37.
Maedche, A. (2002). Emergent semantics for ontologies. IEEE Intelligent Systems, 17(1), 85-86.
McLaughlin, B. P. (2001). Emergentism. In R. A. Wilson & F. C. Keil (Eds.), The MIT Encyclopedia of the Cognitive Sciences (pp. 267-269). Cambridge, MA: MIT Press.
Minsky, M. L. (1988). The society of mind (1st ed.). New York: Touchstone (Simon & Schuster).
Nack, F. (2002). The future of media computing. In S. Venkatesh & C. Dorai (Eds.), Media computing (pp. 159-196). Boston: Kluwer.
Nakamura, Y., & Kanade, T. (1997, November). Spotting by association in news video. Paper presented at the Fifth ACM International Multimedia Conference (pp. 393-401), Seattle, Washington.
OED. (2003). Oxford English Dictionary. Retrieved February 2004, from http://dictionary.oed.com/entrance.dtl
Petrushin, V. A., Kao, A., & Khan, L. (2003). The Fourth International Workshop on Multimedia Data Mining, MDM/KDD 2003. Vol. 6(1), pp. 106-108.
Potgeiter, A., & Bishop, J. (2002). Complex adaptive systems, emergence and engineering: The basics. Retrieved February 20, 2004, from http://people.cs.uct.ac.za/~yng/Emergence.pdf
Reynolds, C. (1987). Flocks, herds, and schools: A distributed behavioral model. Computer Graphics, 21(4), 25-34.
Russell, D. (2000). A design pattern-based video summarization technique. Paper presented at the 33rd Hawaii International Conference on System Sciences (p. 3048).
Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 13(3), 337-351.
Santini, S., & Jain, R. (1999, January). Interfaces for emergent semantics in multimedia databases. Paper presented at SPIE, San Jose, California.
Simoff, S. J., & Zaiane, O. R. (2000). Report on MDM/KDD2000: The First International Workshop on Multimedia Data Mining. SIGKDD Explorations, 2(2), 103-105.
Sridharan, H., Sundaram, H., & Rikasis, T. (2003, November 7). Computational models for experiences in the arts, and multimedia. Paper presented at ACM Multimedia 2003, First ACM Workshop on Experiential Telepresence, Berkeley, CA.
Staab, S. (2002). Emergent semantics. IEEE Intelligent Systems, 17(1), 78-79.
Venkatesh, S., & Dorai, C. (2001). Computational media aesthetics: Finding meaning beautiful. IEEE Multimedia, 10-12.
Waldrop, M. M. (1992). Life at the edge of chaos. In Complexity (pp. 198-240). New York: Touchstone (Simon & Schuster).
Whitaker, R. (2003). Self-organization, autopoiesis and enterprises. Retrieved February 3, 2005, from http://www.acm.org/sigois/auto/Main.html
Whitesides, G. M., & Grzybowski, B. (2003). Self-assembly at all scales. Science, 295, 2418-2421.

ENDNOTES
1. The action or influence of persons or things on each other (OED, 2003).
2. The user can also be considered a unit.
3. According to a fitness function, which measures the phenotype (the string's expression or behaviour).


Chapter 16

Emergent Semantics from Media Blending


Edward Altman, Institute for Infocomm Research, Singapore
Lonce Wyse, Institute for Infocomm Research, Singapore

ABSTRACT

The computation of emergent semantics for blending media into creative compositions is based on the idea that meaning is endowed upon the media in the context of other media and through interaction with the user. The interactive composition of digital content in modern production environments remains a challenging problem, since much critical semantic information resides implicitly within the media, the relationships between media models, and the aesthetic goals of the creative artist. The composition of heterogeneous media types depends upon the formulation of integrative structures for the discovery and management of semantics. These semantics emerge through the application of generic blending operators and a domain ontology of pre-existing media assets and synthesis models. In this chapter, we will show the generation of emergent semantics from blending networks in the domains of audio generation from synthesis models, automated home video editing, and information mining from multimedia presentations.

INTRODUCTION
Today, there exists a plethora of pre-existing digital media content, synthesis models, and authored productions that are available for the creation of new media productions for games, presentations, reports, illustrated manuals, and instructional
materials for distance education. Technologies from sophisticated authoring environments for nonlinear video editing, audio synthesis, and information management systems are increasingly finding their way into a new class of easy-to-use, partially automated authoring tools. This trend in media production is expanding the life cycle of digital media from content-centric authoring, storage, and distribution to include user-centric semantics for performing stylized compositions, information mining, and the reuse of the content in ways not envisioned at the time of the original media creation. The automation of digital media production at a semantic level remains a challenging problem since much critical information resides implicitly within the media, the relationships between media, and the aesthetic goals of the creative artist. A key problem in modern production environments is therefore the discovery and management of media semantics that emerges from the structured blending of pre-existing media assets. This chapter introduces a model-based framework for media blending that supports the creative composition of media elements from pre-existing resources. The vast quantity of pre-existing media from CDs, the Internet, and local recordings that is currently available has motivated recent research into automation technologies for digital media (Davis, 1995; Funkhouser et al., 2004; Kovar & Gleicher, 2003). Traditional authoring tools require extensive training before the user becomes proficient, and composing even relatively simple productions normally consumes enormous amounts of time, even for skilled professionals. This contrasts with the needs of the non-professional media author, who would prefer high-level insights into how media elements can be transformed to create the target production, as well as tools to automate the composition from semantically meaningful models. Such creative insights arise from the ability to flexibly manipulate information and discover new relationships relative to a given task. However, current methods of information retrieval and content production do not adequately support exploration and discovery in mixed media (Santini, Gupta, & Jain, 2001). A key problem for media production environments is that the task semantics for content repurposing depends upon both the media types and the context of the current task. In this chapter we claim that many semantics-based operations, including summarization, retrieval, composition, and synchronization, can be represented as a more general operation called media blending. Blending is an operation that occurs across two or more media elements to yield a new structure called the blend. The blend is formed by inheriting partial semantics from the input media and generating an emergent structure containing information from the current task and the source media. Thus the semantics of the blend emerges from interactions among the media descriptions, the task to be performed, and the creative input of the user. Automated support for managing the semantics of media content would be beneficial for diverse applications, such as video editing (Davis, 1995; Kellock & Altman, 2000), sound synthesis (Rolland & Pachet, 1995), and mining information from presentations (Dorai, Kermani, & Stewart, 2001). A common characteristic among these domains, which will be emphasized in this chapter, is the need to manage multiple media sources at the semantic level.
For sound production, there is a rich set of semantics associated with sound effects collections and audio synthesis models, which typically come with semantically labeled control parameters. In the case of automatic home video editing, the control logic is informed by the relationships between music structure and video cuts, as described in film theory, to yield a production with a particular composition style (Sharff, 1982). In the case of presentation mining from e-learning content, there is an association between pedagogical structures within a lecture video and other content resources, such as textbooks and slide presentations, that can be used to inform a search engine when responding to a student's query. In each case, the user-centric production of media involves the dynamic blending of information from different media. In this chapter, we will show the construction of blending networks for user-centric media processing in the domains of audio generation from sound synthesis models, automated home video editing, and presentation mining.

BACKGROUND
Models are fundamental for the construction of blending networks (Veale & O'Donoghue, 2000). Blending networks have their origins in frame-based reasoning systems and have recently been applied in cognitive linguistics to link discourse analysis with fundamental structures of cognition. According to Conceptual Integration Theory from cognitive linguistics, thought and language depend upon our capabilities to manipulate webs of mappings between mental spaces (Fauconnier, 1997). These mental space mappings form the basis for the understanding of metaphors and other forms of discourse as conceptual blends. Similarly, the experiential qualities of media constitute a form of discourse that can only be understood through the creation of deep models of media (Staab, Maedche, Nack, Santini, & Steels, 2002). Prior work on conceptual blending provides a theoretical framework for the extension of blending theory to digital media; consequently, audio perception, video appreciation, and information mining may be viewed as forms of media discourse. Accordingly, the claim in this chapter is that principles of conceptual blending derived from the analysis of language usage may also be applied to the processing of media. In the remainder of this section we describe three scenarios for media blending, then review the literature on conceptual blending and metaphor. The following sections relate these structures to concrete examples of media blending.

Audio Models
Intricate relations between audio perception and cognition associated with sound production techniques pose interesting challenges regarding the semantics that emerge from combinations of constituent elements. The semantics of sound tend to be more flexible than the semantics associated with graphics, and have a more tenuous relationship to the world of objects and events. For example, the sound of a crunching watermelon can be indicative of a cool, refreshing indulgence on a summer day, or it can add juicy impact to a punch, for which the sound is infamously used in film production. The art of sound effects production depends heavily on the combination, reuse, and recontextualization of libraries of prerecorded material. Labeling sounds in a database in a way that supports reuse in flexible semantic contexts is a challenge. A current trend in audio is the move toward structured representations (Rolland & Pachet, 1995) that we will call models. A sound model is a parameterized algorithm for generating a class of sounds, as shown schematically in Figure 1.

Figure 1. A sound model includes a synthesis algorithm capable of generating a specific range of sounds, and parameters that determine the behaviour of the model
[Schematic: control parameters are mapped onto synthesis parameters, which drive the synthesizer algorithm to produce the audio signal]

Models are useful in media production partly because of their low memory/bandwidth requirements, since it takes much less memory to parameterize a synthesis model than it does to code the raw audio data. Also, models meet the requirement from interactive media, such as games, that audio be generated in real time in response to unpredictable events in an interactive environment. The sound model design process involves building an algorithm from component signal generators and transformers (modulators, filters, etc.) that are patched together in a signal flow network that generates audio at the output. Models are designed to meet specifications on (a) the class of sounds the model needs to cover and (b) the method of controlling the model through parameters that are exposed to the user. Models are associated with semantics in a way that general-purpose synthesizers are not, because they are specialized to create a much narrower range of sounds. They also take on semantics by virtue of their real-time interactivity: they have responsive behaviors in a way that recorded sounds do not. As media objects, models present interesting opportunities and challenges for effective exploitation in graphical, audio, or mixed media. A database of sound models is different from a database of recorded sounds in that the accessible sounds in the database are (i) not actually present, but potential, and (ii) infinite in variability due to the dynamic parameterization that recorded sounds do not afford. Model building is a labor-intensive job for experts, so exploiting a database of pre-existing sound models potentially has tremendous value. Another trend in audio, as well as other forms of digital media, is to attempt to automatically extract semantics from raw media data. The utility of being able to identify a baby crying or a window breaking in an audio stream should be self-apparent, as should the difficulty of the task. Typically, audio analysis is based on adaptive association between low-level signal features (such as spectral centroid, basis vectors, zero crossings, pitch, and noise measures) and labels provided by a supervisor, or based on an association with data from another media stream such as video. The difficulty
lies in the fact that there is no such thing as "the" semantics, and any semantics there may be are dependent upon contexts both inside and outside of the media itself. The human-in-the-loop and the intermediate representations between physical attributes and deep semantics that models offer can be effective bridges across this gap.
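As a concrete, if simplistic, illustration of the sound-model idea in Figure 1, the following Python sketch maps semantic controls onto synthesis parameters that drive a trivial generator. The particular mappings (weight to amplitude and decay, pace to step interval) are assumptions made for the sketch, not a production footsteps model.

# A minimal, illustrative sound model: semantic control parameters are
# mapped to synthesis parameters, which drive a simple generator
# (each footstep is an exponentially decaying noise burst).
import math
import random

SAMPLE_RATE = 16000

def map_controls(weight, pace):
    """Map semantic controls (0..1) onto low-level synthesis parameters."""
    return {
        "amplitude": 0.2 + 0.8 * weight,      # heavier walker -> louder step
        "decay": 0.002 + 0.01 * weight,       # heavier walker -> duller thud
        "interval": 0.9 - 0.6 * pace,         # faster pace -> shorter gap (seconds)
    }

def synthesize_step(params, duration=0.15):
    """One footstep: decaying noise burst of `duration` seconds."""
    n = int(duration * SAMPLE_RATE)
    return [params["amplitude"] * math.exp(-i * params["decay"])
            * random.uniform(-1.0, 1.0) for i in range(n)]

def footsteps(weight, pace, n_steps=4):
    params = map_controls(weight, pace)
    silence = [0.0] * int(params["interval"] * SAMPLE_RATE)
    signal = []
    for _ in range(n_steps):
        signal += synthesize_step(params) + silence
    return signal

heavy_slow = footsteps(weight=0.9, pace=0.2)   # "a heavy person walking slowly"
light_fast = footsteps(weight=0.2, pace=0.9)   # same model, different semantics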

Video Editing
The blending of two or more media that combines perceptual aspects from each media to create a new effect is a common technique in film production. In film editing, the visual presentation of the scene tells the story from the characters' point of view. Music is added to convey information about the characters' emotional state, such as fear, excitement, calm, and joy. Thus, when the selection of video cut points along key visual events is synchronized with associated features in the music, the audience experiences the blended media according to the emergent semantics of the cinematic edit (Sharff, 1982). The non-professional media author of a home video may know what style of editing they prefer, but lack the detailed knowledge, or time, to perform the editing operations. Similarly, they may know what music selections to add to the edited video, but lack the tools and insight to match the beat, tempo, and other features from the music with suitable events in the video content. The challenge for semi-automated video editing tools is to combine the stylistic editing logic with metadata descriptions of the selected music and video, then opportunistically blend the source media to create the final product (Kellock & Altman, 2000; Davis, 1995).
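A minimal sketch of the kind of blending such tools perform: candidate cut points from the video are snapped to the nearest salient beat in the music. Beat times, cut candidates and the tolerance value are assumed inputs (for example, from a beat tracker and a shot-boundary detector), not part of any cited system.

# Sketch: align candidate video cut points to salient music beats.
def snap_cuts_to_beats(candidate_cuts, beat_times, tolerance=0.25):
    """Return an edit list of cut times moved onto the nearest beat."""
    edits = []
    for cut in candidate_cuts:
        nearest = min(beat_times, key=lambda b: abs(b - cut))
        if abs(nearest - cut) <= tolerance:
            edits.append(nearest)       # cut lands exactly on the beat
        else:
            edits.append(cut)           # no beat close enough; keep the cut
    return sorted(set(edits))

beats = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]   # beat times in seconds
cuts = [0.42, 1.62, 2.9]                       # shot boundaries in the video
print(snap_cuts_to_beats(cuts, beats))         # [0.5, 1.5, 3.0]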

Presentation Mining
The utilization of media semantics is important not only for audio synthesis and video editing, but also for information-intensive tasks, such as composing and subsequently mining multimedia presentations. There is a rapidly growing number of corporate media archives, multimedia presentations, and modularized distance learning courseware which contain valuable information that remains inaccessible outside the original production context. For instance, a common technique for authoring modular courseware is to produce a series of short, self-contained multimedia presentations for topics in the syllabus, then customize the composition of these elements for the target audience (Thompson Learning, n.d.; WebCT, n.d.). The control logic for the sequencing and navigation through the course content is specified through the use of description languages. However, this normally does not include a semantic description of pedagogical events, domain models, or dependencies among media resources that would aid in the exploration of the media by the user. Once the course is constructed, it becomes very difficult to modify or adapt the content to new contexts. Recorded corporate presentations and distance learning lectures are notoriously difficult to search for information or to reuse in a different context. This difficulty arises from the fact that the semantics of the presentation is fixed at the time of production. The media blending framework is designed to support the discovery and generation of emergent semantics through the use of ontologies for modeling domain information, composition logic, and media descriptions.

MEDIA BLENDING FRAMEWORK


The key issue of this chapter is to empower the media producer to more easily create complex media assets by leveraging control over emergent semantics derived from media blends. Current media production techniques involve human interaction with collections of media libraries and the use of specialized processing tools, but they do not yet provide support for utilizing semantic information in creative compositions. The development of standards, such as MPEG-7, AAF, and SMIL, facilitates the description, shared editing, and structured presentation of media elements (Nack & Hardman, 2002). The combination of description languages and rendering engines for sound (C-Sound, MAX) and video (DirectShow, QuickTime) provides powerful tools for composing and rendering media after it is produced. Recent efforts toward automated media production (Davis, 1995; Kellock & Altman, 2000) begin to demonstrate the power of model-based tools for authoring creative compositions. These approaches depend upon proprietary methods and tend to work only in specialized contexts. The objective of media blending is to unify these disparate approaches under a common framework that results in more efficient methods for managing media semantics. In this section we motivate the need for a media blending framework by citing current limitations in audio design methods, then provide an illustrative example of an audio blending network. This section concludes with a concise formulation of media blending.

Sound Semantics in Audio Production


Production houses typically have hundreds of CDs of sound effect material stored in databases and access the audio by searching an index of semantic labels. A fragment of a sound effects database, shown in Table 1, illustrates that sounds typically acquire their semantics from the context in which they were recorded. One thing that makes it difficult to repurpose sounds from a database labeled this way is that sounds within a category can sound very different, and sounds in different categories can sound very similar. That is, the sounds in production libraries are generally not classified by clusters of acoustic features.

Table 1. Semantic descriptions in a database of common sound effects

Instead, there are several classes of semantic descriptors typically used for sound:
• Sources as descriptors. Dog barking, tires screeching, gun shot. The benefit of using sources as semantic descriptors is that the descriptions come from lay language that everybody speaks. Sources are very succinct descriptions and come with a rich set of relationships to other objects that we know about. A drawback to sources as descriptors is that some sounds have no possible, or at least obvious, physical cause (e.g., the sound of an engine changing in size). Even if a physical source is responsible for a sound, it may be impossible to identify. Similarly, any given sound may have many unrelated possible sources. Finally, a given source can have acoustically unrelated sounds associated with it; for example, a train generates whistles, steam, rolling, and horn sounds.
• Actions and events as descriptors. Dog barking, tires screeching, gun shot. Russolo's early musical noise machines, or intonarumori, had onomatopoetic names allied with actions, including howler, roarer, crackler, rubber, hummer, gurgler, hisser, whistler, burster, croaker, and rustler (Russolo, 1916).
The benefit of actions and events as descriptors is that they can often be assigned even when source identification is impossible (a screech is descriptive whether the sound is from tires or a child). Actions and events are also familiar to a layperson for describing sounds (scraping, falling, pounding, screaming, sliding, rolling, coughing, clicking). A drawback is that in some cases it may be difficult or impossible for sounds to be described this way. Unrelated sounds can also have the same description in terms of actions and events. Finally, the description can be quite subjective: one person's gust is another's blow.
• Source attributes as sound descriptors. Big dog, metal floor, hollow wood. Such descriptions are often easier to obtain than source identification and are still useful even when source identification is impossible. These attributes are often scalar, which makes them quantitative and easier to deal with for a computer. The drawbacks are that it may be difficult to assign attributes for some sounds, many sounds may have the same attributes, and the assignment can be quite subjective.

Sounds may also belong together simply because they frequently co-occur in the environment or in man-made media. A "beach sounds" class could include crashing waves, shouting people, and dogs barking. Loose categories such as indoor and outdoor are often useful, especially in media production; a recording of a dog barking indoors would be useless for an outdoor scene. When producers have the luxury of a high budget and are creating their own sound effects, sounds are typically constructed from a combination of recorded material, synthetic material, and manipulation. Typical manipulation techniques include time reversal, digital effects such as filtering, delay, and pitch shifting, and overlaying many different tracks to create an audio composite. To achieve the desired psychological impact for an event with sound, it is often the case that recordings of the actual sounds generated by the real events are useless. For example, the sounds of a real punch or a real gun firing are entirely inadequate for creating the impression of a punch or a gun shot in cinema. The sounds that one might want to use as starting material to construct the effects come from unrelated material with possibly unrelated semantic labels in the stored database (Mott, 1990).
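The descriptor classes above can be pictured as fields of a database record. The following Python sketch is illustrative only; the record layout, file names and query function are assumptions, not the schema of any commercial sound-effects library.

# Sketch: a sound-effects record carrying the descriptor classes
# discussed above (source, action/event, source attributes, and
# loose environment categories), plus a trivial query.
from dataclasses import dataclass, field

@dataclass
class SoundEffect:
    file: str
    sources: list = field(default_factory=list)      # e.g. "dog", "tires"
    actions: list = field(default_factory=list)      # e.g. "barking", "screeching"
    attributes: dict = field(default_factory=dict)   # e.g. {"size": "big"}
    environment: list = field(default_factory=list)  # e.g. "outdoor", "beach"

library = [
    SoundEffect("fx_0481.wav", sources=["dog"], actions=["barking"],
                attributes={"size": "big"}, environment=["outdoor"]),
    SoundEffect("fx_0722.wav", sources=["tires"], actions=["screeching"],
                environment=["street"]),
]

def find(library, action=None, environment=None):
    return [fx.file for fx in library
            if (action is None or action in fx.actions)
            and (environment is None or environment in fx.environment)]

print(find(library, action="barking", environment="outdoor"))   # ['fx_0481.wav']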

Semantic labels tend to commit a sound in a database to a certain usage unless the database users know how to work around the labels to suit their new media context. On the other hand, low-level physical signal attributes are not very helpful at providing human-usable knowledge about a sound, either. In the 1950s, Pierre Schaeffer made a valiant attempt at coming up with a set of generic, source-independent sound descriptors. Rough English translations of the descriptors include mass, dynamics, timbre, melodic profile, mass profile, grain, and pace. He hoped that any sound could be described by a set of values for these descriptors. He was never satisfied with the results of his taxonomical attempts. More recently, Dennis Smalley has developed his theory of Spectromorphology (Smalley, 1997) with terms that are more directly related to aural perception: onsets (departure, emergence, anacrusis, attack, upbeat, downbeat), continuants (passage, transition, prolongation, maintenance, statement), terminations (arrival, disappearance, closure, release, resolution), motions (push/drag, flow, rise, throw/fling, drift, float, fly), and growth (unidirectional, such as ascent, planar, and descent, and reciprocal, such as parabola, oscillation, and undulation). This terminology has had some success in the analysis of electroacoustic works of music, which are notoriously difficult to analyze due to the unlimited sonic domain from which they draw their material and because of the lack of definitive reference to extra-sonic semantics. In fact, the clearest limitation of spectromorphology is its inability to address the referential dimension of much contemporary music.

The Blending Network


Sound models are endowed with semantics in their parameterization since the models are built for a specific media context to cover a certain range of sounds and manipulate perceptually meaningful features under parametric control. Thus, models are given names such as footsteps, and parameters are given names such as walker's weight, limp, and perhaps a variety of characteristics concerning the walking surface. If one knew the parameter values used for a particular footstep model to create a certain sound, then one could interpret the semantics from the parameter names and their values as, for example, a heavy person walking quickly over a wet gravel surface. A single model could be associated with a wide range of different semantics depending upon the assumed values of the parameters. In the context of audio production, models have one clear advantage over a library of recorded sounds in that they permit their semantics to be manipulated. Rarely does a sound producer turn to a library of sounds and use the sound unadulterated in a new media production. Recordings lend themselves to standard audio manipulation techniques: they can be layered, time-altered, put through standard effects processors such as compressors and limiters, phateners, harmonizers, etc. However, models, by design, give a sound designer handles on the sound semantics, at least for the semantics they were designed to capture. With only a recording, and no generative model, it would be difficult, for example, to change a walk into a run, because in addition to the interval between the foot impacts there are a myriad of other differences in the sound due to heel-toe timing and the interaction between the foot and the surface characteristics. In a good footsteps model, the changes would all be a coordinated function of parameters controlling speed and style.
The footsteps model is used in the following example to illustrate the central principles of media blending networks. Consider an audio designer who has been given the task of creating the sound of two people passing on a stairway. In the model library there are separate models for a person going up the stairs and for a person going down the stairs, but there is no model for two people passing. The key moment for the audio designer is the event when the two people meet on the stairway so that both complete a step at the same time. This synchronization is not a part of either input model, but it has a semantic meaning that is crucial for the overall event. The illustrations of blending networks use diagrams to represent the models and relationships. In these diagrams, models are represented by circles, parameters by points in the circles, and connections between parameters by lines. Each model may be realized as a complex software object that can be modified at the time the blending network is constructed. Thus the sound designer would use a high-level description language to specify the configuration of the models and their connections. The configuration description is then compiled into the blending network, which can then be run to produce the desired sound. The Footstep network contains two input models corresponding to the audio model for walking up the stairs and the model for walking down the stairs. Each model in Figure 2 is distinct; however, they have semantically similar parameters. The starting time for climbing the stairs is t1, the starting time for descending the stairs is t2, the person going up is p1, and the person going down is p2. The two audio models have parameters that are semantically labeled. The cross-model mapping that connects corresponding parameters in the input models is illustrated by dashed lines in Figure 3. In addition to the starting times, ti, and the persons, pi, that are specified explicitly in the input models, connections are established between other similar pairs of parameters, such as walking speed, si, and location, li. The two input models inherit information from an abstract model for walking that includes percussion sounds, walking styles, and material surfaces. This forms a generic model that expresses the common features associated with the two inputs. The common features may be simple parameters, such as start time, person, speed, and location, as in Figure 4. More generally, the generic model may be used to specify the components and relationships in more complex models as a domain ontology, as we shall see later. The blending framework in Figure 5 contains a fourth model, which is typically called the blend. The two stair components in the input models are mapped onto a single set of stairs in the blend. The local times, t1 and t2, are mapped onto a common time t in the blend. However, the two people and their locations are mapped according to the local time of the blend. Therefore, the first input model represents the audio produced while going up the stairs, whereas the second model represents the audio produced while going down. The projection from these input models onto the blend preserves time and location. The Footstep network exhibits, in the blend model, various emergent structures that are not present in the inputs. This emergent structure is derived from several mechanisms available through the dynamic construction of the network.
For example, the composition of elements from the inputs causes relations to become available in the blend that do not exist in either of the inputs. According to this particular construction, the blend contains two moving individuals instead of the single individual in each of the inputs.

Figure 2. Input models for the Footstep blending network

[Diagram: Input 1 with parameters t1 and p1; Input 2 with parameters t2 and p2]

Figure 3. Cross-model mapping between the input footstep models

[Diagram: dashed lines connect the corresponding parameters (p, t, l, s) of Input 1 and Input 2]

Figure 4. Inclusion of the generic model for the Footstep input models
[Diagram: a generic space with parameters p and t mapped onto Input 1 (p1, t1) and Input 2 (p2, t2)]

The individuals are moving in opposite directions, starting from opposite ends of the stairs, and their positions and relative temporal patterns can be compared at any time that they are on the stairs. At this point the construction of the blending network is complete and constitutes a meta-model for the two people walking in opposite directions on the same stairs. Since this is a generative model, we can now run the scenario dynamically. In the blend there is new structure: there is no encounter in either of the input models, but the blend contains the synchronized stepping of the two individuals. The input models continue to exist in their original form; therefore, information about time, location, and walking speed in the blend space can be projected back to the input models for evaluation there. This final configuration, with projection of the blend model back to the input models, is illustrated in Figure 5.
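The following Python sketch caricatures the Footstep network: two input models share the generic parameters (person, start time, step interval), the blend projects both onto a common timeline, and running the blend exposes the emergent event, the synchronised steps, which exists in neither input. Fixed step intervals and the tolerance value are simplifying assumptions.

# Sketch of the Footstep blending network: the blend maps both input
# models onto a common time axis and reports steps that coincide.
class FootstepModel:
    def __init__(self, person, start_time, step_interval, direction):
        self.person, self.direction = person, direction
        self.start_time, self.step_interval = start_time, step_interval

    def step_times(self, n_steps):
        return [self.start_time + i * self.step_interval for i in range(n_steps)]

def run_blend(up_model, down_model, n_steps=12, tolerance=0.05):
    """Project both inputs onto a common timeline (the blend) and return
    the emergent structure: pairs of steps that land at the same moment."""
    up = up_model.step_times(n_steps)
    down = down_model.step_times(n_steps)
    return [(t1, t2) for t1 in up for t2 in down if abs(t1 - t2) < tolerance]

going_up = FootstepModel("p1", start_time=0.0, step_interval=0.60, direction="up")
going_down = FootstepModel("p2", start_time=0.3, step_interval=0.45, direction="down")
print(run_blend(going_up, going_down))   # synchronised steps near t = 1.2, 3.0, 4.8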

Blending Theory
Blending is an operation that occurs across two or more input spaces to yield a new space, the blend.

Figure 5. Space mapping for the blending model in the integrated footstep network

[Diagram: the generic space (p, t) maps onto Input 1 (p1, t1) and Input 2 (p2, t2), and both inputs project into the blend space (p'1, p'2, t')]

The blend is formed by inheriting partial structure from the input spaces and generating an emergent structure containing information not explicitly present in either input. A blend involves the use of information from two sources such that there are two sets of bindings that map between the input spaces and from the input spaces to the blend space. Computationally, the blending network constitutes a double scope binding between the pair of inputs and the blend space (Fauconnier, 1997). The double scope binding configuration for creating a blend is illustrated in Figure 5 and is composed of the following elements:
• Input Spaces: a pair of inputs, I1 and I2, to the network along with the models for processing the inputs. In the Footstep network, the inputs were the audio models for generating the footstep sounds.
• Cross-Space Mapping: direct links between corresponding elements in the input spaces I1 and I2, or a mapping that relates the elements in one input to the corresponding elements in the other.
• Generic Space: defines the common structure and organization shared by the inputs and specifies the core cross-space mapping between inputs. The domain ontology for the input models is included in the generic space.
• Blend Space: information from the inputs I1 and I2 is partially projected onto a fourth space containing selective relations from the inputs. Additionally, the blend model inherits structure from the ontologies used in the generic model, as well as specific functions derived from the context of the current user task. The two footstep models were integrated into a single blend model with projection of parameter values.
• Emergent Structure: the blend contains new information, not explicitly present in the inputs, that becomes available as a result of processing the blending network. In the Footstep network, the synchronization of the footsteps at the meeting point in the blend model was the key emergent structure.
The key contribution of Conceptual Integration Theory has been the elaboration of a mechanism for double scope binding for the explanation of metaphor and the
processing of natural language discourse. The basic diagram for the double scope binding from the two input models onto the blend model was previously illustrated in Figure 5. This network is formed using the generic model to perform cross-space mapping between the two input models, then projecting selected parameters of the input models onto the blend model. Once the complete network has been composed, the parameter values are bound and information is continuously propagated while dynamically running the network. We will next discuss the computational theory for blending, then examine the double scope binding configuration applied to audio synthesis, video editing, and presentation mining.

Computational Theory
The blending framework for discovering emergent semantics in media consists of three main components: ontologies that provide a shared description of the domain; operators that apply transformations to the inputs and perform computations on the input models; and an integration mechanism that helps the user discover emergent structure in the media.

Ontologies
Ontologies are a key enabling technology for semantic media. An ontology may be defined as a formal and consensual specification of a conceptualization that provides a shared understanding of a domain; moreover, this understanding can be communicated across people and application systems. Ontologies may be of several types, ranging from the conceptual specification of a domain to an encoding of computer programs and their relationships. In addition to providing a structured representation, ontologies offer the promise of a shared and common understanding of a domain that can be communicated between people and application systems. Thus, the use of ontologies brings together two essential elements for discovering semantics in media:
• Ontologies define a formal semantics for information that can be processed by a computer.
• Ontologies define real-world semantics that facilitate the linking of machine-processable content with user-centric meaning.
Ontologies are used in the media blending framework to model relationships among media elements, as well as to provide a domain model for user-centric operators. This provides a level of abstraction that is critical for model-based approaches to media synthesis. As we have seen in the sound synthesis and automated video editing examples, the input media may come from disparate sources, be described by differing metadata schema, and pose unique processing constraints. Moreover, the intended usage of the media is likely to be different from the initial composition and annotation. Thus the ontologies provide a critical link between the end user and the computer which is necessary for emergent semantics.
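A toy rendering of the ontology idea as subject-predicate-object triples, queried to relate machine-level descriptors to user-level concepts; the concepts, relations and query helper below are assumptions made for the sketch.

# Sketch: a tiny domain ontology as subject-predicate-object triples.
ONTOLOGY = {
    ("footsteps", "is_a", "percussive_sound"),
    ("footsteps", "has_parameter", "walker_weight"),
    ("footsteps", "has_parameter", "surface"),
    ("gravel", "is_a", "surface"),
    ("walker_weight", "maps_to", "amplitude"),
}

def related(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return {o for s, p, o in ONTOLOGY if s == subject and p == predicate}

print(related("footsteps", "has_parameter"))   # {'walker_weight', 'surface'}
print(related("walker_weight", "maps_to"))     # {'amplitude'}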

Operators
The linkages between the two input spaces and the media blend in Figure 5 are supported by a set of core operators. Two of these operators are called projection and
compression. Projection is the process in which information in one space is mapped onto corresponding information in another space. In the video editing domain, projection occurs through the mapping of temporal structures from music onto the duration and sequencing of video clips. The hierarchical structure of music and the editing instructions for the video can both be modeled as a graph. Since each model is represented by a graph structure, projection amounts to a form of graph mapping. In general, the mapping between models is not direct, so the ontology from the generic space is used to construct the transformation that maps information between input spaces. Compression is the process in which detail from the input spaces is removed in the blend space in order to provide a condensed description that can be easily manipulated. Compression is achieved in an audio morph through the low-dimensional control parameters for the transformation between the two input sounds. The system performs these operations in the blended space and projects the results back to any of the available input spaces.

Integration Mechanism
The consequence of defining the operators for projection and compression is that a new integration process, called running the blend, becomes possible within this framework (Fauconnier, 1997). Running the blend is the process of reversing the direction of causality, thereby using the blend to drive the production of inferences in either of the input spaces. In the blend space of the video editing example, the duration of an edited video clip is related to the loudness of the music, and the start and stop times are determined by salient beats in the music. The process of running the blend causes these constraints to propagate back to the music input model to determine the loudness value and the timing of salient beats that satisfy the editing logic of the music video blend. Thus, running the blend means that operations applied in the blend model are projected back to the inputs to derive emergent semantics, such as music-driven video editing. In the case of mining presentations for information, preprocessing by the system analyzes the textbook to extract terms and relations, which are then added as concept instances of a domain ontology within the textbook model. The user seeks to query the courseware for information that combines the temporal sequencing of the video lecture models with the structured organization of the textbook model. This integrated view of the media is constructed by invoking a blending model, such as find path or find similar, to translate the user query into primitives that are suitable for the input models. Once the lecture presentation network has been constructed, the user can run the blend to query the input models for a path through the lecture video that links any two given topics from the textbook. The integration mechanism of this blend provides a parameterized model of a path that can be used to navigate through the media, mine for relationships, or compose answers to the original query. The emergent semantics of this media blending model is a path through the video content that exhibits relationships derived from the textbook. Due to the double scope binding of this network, the blending model can also be used to project information from the video onto the textbook, thereby imposing the temporal sequencing of the video presentation onto the hierarchical organization of the textbook. Additional blending networks, such as the find similar blend, can be used to integrate information from the two input sources to discover similarities.
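The following sketch illustrates running the blend for the music-video case described above: parameters bound in the blend are evaluated against the music input, and loudness and beat information propagate into concrete clip durations and cut times. The data values, the inverse loudness-to-length rule, and the function name run_blend are all assumptions made for the example.

```python
# A minimal sketch of "running the blend": values bound in the blend model are
# pulled from the music input model to fix clip durations and cut points.
# The loudness-to-duration rule and all constants are illustrative assumptions.

music_input = {
    "beats":    [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0],   # salient beat times (s)
    "loudness": [0.2, 0.4, 0.9, 0.8, 0.3, 0.2, 0.5],   # loudness near each beat
}


def run_blend(music, max_clip_len=3.0):
    """Propagate music structure into editing decisions for video clips."""
    decisions = []
    start_index = 0
    while start_index < len(music["beats"]) - 1:
        loudness = music["loudness"][start_index]
        # Blend constraint: clip duration is inversely related to loudness.
        target_len = max_clip_len * (1.0 - loudness)
        # Cuts must land on salient beats: advance to the nearest later beat.
        start = music["beats"][start_index]
        end_index = start_index + 1
        while (end_index < len(music["beats"]) - 1
               and music["beats"][end_index] - start < target_len):
            end_index += 1
        decisions.append({"cut_in": start,
                          "cut_out": music["beats"][end_index],
                          "loudness": loudness})
        start_index = end_index
    return decisions


if __name__ == "__main__":
    for d in run_blend(music_input):
        print(d)
```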

Thus, the integration mechanism for multimedia presentations produces a path in the blend model. The system can then run the blend on the path using all of the emergent properties of the path, such as contracting, expanding, branching, and measuring distance. The specialized domain operators can now be applied and their consequences projected back onto the input spaces. The blend model for presentations encodes strategies that the user may use to locate information, discover the relationships between two topics, or recover from a comprehension failure while viewing the video presentation.

EMERGENT MEDIA SEMANTICS


The ability to derive emergent semantics in the media blending framework depends upon the specification of an ontology, the definition of operators, and processing the integration network to run the blend. This section draws upon the domains of audio synthesis, video editing, and media mining to illustrate the key features of this framework. The combination of a reference ontology with synthesis models facilitates the manipulation of media semantics. The example of an audio morph between two input audio models illustrates the use of model semantics to provide the cross-mapping between models and the use of the domain ontology to discover emergent structure. The blending framework for video editing employs a domain-specific ontology and defines a set of operators that are applied to the configuration of models illustrated in Figure 5. The key operators, as defined in Conceptual Integration Theory (Fauconnier, 1997), enable the construction of a metaphorical space that contains a mapping of selected properties from each of the input models. The emergent semantics of the edited music video is derived from the cross-space mappings of the blending network. The effective use of emergent semantics is obtained through the processing of the blend model. The final part of this chapter uses media mining in the e-learning domain to provide a detailed example of the integration of a domain ontology, operators, and a query mechanism that runs the blend. This shows that the blending framework provides a systematic way to manage the media semantics that emerges from heterogeneous media.

The Audio Morph


Model-based audio synthesis builds on a collection of primitive units and functional modules patched together to build sound models. Units include oscillators, noise generators, filters, event generators, and arithmetic operators. Each unit expects certain types of signals and ranges at its inputs and produces certain types of signals at its outputs. Some units can be classed together (e.g., signal generators), and many bear other types of relationships to one another. The composition of units and modules into sound models is informed and constrained by their parameterizations and the relations between entities, which can be specified as a domain ontology. The ontology might include information such as the fact that an oscillator takes a frequency argument, that a sine wave oscillator is-a oscillator, and that the null condition (when a transform unit has no effect, or the output of a generator is constant) for an oscillator occurs when its frequency is set to 0. Experts have implicit knowledge of these ontologies that they use to build and manipulate model structures. There have been attempts to make these ontologies explicit so that non-experts would have support in achieving their semantically specified modeling objectives (Slaney, Covell, & Lassiter, 1996).
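As a rough illustration of such a unit ontology, the sketch below encodes a few synthesis units with their is-a relations, expected parameter ranges, and null conditions. The particular units, ranges, and null-condition values are assumptions made for the example, not a published ontology.

```python
# A minimal sketch of a sound-synthesis unit ontology: is-a relations,
# parameter ranges, and null conditions.  All entries are illustrative only.

UNIT_ONTOLOGY = {
    "oscillator": {
        "is_a": "signal_generator",
        "params": {"frequency": (0.0, 20000.0), "amplitude": (0.0, 1.0)},
        "null_condition": {"frequency": 0.0},   # constant output
    },
    "sine_oscillator": {
        "is_a": "oscillator",
        "params": {},                            # inherits from oscillator
        "null_condition": None,                  # inherited
    },
    "waveshaper": {
        "is_a": "signal_modifier",
        "params": {"amount": (0.0, 10.0)},
        "null_condition": {"amount": 0.0},       # signal passes unchanged
    },
    "amplitude_modulator": {
        "is_a": "signal_modifier",
        "params": {"depth": (0.0, 1.0), "mod_frequency": (0.0, 200.0)},
        "null_condition": {"depth": 0.0},        # no modulation applied
    },
}


def null_condition(unit_name):
    """Walk up the is-a hierarchy until a null condition is found."""
    unit = UNIT_ONTOLOGY.get(unit_name)
    while unit is not None:
        if unit.get("null_condition"):
            return unit["null_condition"]
        unit = UNIT_ONTOLOGY.get(unit["is_a"])
    return None


if __name__ == "__main__":
    print("sine_oscillator null condition:", null_condition("sine_oscillator"))
    print("waveshaper null condition:", null_condition("waveshaper"))
```

Null conditions of this kind are what later allow two models to be joined so that one submodel can be switched off at an extreme of a morphing parameter.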
Synthesis algorithms are typically represented as a signal flow diagram, as in the MAX/MSP software package. An example of a simple patch for sinusoidal amplitude modulation of a stored audio signal is shown in Figure 6. The representation of datatype, functional, and relational information within a domain ontology of sound synthesis units provides a rich new source of information that can be queried and mined for hints on how to transform or to build new models. The model components and structure define and constrain the behavior of the sound generating objects and relate the dynamics of parametric control to changes in the sound. They do not themselves determine the semantics, but they are more closely allied to semantic descriptions than a flat audio stream is. A model component ontology would thus be useful for making associations between model structures and the semantics that models are given by designers or by the contexts in which their sounds are used. Relationships, such as distances and morphologies between models, could be discovered and exploited based not only on the sounds the models produce, but also on the structure of the models that produce the sounds. Applications of these capabilities include model database management and tools for semantically driven model building and manipulation. The availability of a domain ontology along with model building and manipulation tools provides the key resources for discovering emergent semantics among sound models. An example of emergent semantics comes from the media morph. Most people are familiar with the concept in the realm of graphics, perhaps less so with sound. A morph is the process of smoothly changing one object into another over a period of time. There are several issues that make this deceptively simple concept challenging to implement in sound, not the least of which is that the individual source and target sound objects may themselves be evolving in time. Also, two objects are not enough to determine a morph; they must both have representations in a common space so that a path of intermediate points can be defined.

Figure 6. The central multiplication operation puts a sinusoidal amplitude modulation (coming from the right branch) onto the stored sample signal on the left branch in this simple MAX/MSP patch.

Finally, given two objects in a consistent representational space, there are typically an infinite number of paths that can be traversed to reach one from the other, some of which will be effective in a given usage context, others possibly not. The work on this issue has tended to focus on various ways of interpolating between spectral shapes of recorded sounds (Slaney et al., 1996). This approach works well when the source and target sounds are static so that the sounds can be transformed into a spectral representation. Similar to the case with graphical morphs, corresponding points can be identified on the two objects in this space. A combination of space warping and interpolation is used to move between the source and target. A much deeper and more informative representation of a sound is provided in terms of a sound model. There are several different ways that models can be used to define a morph. The more the model structures can be exploited, the richer are the possible emergent semantics. If two different sounds can be generated by the same model, then a morph can be trivially defined by selecting a path from the parameter setting that generates one to a setting that generates the other. In this case, the blend space is the same as the model for the two sounds, so although the morph may be more interesting than the spectral variety discussed above, no new semantics can be said to emerge. If we are given two sounds, each with a separate model capable of generating the sounds, then the challenge is to find a common representational space in which to create a path that connects the two sound objects. One possible solution would be to define a set of feature detectors (e.g., spectral measurements, pitch, measures of noisiness) that would provide a kind of description of any sound. This solves the problem of finding a common space in which both source and target can be represented. Next, a region of the feature space that the two model sound classes have in common needs to be identified, and paths from the source and target need to be delineated such that they intersect in that region. If the model ranges do not intersect in the feature space, then a series of models with ranges that form a connected subspace needs to be created to support such a path, so that a morph can be built using a series of models, as illustrated in Figure 7. This process requires knowledge about the sound generation capabilities of each model at a given point in feature space. We mentioned earlier that a model is defined not only by the sounds within its range, but also by the paths it can take through the range as determined by the control parameterizations. The dynamic behavior defined by the possible paths plays a key role in any semantics the model might be given. The connected feature space region defines a path between the source and target sounds in a particular way that will create and constrain a semantic interpretation. However, in this case, the new model is less than satisfying because, as a combination of other models, only one of which is active at a time, it cannot actually generate sounds that were not possible with the extant models. Moreover, if the kludging together of models is actually perceived as such, then new semantics fail to arise. Another way to solve the problem would be to embed the two different models into a blended structure where each original model can be viewed as a special case given by specific parameter settings of the meta-model.
This could be done trivially by building a meta-model that merely mixes the audio output from each model separately, with a parameter that controls the relative contribution from each submodel.

Figure 7. A morph in feature space performed using one model that can generate the source sound, another model that can generate the target sound, and a path passing through a point in feature space that both models are capable of generating. If the source-generating model and the target-generating model do not overlap in feature space, intermediate models can be used so that a connected path through feature space is covered.

Again, we have a trivial morph that would not be very satisfying, and because sound mixes from independent sources are generally perceived as mixes rather than as a unified sound from a single source, the semantics of the individual component models would presumably remain clearly perceptible. There are, however, much richer ways of embedding two models into a blended structure such that each submodel is a sufficient description of the meta-model under specific parameter settings. The blended structure wraps the two submodels that generate the morphing source and target sounds and exposes a single reduced set of parameters. In order to create the transformation from source sound to target sound, there must exist at least one setting of the meta-model parameters that produces the original morphing source sound, and one that produces the original morphing target sound. The meta-model parameterization defines the common space in which both of the original sounds exist and in which any number of paths may be constructed connecting the two. We discussed this situation earlier, except that in this case the meta-model is genuinely new, and has its own set of capabilities and constraints defined by the relationship between the structures of the two original models, but present in neither. New semantics emerge from the domain ontology, the mappings between models, and the integration network created in the blend. As a concrete example of an audio morph with emergent semantics, consider two different sounds: one the result of waveshaping on a sinusoid, the other the result of amplitude modulation of a sampled noise source, as illustrated in Figure 8. Each structure creates a distinctive kind of distortion of the input signal. One way of combining these two models into a meta-model is shown in Figure 9. To combine these two models, we use knowledge about the constituent components of the models, which could be exploited automatically if they were represented as a formal ontology as discussed above.
Figure 8. Two kinds of signal distortion. a) This patch puts the recorded sample through a non-linear transfer function (tanh). The amount of effect determines how nonlinear the shaping is, with zero causing the original sample to be heard unchanged. b) A sinusoidal amplitude modulation of the recorded sample.

In particular, knowing the input and output types and ranges for signal modifying units, and knowing specific parameter values for which the modifiers have no effect on the signal (the null condition), we can structure the model for morphing. Knowledge of null conditions, in particular, was used so that the effect of one submodel on the other would be nullified at the extreme values of the morphing parameter. Using knowledge of the modifier units' signal range expectations and transformations allows the models to be integrated at a much deeper structural level than treating the models as black boxes would permit. Most importantly, blending the individual model structures creates a genuinely new model capable of a wide range of sounds that neither submodel was capable of generating alone, yet including the specific sounds from each submodel that were the source and the target sounds for the morph. A new range of sounds implies new semantic possibilities. New semantics can be said to arise in another respect as well. In the particular blend illustrated in Figure 9, most of the parameters exposed by the original submodels are still available for independent control. At the extreme values of the morphing parameter, the original controls have the same effect that they had in their original context. However, at in-between values of the morphing parameter, the controls from the submodels have an effect on the sound output that is entirely new and dependent upon the particular submodel blend that is constructed. This emergent property is not present in the trivial morph described earlier, which merely mixed the audio output of the two submodels individually. Since a morph between two objects is not completely determined by the endpoints, but by the entire path through the blend space, there is a creative role for a human-in-the-loop to complete the specification of the morph according to the usage context.
Figure 9. Both the waveshaping (WS) and the amplitude modulation (AM) models embedded in a single meta-model. When the WS vs. AM morphing parameter is at one extreme or the other, we get only the effect of either the WS or the AM model individually. When the morph parameter is at an in-between state, we get a variety of new combinations of waveshaping of the AM signal and/or amplitude modulation of the waveshaped signal (depending on the other parameter settings).
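A minimal numerical sketch of such a meta-model is given below, assuming the structure suggested by Figures 8 and 9: a waveshaping (tanh) stage and a sinusoidal amplitude-modulation stage are chained, and a single morph parameter exploits the null conditions of each stage (zero shaping amount, zero modulation depth) so that each extreme reproduces one original submodel. The parameter scalings and test signal are invented for illustration.

```python
# A sketch of the WS/AM meta-model, assuming the structure of Figures 8 and 9.
# morph = 0.0 -> pure waveshaping (AM held at its null condition, depth = 0)
# morph = 1.0 -> pure amplitude modulation (WS held at its null condition, amount = 0)
# Intermediate values blend the two distortions; scalings are illustrative.
import numpy as np


def waveshape(signal, amount):
    """Nonlinear tanh waveshaping; amount = 0 is the null condition."""
    if amount == 0.0:
        return signal
    return np.tanh(amount * signal) / np.tanh(amount)


def amplitude_modulate(signal, depth, mod_freq, sample_rate):
    """Sinusoidal amplitude modulation; depth = 0 is the null condition."""
    t = np.arange(len(signal)) / sample_rate
    modulator = 1.0 - depth + depth * np.sin(2 * np.pi * mod_freq * t)
    return signal * modulator


def meta_model(signal, morph, ws_amount=4.0, am_depth=1.0, am_freq=30.0,
               sample_rate=44100):
    """Blend of the WS and AM submodels controlled by a single morph parameter."""
    shaped = waveshape(signal, (1.0 - morph) * ws_amount)
    return amplitude_modulate(shaped, morph * am_depth, am_freq, sample_rate)


if __name__ == "__main__":
    sr = 44100
    t = np.arange(sr) / sr
    source = 0.8 * np.sin(2 * np.pi * 220.0 * t)   # stand-in for a stored sample
    for morph in (0.0, 0.5, 1.0):
        out = meta_model(source, morph, sample_rate=sr)
        print(f"morph={morph:.1f}  peak={np.max(np.abs(out)):.3f}")
```

At in-between morph values, both stages are active at once, which is exactly where sounds arise that neither submodel could produce alone.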

We have shown that, given knowledge in the form of an ontology about how elementary units function within models, structures from different models can be combined in a way that gives rise to new sound ranges and new handles for control. Semantics emerge that are related to those of the model constituents, but in rich and complex ways. There are, in general, many ways that sound models may be combined to form new structures. Some combinations may work better in certain contexts than others. How desired semantics can be used to guide the construction process is a topic that warrants further study.

The Video Edit


The proliferation of digital video cameras has enabled the casual user to easily create recordings of events. However, these raw recordings are of limited value unless they are edited. Unfortunately, manual editing of home videos requires considerable time and skill to create a compelling result.
Over many decades of experimentation, the film industry has developed a grammar for the composition of space, time, image, music, and sound that is routinely used to edit film (Sharff, 1982). The mechanisms that underlie cinematic editing can be described as a blending network that leverages the cognitive perceptions of audio and imagery to create a compelling story. The Video Edit example borrows from such cinematic editing techniques to construct a blending network for the semi-automatic editing of home video. In this network, the generic model is an encoding of the cinematic editing rules relating music and video, and the input models represent structural features of the music and visual events in the video. Encoding aesthetic decisions for editing video to music is key to creating the blending model. Traditional film production techniques start by creating the video track, then add sound effects and music to enhance the affective qualities of the video. This is a highly labor-intensive process that does not lend itself well to automation. In the case of the casual user with a raw home video, the preferred editing commands emphasize functional operations, such as selecting the overall style of cinematic editing, choosing to emphasize people in the video, selecting the music, and deciding how much native audio to include in the final production. The generic model for the music and video inputs is a collection of editing units that describe simple relations between fragments of audio and video. Each unit captures partial information associated with a cinematic editing rule, so the units in the generic model can be composed in a graph structure to form more complex editing logic. One example, an insertion unit, specifies that the length of a video clip to be inserted should be inversely proportional to the loudness of the music. During the construction of the blending network, the variables for video length and music loudness are bound, and specific values are propagated during the subsequent running of the blend to dynamically produce the final edited video. Another example, a transition unit, specifies how two video clips are to be spliced together. When this unit is added to the graph structure, it specifies the type of transition between video clips, the duration of the transition, and the inclusion of audio or graphical special effects. Yet another insertion unit may relate the timing and visual characteristics of people in the video to various structural features in the music. The generic model may therefore be viewed as an ontology of simple editing units that can be composed into a graph structure by the blending model for subsequent editing of the music and video inputs. The video input model contains the raw video footage plus the shots detected in the video, where each shot is a sequence of contiguous video frames containing similar images. The raw video frames are then analyzed in terms of features for color, texture, and motion, as well as simple models for the existence of people in the shot or other salient events. This analysis provides the metadata for a model representing the input video for use in the subsequent video editing. For example, the video model includes techniques for finding parts of the shots that contain human faces. The information about faces can be combined with the editing logic to create a final production which emphasizes the people in the video. In this way, the system can automatically construct a people-oriented model of the input video.
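The sketch below encodes two such editing units as simple data structures, assuming an insertion rule that shortens clips as the music gets louder and a transition chosen from a small set of types. The class names, thresholds, and style labels are hypothetical, intended only to show how unit parameters might be bound when the network is constructed.

```python
# A minimal sketch of generic-model editing units.  The inverse
# loudness-to-length rule and the transition choices are illustrative.
from dataclasses import dataclass


@dataclass
class InsertionUnit:
    """Relates clip length to music loudness (louder music -> shorter clips)."""
    max_length: float = 6.0   # seconds, assumed upper bound
    min_length: float = 1.0

    def clip_length(self, loudness):
        # loudness in [0, 1]; length is inversely proportional to loudness.
        length = self.max_length * (1.0 - loudness)
        return max(self.min_length, length)


@dataclass
class TransitionUnit:
    """Specifies how two adjacent clips are spliced together."""
    transition_type: str = "cut"     # e.g., "cut", "dissolve", "wipe"
    duration: float = 0.0            # seconds of overlap
    effect: str = ""                 # optional audio/graphical effect

    @classmethod
    def for_style(cls, style):
        # Hypothetical mapping from a user-selected editing style.
        if style == "music_video":
            return cls("cut", 0.0, "flash")
        if style == "documentary":
            return cls("dissolve", 1.0, "")
        return cls()


if __name__ == "__main__":
    insert = InsertionUnit()
    print("length at loudness 0.9:", insert.clip_length(0.9))
    print("length at loudness 0.2:", insert.clip_length(0.2))
    print("transition:", TransitionUnit.for_style("music_video"))
```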
The model for the input music needs to support the editing logic of the generic model for cinematic editing and the high-level commands from the user interface. The basic music model is composed of a number of parameters, including the tempo, rhythm, and loudness envelope of the music.

Figure 10. The blending network for automatic editing of video according to the affective structures in the music and operators for different cinematic styles.
(Diagram components: Styles; Music; Raw Video; Music Analysis; Video Analysis; Music Data; Music Description; Video Description; Video Data; Composition Logic; Media Production)

The combined inputs of the video model and the music model in Figure 10 are integrated with the cinematic styles in the blending model to produce a series of editing decisions: when to make cuts, which kinds of transitions to use, what effects to add, and when to add them. The composition logic in the blend model integrates information from three places: a video description produced by the video analysis, a music description produced by the music analysis, and information about the desired editing style as selected by the user. The composition logic uses the blending model to combine these three inputs in order to make the best possible production from the given material: one which is as stylish and artistically pleasing as possible. It does this by representing the blended media construction as a graph structure and opportunistically selecting content to complete the media graph. This process results in the emergent semantics of a music video that inherits partial semantics from the music and from the video.
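A compact sketch of such composition logic is shown below, assuming simplified music and video descriptions (beat times, loudness values, and detected shots) and echoing the insertion-unit rule sketched earlier. It opportunistically walks the shot list to fill a beat-aligned edit decision list; the data, the style rule, and the compose function are all illustrative assumptions.

```python
# A sketch of the composition logic: blend a music description, a video
# description, and a style choice into an edit decision list (EDL).
# The descriptions, the style rule, and the clip-length rule are assumptions.

music_description = {
    "beats":    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "loudness": [0.2, 0.7, 0.9, 0.4, 0.3, 0.8, 0.5],
}
video_description = {
    # Detected shots: (start, end, has_face)
    "shots": [(0.0, 4.0, True), (4.0, 9.0, False), (9.0, 15.0, True)],
}


def compose(music, video, style="music_video"):
    """Produce an EDL by binding shot segments to beat-aligned slots."""
    if style == "music_video":
        # People-oriented style: use face shots first.
        shots = sorted(video["shots"], key=lambda s: not s[2])
    else:
        shots = list(video["shots"])
    edl, shot_index = [], 0
    for beat, loudness in zip(music["beats"], music["loudness"]):
        if shot_index >= len(shots):
            break
        start, end, has_face = shots[shot_index]
        # Louder music -> shorter inserted clip (cf. the insertion unit).
        clip_len = min(3.0 * (1.0 - loudness) + 0.5, end - start)
        edl.append({"music_time": beat,
                    "source_in": start,
                    "source_out": start + clip_len,
                    "face": has_face})
        shot_index += 1
    return edl


if __name__ == "__main__":
    for decision in compose(music_description, video_description):
        print(decision)
```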

The Presentation
The Presentation example illustrates the use of emergent structure to facilitate information retrieval from online e-learning courseware. A simple form of emergent structure is a path that combines concept relationships from a textbook with temporal sequencing from the video presentation. The path structure can then be manipulated to gain insight into the informational content of the courseware by performing all of the standard operations afforded by paths, such as traversal, compression, expansion, branching, and the measurement of distance.

Find Path Scenario


As distance education expands by placing ever larger quantities of course content online, students and instructors have increasing difficulty navigating the courseware and assimilating complex relationships from the multimedia resources. The separation in space and time between students and teachers also makes it difficult to effectively formulate questions that would resolve media-based comprehension failures. The find path scenario addresses this problem by combining media blending with information retrieval to assist the student in formulating complex queries about lecture video presentations. Suppose that a student is halfway through a computer science course on the Introduction to Algorithms, which teaches the computational theory behind popular search optimization techniques (Cormen, Leiserson, Rivest, & Stein, 2001). The topic of Dynamic Programming was introduced early in the course; then, after several intervening topics, the current topic of Greedy Algorithms is presented. The student realizes that these two temporally distant topics are connected, but the relationship is not evident from the lecture presentations due to the temporal separation and the web of intervening dependencies between topics. To resolve this comprehension failure, the student would like to compose a simple query, such as "Find a path linking greedy algorithms to dynamic programming," and have the presentation system identify a sequence of locations in the lecture video which can be composed to provide an answer to the query. Note that the path that the student is seeking does not exist in either the textbook or the lecture video. The textbook contains a hierarchical organization of topics and instances of concept relations. The sequencing of topics from the beginning of the book to the end provides a linear ordering of topics, but not necessarily a chronological ordering. The lecture videos do not contain the semantics of a path since they have a purely chronological sequencing of content. As we shall see, partial structure from the textbook and the video must be projected onto the blend to create the emergent structure of the path.

Media Blending Network


Traditional approaches to media integration are based on the use of a generic model to provide a unified indexing scheme for the input media plus cross-media mapping (Benitez & Chang, 2002). The media blending approach adds the blend model as an integral part of the integration mechanism. This has two consequences. Firstly, the blend model adds considerable richness to the media semantics in terms of the operators that can be applied, such as projection, compression, and the propagation of information back into the input spaces. Secondly, by explicitly providing for emergent structures in the computational framework, we can potentially achieve a higher level of integration of multimedia resources. Of particular interest for the Presentation network is the semantics of a path that emerges from the media blend. Once the path is obtained, all of the common operations on paths, such as expand, compress, extend, append, branch, and the measurement of distance, can be applied to the selected media elements. The configuration of models used to construct the Presentation network is illustrated in Figure 11. The textbook model provides one input to the network. This model contains instances of terms and relations between terms that can be extracted through standard text processing techniques.
Figure 11. Media integration in the Presentation network for the Find Path blend.

(Diagram components: Textbook Model, Generic Model, Lecture Model, Path Blend, and Emergent Structure; element labels include ontology, term, topic, text, sequence, slide, transcript, segment, video, time, context, media, and path)

The instances of terms and relations are mapped onto the abstract concepts of the domain ontology. The ontology is subsequently converted to a graph structure for efficient search. The textbook has explicit structure due to the hierarchical organization of chapters and topics, as well as the table of contents and index. There is also implicit structure in the linear sequencing of topics and the convention among textbooks that simpler material comes before more complex material. The lecture model provides the second input, which represents the lecture video, transcripts, and accompanying slide presentation. The metadata for the lecture model can be derived automatically through the analysis of perceptual events in the video to classify shots according to the activity of the instructor. The text of the transcripts can be analyzed to extract terms and indexed for text-based queries. The slide presentation and associated video time stamps provide an additional source of key terms and images that can be cross-mapped to the textbook model. The generic model for the Presentation network contains the core domain ontology of terms and relations used in the course. As we shall see later, the cross-space mapping between the textbook model and the lecture model occurs at the level of extracted terms and their locations in the respective media. The concepts in the core ontology thus provide a unified indexing scheme for the term instances that occur as a result of media processing in the two input models. The blend model for the find path scenario receives projections of temporal sequence information from the lecture video and term relations from the textbook.
When the user issues a query to find a path in the lecture video that goes from topic A to topic B, the blend model first accesses the textbook model to expand the query terms associated with the topics A and B. The graph representation of the textbook model is then searched for a path linking these topics. Once a path is found among the textbook terms, the original query is expanded into a sequence of lecture video queries, one for each of the terms plus local context from the textbook. The blend model then evaluates each video query and assembles the selected video clips into the final path structure (see insert in Figure 11). At this point, the blend model has fully instantiated the path as a blend of the two inputs. Once the path has been constructed, the user can run the blend to perform various operations on the path. Note that the blending network has added mappings between the input models, but has not modified the original models. Thus, the path blend can, for instance, be used to project the temporal sequencing from the time stamps of the lecture video back onto the textbook model to construct a navigational path in the textbook with sequential dependencies from the video.
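The sketch below approximates that sequence of steps: a term graph extracted from the textbook is searched for a path between the two query topics, each term on the path is expanded into a query over time-stamped lecture segments, and the matching clips are assembled into the path structure. The toy term graph, the lecture index, the timings, and the function names are all invented for illustration.

```python
# A sketch of the find-path blend: search the textbook term graph, expand the
# result into per-term video queries, and assemble the answering clips.
# The term graph and lecture index below are hypothetical.
from collections import deque

TEXTBOOK_GRAPH = {
    "dynamic programming": ["optimal substructure", "memoization"],
    "optimal substructure": ["greedy algorithms"],
    "memoization": [],
    "greedy algorithms": [],
}

# Lecture index: term -> list of (start_time, end_time) video segments.
LECTURE_INDEX = {
    "dynamic programming": [(120.0, 410.0)],
    "optimal substructure": [(455.0, 600.0), (2310.0, 2390.0)],
    "greedy algorithms": [(2400.0, 2750.0)],
}


def find_term_path(graph, source, target):
    """Breadth-first search over the textbook term graph."""
    queue, seen = deque([[source]]), {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


def run_find_path_blend(source, target):
    """Expand the term path into a sequence of lecture video clips."""
    term_path = find_term_path(TEXTBOOK_GRAPH, source, target)
    if term_path is None:
        return []
    clips = []
    for term in term_path:
        for start, end in LECTURE_INDEX.get(term, []):
            clips.append({"term": term, "start": start, "end": end})
    # Impose the lecture's temporal ordering on the assembled path.
    return sorted(clips, key=lambda c: c["start"])


if __name__ == "__main__":
    for clip in run_find_path_blend("dynamic programming", "greedy algorithms"):
        print(clip)
```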

Operators
Mappings between models in the Presentation network in Figure 11 support a set of core operators for information retrieval in mixed media. Two of these operators are called projection and compression. As seen in previous examples, projection is the process in which information in one model is mapped onto corresponding information in another model. Since both input models are represented by a graph structure, where links between nodes are relations, projection between inputs amounts to a form of graph matching to identify corresponding elements. These elements are then bound so that information can pass directly between the models. A second source of binding occurs between each input model and the emergent structure that is constructed in the blend model. This double scope binding enables the efficient projection of information within the network. Compression is another core operator of the blending network that supports media management through semantics. For example, traditional methods for constructing a video summary require the application of specialized filters to identify relevant video segments. The segments are then composed to form the final summary. Instead of operating directly on the input media, compression operates on the emergent path structure and projects the results back to the input media. Thus, by operating on the path blend, one can derive the shortest-time path, the most densely connected path, or the path with the fewest definitions from the lecture video. The system performs these operations on the blended model and projects the results back to either of the available input models to determine the query result. The consequence of defining the operators for projection and compression is that a new process, called running the blend, becomes possible within this framework. Running the blend is the process of reversing the direction of causality within the network, thereby using the blend to drive the production of inferences in either of the input spaces. In the find path example, the application of projection and compression on the path blend means that the user can manage the media using higher-level semantics. Moreover, all of the standard operations on paths, such as contracting, expanding, and reversing, can now be performed and their consequences projected back onto the input spaces. Finally, the user can perform a series of queries in the blend and project the results back to the inputs to view the results.
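The sketch below illustrates compression operating on the emergent path rather than on the media: given clip sequences of the form produced in the previous sketch, it selects the shortest-time variant and condenses a path to fit a viewing budget before projecting the result back onto the lecture timeline. The selection rules and the duration budget are assumptions made for the example.

```python
# A sketch of compression applied to the emergent path structure rather than
# to the raw media.  Clip dicts follow the form used in the previous sketch.

def total_duration(path):
    return sum(clip["end"] - clip["start"] for clip in path)


def shortest_time_path(candidate_paths):
    """Compression across candidates: keep the path with the least viewing time."""
    return min(candidate_paths, key=total_duration)


def compress_path(path, max_duration):
    """Compression within a path: keep the shortest clips that fit the budget,
    then project the result back onto the lecture timeline."""
    kept = sorted(path, key=lambda c: c["end"] - c["start"])
    selection, budget = [], max_duration
    for clip in kept:
        length = clip["end"] - clip["start"]
        if length <= budget:
            selection.append(clip)
            budget -= length
    # Project back: restore the lecture's temporal order.
    return sorted(selection, key=lambda c: c["start"])


if __name__ == "__main__":
    path_a = [{"term": "dynamic programming", "start": 120.0, "end": 410.0},
              {"term": "greedy algorithms", "start": 2400.0, "end": 2750.0}]
    path_b = [{"term": "optimal substructure", "start": 455.0, "end": 600.0},
              {"term": "greedy algorithms", "start": 2400.0, "end": 2750.0}]
    best = shortest_time_path([path_a, path_b])
    print("shortest-time path:", [c["term"] for c in best])
    print("compressed to 200 s:", compress_path(best, 200.0))
```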

Integration Mechanism
The network of models in the Presentation blend provides an integration mechanism for the multimedia resources. Once the network is constructed, it is possible to process user queries by running the blend. In the find path scenario, the student began with a request to find a relationship between Dynamic Programming (DP) and Greedy Algorithms (GA). The system searches the domain ontology of the textbook model to discover a set of possible paths linking DP with GA, subject to user preferences and event descriptions. The user preferences, event descriptions, and relations among the path nodes in the ontology are used to formulate a focused search for similar content in the video presentation. The resultant temporal sequence of video segments is added to the emergent path structure in the blend model. As discussed previously, when the student requested to find a path from topic DP to topic GA, a conceptual blend was formed which combined the ontology from the textbook with the temporal sequencing of topics from the lecture. The result was a chronological path through the sequence of topics linking DP to GA. This path can now be used in an intuitive way to compress time, expand the detail, select alternative routes, or combine with another path. The resultant path through a sequence of interrelated media segments in the find path blend is the emergent structure arising from the processing of the user's query. Thus, one can now start from the constructed path and project information back onto the input spaces to mine for additional information that was previously inaccessible. For example, one could use the path to select a sequence of text locations in the textbook that correspond to the same chronological presentation of the topics that occurs in the lecture. In effect, the blending network uses the instructor's knowledge about the pedagogical sequencing of topics to provide a navigational guide through the textbook. We have designed a system for indexing mixed-media content using text, audio, video, and slides, and for segmenting the content into various lecture components. The GUI for the ontology-based exploration and navigation of lecture videos is shown in Figure 12. By manipulating these lecture components in the media blend, we are able to present media information to support insight generation and aid in the recovery from comprehension failures during the viewing of lecture videos. This dynamic composition of cross-media blends provides a malleable representation for generating insights into the media content by allowing the user to manage the media through the high-level semantics of the media blend.

FUTURE TRENDS
The development of the Internet and the World Wide Web has led to the globalization of text-based exchanges of information. The subsequent use of Web services for the automatic generation of Web pages from databases, for both human and machine communication, is being facilitated by the development of Semantic Web technologies. Similarly, we now have the capacity to capture and share multimedia content on a large scale. Clearly, the plethora of pre-existing digital media and the popularization of multimedia applications among non-professional users will drive the demand for authoring tools that provide a high level of automation.

Figure 12. User interface for the Presentation network. Display contains the following frames (clockwise from top left corner): Path Finder, Video Player, Slide, Slide Index, Textbook display, and a multiple timeline display of search results presented as gradient color hotspots on a bar chart.

Preliminary attempts toward the use of models for the generation of sound effects for games and film, as well as the retrieval of video from databases, have been primarily directed toward human-to-human communication. The increasing use of generative models for media synthesis and the ability to dynamically construct networks for combining these models will create new ways for people to experience media. Since the semantics of the media is not fixed, but arises from the media and the way that it is used, the discovery of emergent semantics through ontology-based operations is becoming a significant trend in multimedia research. The convergence of generative models, automation, and ontologies will also facilitate the exchange of media information between machines and support the development of a media semantic web. In order to realize these goals, further progress is needed in the following technologies:

- Tools to support the development of generative models for media synthesis.
- Use of semantic descriptions for the composition of models into blending networks.
- Automation of user tasks based on the mediation between blending networks, as an extension to the ongoing research in database schema mediation.
- Formalization of ontologies for domain knowledge, synthesis units, and relationships among generative models.
- Discovery and generation of emergent semantics.

The trend toward increasing the automation of media production through the creation of media models relies upon the ability to manage the semantics that emerges from user-centric operations on the media.

CONCLUSIONS
In this chapter we have presented a framework for media blending that has proved useful for discovering emergent semantics. Concrete examples drawn from the domains of video editing, sound synthesis, and the exploration of multimedia content for lecture-based courseware have been used to illustrate the key components of the framework. Ontologies for sound synthesis components and the perceptual relations among sounds were used to describe how emergent properties arise from the morphing of two audio models into a new model. From the domain of automatic home video editing, we have described how the basic operators of projection and compression lead to the emergence of a stylistically edited music video with combined semantics of the source music and video. In the video presentation example, we have shown how multiple media-specific ontologies can be used to transform high-level user queries into detailed searches in the target media. In each of the above cases, the discovery and/or generation of emergent semantics involved the integration of descriptions from four distinct spaces. The two input spaces contain the models and metadata descriptions of the source media that are to be combined. The generic space contains the domain-specific information and mappings that relate elements in the two input spaces. Finally, the blend space is where the real work occurs for combining information from the other spaces to generate a new production according to audio synthesis designs, cinematic editing rules, or navigational paths in presentations, as discussed in this chapter.

REFERENCES
Benitez, A. B., & Chang, S. F. (2002). Multimedia knowledge integration, summarization and evaluation. Proceedings of the 2002 International Workshop on Multimedia Data Mining, Edmonton, Alberta, Canada (pp. 39-50).

Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms. Cambridge, MA: MIT Press.

Davis, M. (1995). Media Streams: An iconic visual language for video representation. In Baecker, R. M., Grudin, J., Buxton, W. A. S., & Greenberg, S. (Eds.), Readings in human-computer interaction: Toward the year 2000 (2nd ed.) (pp. 854-866). San Francisco: Morgan Kaufmann Publishers.

Dorai, C., Kermani, P., & Stewart, A. (2001, October). E-learning media navigator. Proceedings of the 9th ACM International Conference on Multimedia, Ottawa, Canada (pp. 634-635).

Fauconnier, G. (1997). Mappings in thought and language. Cambridge, UK: Cambridge University Press.

Funkhouser, T., Kazhdan, M., Shilane, P., Min, P., Kiefer, W., Tal, A., et al. (2004, August). Modeling by example. ACM Transactions on Graphics (SIGGRAPH 2004).

Kellock, P., & Altman, E. J. (2000). System and method for media production. Patent WO 02/052565.

Kovar, L., & Gleicher, M. (2003). Flexible automatic motion blending with registration curves. Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, San Diego, California (pp. 214-224).

Mott, R. L. (1990). Sound effects: Radio, TV and film. Focal Press.

Nack, F., & Hardman, L. (2002, April). Towards a syntax for multimedia semantics. CWI Technical Report INS-R0204.

Rolland, P.-Y., & Pachet, F. (1995). Modeling and applying the knowledge of synthesizer patch programmers. In G. Widmer (Ed.), Proceedings of the IJCAI-95 International Workshop on Artificial Intelligence and Music, 14th International Joint Conference on Artificial Intelligence, Montreal, Canada. Retrieved June 1, 2004, from http://citeseer.ist.psu.edu/article/rolland95modeling.html

Russolo, L. (1916). The art of noises. Barclay Brown (translation). New York: Pendragon Press.

Santini, S., Gupta, A., & Jain, R. (2001). Emergent semantics through interaction in image databases. IEEE Transactions on Knowledge and Data Engineering, 337-351.

Sharff, S. (1982). The elements of cinema: Toward a theory of cinesthetic impact. New York: Columbia University Press.

Slaney, M., Covell, M., & Lassiter, B. (1996). Automatic audio morphing. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Atlanta, 1-4. Retrieved June 1, 2004, from http://citeseer.nj.nec.com/slaney95automatic.html

Smalley, D. (1997). Spectromorphology: Explaining sound shapes. Organized Sound, 2(2), 107-126.

Staab, S., Maedche, A., Nack, F., Santini, S., & Steels, L. (2002). Emergent semantics. IEEE Intelligent Systems: Trends & Controversies, 17(1), 78-86.

Thompson Learning (n.d.). Retrieved June 1, 2004, from http://www.thompson.com/

Veale, T., & O'Donoghue, T. (2000). Computation and blending. Cognitive Linguistics, 11, 253-281.

WebCT (n.d.). Retrieved June 1, 2004, from http://www.webct.com

Glossary

B Frame: One of three picture types used in MPEG video. B pictures are bidirectionally predicted, based on both previous and following pictures. B pictures usually use the fewest bits. B pictures do not propagate coding errors since they are not used as a reference by other pictures.

Bandwidth: There are physical constraints on the amount of data that can be transferred through a specific medium. The constraint is measured in terms of the amount of data that can be transferred over a measure of time and is known as the bandwidth of the particular medium. Bandwidth is measured in bps (bits per second).

Bit-rate: The rate at which a presentation is streamed, usually expressed in kilobits per second (Kbps).

bps: Bits per second.

Compression/Decompression: A method of encoding/decoding signals that reduces the data rate needed, allowing transmission (or storage) of more information than the medium would otherwise be able to support.

Extensible Markup Language: See XML.

Encoding/Decoding: Encoding is the process of changing data from one form into another according to a set of rules specified by a codec. The data is usually a file containing audio, video, or still images. Often the encoding is done to make a file compatible with specific hardware (such as a DVD player) or to compress or reduce the space the data occupies.

Feature: A Feature is a distinctive characteristic of the data which signifies something to somebody. A Feature for a given data set has a descriptor (i.e., feature representation) and an instantiation (descriptor value); for example, colour histograms.

Frame Rate: The number of images captured or displayed per second. A lower frame rate produces a less fluid motion and saves disk space. A higher setting results in a fluid image and a larger movie file.

GIF (Graphics Interchange Format): A common format for image files, especially suitable for images containing large areas of the same color.

GUI (Graphical User Interface): The term given to the set of items and facilities which provide the user with a graphic means for manipulating screen data rather than being limited to character-based commands. Graphical user interface toolkits are provided by many different vendors and contain a variety of components including (but not limited to) tools for creating and manipulating windows, menu bars, status bars, dialogue boxes, pop-up windows, scroll or slide bars, icons, radio buttons, and online and context-dependent help facilities. Graphical user interface toolkits may also provide facilities for using a mouse to locate and manipulate on-screen data and activate program components.

Hyperlink: A synonym for link. A hyperlink links a document to another document, or a document fragment to another document fragment. A link in a document can be activated, and the user is then taken to the linked document or fragment. In Web pages, links were originally underlined to indicate the presence of a link.

I Frame: An I frame is encoded as a single image, with no reference to any past or future frames. Often video editing programs can only cut MPEG-1 or MPEG-2 encoded video on an I frame, since B frames and P frames depend on other frames for encoding information.

Index: An index is a database feature used for locating data quickly within a table. Indexes are defined by selecting a set of commonly searched attribute(s) on a table and using the appropriate platform-specific mechanism to create an index.

Interactive Multimedia: Term used interchangeably with multimedia whose input and output are interleaved, like a conversation, allowing the user's input to depend on earlier output from the same run.

Internet Protocol: IP provides a datagram service between hosts on a TCP/IP network; TCP runs on top of IP. IP routes the packets of data that are transmitted to the correct host. IP also takes apart datagrams and puts them back together again.

ISO (International Organization for Standardization): The ISO is an international body that certifies standards ranging from visual signs (such as the i-sign for information, seen at places such as airports and information kiosks) to language characters (such as proposed by ANSI and ASCII).

JPEG (Joint Photographic Experts Group): JPEG is most commonly mentioned as a format for image files. The JPEG format is preferred to the GIF format for photographic images, as opposed to line art or simple logo art.

Key Frame: In movie compression, the key frame is the baseline frame against which other frames are compared for differences. The key frames are saved in their entirety, while the frames in between are compressed based on their differences from the key frame.

Lossless: A video/image compression method that retains all of the information present in the original data.

Metamodel: A metamodel is a formal model used to create a lower-level model. A metamodel consists of the rules and conventions to be used when the lower-level model is created.

MPEG (Moving Picture Experts Group): A set of ISO and IEC standards. MPEG-1: audio (MPEG-1 Layer 2 (MP2) or Layer 3 (MP3), an audio codec system used on the Internet offering good compression ratios) and a widely used video compression format. MPEG-2: AAC audio (Advanced Audio Coding), an enhanced video compression format (also used in DVD and HDTV applications), and A/V transport and synchronization. MPEG-4: adapts the framework for the Internet (application layer set); for example, BIFS offers alternatives and extensions to SMIL and VRML, together with more advanced audio coding techniques (very low data rates, scalable data rates, speech-audio combinations), multiple delivery media, and mixed content. MPEG-7: (meta)search and identification techniques for media streams; a multimedia content description interface. MPEG-21: builds an infrastructure for the delivery and consumption of multimedia content.

MPEG-7 Descriptor: A Descriptor is a representation of a Feature. It defines the syntax and semantics of Feature representation. A single Feature may take on several descriptors for different requirements.

Multimedia: The use of computers to present text, graphics, video, animation, and sound in an integrated way.

Noninterlaced: The video signal created when frames or images are rendered from a graphics program. Each frame contains a single field of lines drawn one after another.

P Frame: A P-frame is a video frame encoded relative to the past reference frame. A reference frame is a P- or I-frame. The past reference frame is the closest preceding reference frame.

Pixel: A picture element; images are made of many tiny pixels. For example, a 13-inch computer screen is made of 307,200 pixels (640 columns by 480 rows).

Protocol: A protocol is a standard of technical conventions allowing communication between different electronic devices. It consists of a set of rules for the communication between devices.

Query: Queries are the primary mechanism for retrieving information from a database and consist of questions presented to the database in a predefined format. Many database management systems use the Structured Query Language (SQL) standard query format.

QuickTime: A desktop video standard developed by Apple Computer. QuickTime Animation was created for lossless compression of animated movies, and QuickTime Video was created for lossy compression of desktop video.

Scene: A meaningful segment of the video.

Schema: A schema is a mechanism for describing the structure of data, similar to defining data types.

SQL (Structured Query Language): A specialized programming language for sending queries to databases.

Streaming: Multimedia files are typically large, and a user does not want to wait until the entire file is received through an Internet connection. Streaming makes it possible for a portion of a file's content to be viewed before the entire file is received. The data of the file is continuously sent from the server, and while it is loading the user can begin viewing the streamed data.

TCP (Transmission Control Protocol): The TCP protocol ensures the reliable transmission of data between two hosts. Information is transmitted in packets.

TCP/IP: The TCP and IP protocols in combination form the basic protocol of the Internet.

URL (Uniform Resource Locator): The standard way to give the address of any resource on the Internet that is part of the World Wide Web (WWW).

Web (WWW) (World Wide Web): The universe of hypertext (HTTP) servers, which allow text, graphics, sound files, and so forth, to be mixed together.

XML (Extensible Markup Language): XML is a simplified form (application profile) of SGML (Standard Generalized Markup Language [ISO 8879]), the international standard for markup languages. XML allows one to create one's own markup tags and is designed for use on the World Wide Web.

About the Authors


Uma Srinivasan worked as principal research scientist at CSIRO, leading a team of scientists and engineers in two specialist areas: Multimedia Delivery Technologies and Health Data Integration. She holds a PhD from the University of New South Wales, Australia and has over 20 years experience in Information Technology research, consultancy and management. Her work in the area of video semantics and event detection has been published in several international conferences, journals and book chapters. She has been an active member of the MPEG-7 conceptual modelling group. She has been particularly successful in working with multi-disciplinary teams in developing new ideas for information architectures and models, many of which have been translated into elegant information solutions. She serves as a visiting faculty at the University of Western Sydney, Australia. Dr. Srinivasan is now the founder and director of PHI Systems, a company specialising in delivering Pervasive Health Information technologies.

Surya Nepal is a senior research scientist working on the consistency management in long-running Web service transactions project at the CSIRO ICT Centre. His main research interest is in the development and implementation of algorithms and frameworks for multimedia data management and delivery. He obtained his BE from Regional Engineering College, Surat, India; ME from the Asian Institute of Technology, Bangkok, Thailand; and PhD from RMIT University, Australia. At CSIRO, Surya undertook research into content-management problems faced by large collections of images and videos. His main interests are high-level event detection and processing. He has worked in the areas of content-based image retrieval, query processing in multimedia databases, event detection in sports video, and spatio-temporal modeling and querying of video databases. He has several papers in these areas to his credit.

* * *

Brett Adams received a BE in information technology from the University of Western Australia, Perth, Australia (1995). He then worked for three years developing software, particularly for the mining industry.
Technology, Perth, on the extraction of high-level elements from feature film, which he completed in 2003. His research interests include computational media aesthetics, with application to mining multimedia data for meaning and to computationally assisted, domain-specific multimedia authoring.

Edward Altman has spent the last decade developing a theory of media blending as a senior scientist in the Media Semantics Department at the Institute for Infocomm Research (I2R) and, before that, as a visiting researcher at ATR Media Integration & Communications Research Laboratories. His current research involves the development of Web services, ontology management, and media analysis to create interactive environments for distance learning. He was instrumental in the development of semi-automated video editing technologies for muvee Technologies. He received a PhD from the University of Illinois in 1991 and performed post-doctoral research in a joint computer vision and cognitive science program at the Beckman Institute for Advanced Science and Technology.

Susanne Boll is assistant professor for multimedia and Internet technologies, Department of Computing Science, University of Oldenburg, Germany. In 2001, Boll received her doctorate with distinction at the Technical University of Vienna, Austria. Her studies were concerned with the flexible multimedia document model ZYX, designed and realized in the context of a multimedia database system. She received her diploma degree with distinction in computer science at the Technical University of Darmstadt, Germany (1996). Her research interests lie in the areas of personalization of multimedia content, mobile multimedia systems, multimedia information systems, and multimedia document models. The research projects that Boll is working on include a framework for personalized multimedia content generation and the development of personalized (mobile) multimedia presentation services. She has published her research results at many international workshops, conferences and journals. Boll is an active member of SIGMM of the ACM and of the German Informatics Society (GI).

Cheong Loong Fah received a BEng from the National University of Singapore and a PhD from the University of Maryland, College Park, Center for Automation Research (1990 and 1996, respectively). In 1996, he joined the Department of Electrical and Computer Engineering, National University of Singapore, where he is currently an assistant professor. His research interests are related to the basic processes in the perception of three-dimensional motion, shape, and their relationship, as well as the application of these theoretical findings to specific problems in navigation and in multimedia systems, for instance, in the problems of video indexing in large databases.

Isabel F. Cruz is an associate professor of computer science at the University of Illinois at Chicago (UIC). She holds a PhD in computer science from the University of Toronto. In 1996, she received a National Science Foundation CAREER award. She is a member of the National Research Council's Mapping Science Committee (2004-2006). She has been invited to give more than 50 talks worldwide, has served on more than 70 program committees, and has more than 60 refereed publications in databases, Semantic Web, visual languages, graph drawing, user interfaces, multimedia, geographic information systems, and information retrieval.

Ajay Divakaran received a BE (with Honors) in electronics and communication engineering from the University of Jodhpur, Jodhpur, India (1985), and an MS and PhD from Rensselaer Polytechnic Institute, Troy, New York (1988 and 1993, respectively). He was an assistant professor with the Department of Electronics and Communications Engineering, University of Jodhpur, India (1985-1986). He was a research associate at the Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India (1994-1995). He was a scientist with Iterated Systems Inc., Atlanta, Georgia (1995-1998). He joined Mitsubishi Electric Research Laboratories (MERL) in 1998 and is now a senior principal member of the technical staff. He has been an active contributor to the MPEG-7 video standard. His current research interests include video analysis, summarization, indexing and compression, and related applications. He has published several journal and conference papers, as well as four invited book chapters on video indexing and summarization. He currently serves on program committees of key conferences in the area of multimedia content analysis.

Thomas S. Huang received a BS in electrical engineering from the National Taiwan University, Taipei, Taiwan, ROC, and an MS and ScD in electrical engineering from the Massachusetts Institute of Technology (MIT), Cambridge. He was with the faculty of the Department of Electrical Engineering at MIT (1963-1973) and with the faculty of the School of Electrical Engineering and Signal Processing at Purdue University, West Lafayette, Indiana (1973-1980). In 1980, he joined the University of Illinois at Urbana-Champaign, where he is now William L. Everitt distinguished professor of electrical and computer engineering, research professor at the Coordinated Science Laboratory, and head of the Image Formation and Processing Group at the Beckman Institute for Advanced Science and Technology. He is also co-chair of the institute's major research theme (human-computer intelligent interaction). During his sabbatical leaves, he has been with the MIT Lincoln Laboratory, Lexington, MA; IBM T.J. Watson Research Center, Yorktown Heights, NY; and Rheinishes Landes Museum, Bonn, West Germany. He held visiting professor positions at the Swiss Federal Institutes of Technology, Zurich and Lausanne, Switzerland; University of Hannover, West Germany; INRS-Telecommunications, University of Quebec, Montreal, QC, Canada; and University of Tokyo, Japan. He has served as a consultant to numerous industrial forums and government agencies both in the United States and abroad. His professional interests lie in the broad area of information technology, especially the transmission and processing of multidimensional signals. He has published 14 books and more than 500 papers in network theory, digital filtering, image processing, and computer vision. He is a founding editor of the International Journal Computer Vision, Graphics, and Image Processing, and editor of the Springer Series in Information Sciences (Springer Verlag). Dr. Huang is a member of the National Academy of Engineering; a foreign member of the Chinese Academies of Engineering and Sciences; and a fellow of the International Association of Pattern Recognition and of the Optical Society of America. He has received a Guggenheim Fellowship, an AV Humboldt Foundation Senior US Scientist Award, and a Fellowship from the Japan Association for the Promotion of Science. He received the IEEE Signal Processing Society's Technical Achievement Award in 1987 and the Society Award in 1991. He was awarded the IEEE Third Millennium Medal in 2000. In addition, in 2000 he received the Honda Lifetime Achievement Award for contributions to motion analysis. In 2001, he received the IEEE Jack S. Kilby Medal. In 2002, he
received the King-Sun Fu Prize from the International Association of Pattern Recognition and the Pan Wen-Yuan Outstanding Research Award.

Jesse S. Jin graduated with a PhD from the University of Otago, New Zealand. He worked as a lecturer at Otago; as a lecturer, senior lecturer and associate professor at the University of New South Wales; and as an associate professor at the University of Sydney. He is now the chair professor of IT at The University of Newcastle. Professor Jin's areas of interest include multimedia technology, medical imaging, computer vision and the Internet. He has published more than 160 articles and 14 books and edited books. He also has one patent and is in the process of filing three more. He has received several million dollars in research funding from government agencies (ARC, DIST, etc.), universities (UNSW, USyd, Newcastle, etc.), industries (Motorola, NewMedia, Cochlear, Silicon Graphics, Proteome Systems, etc.), and overseas organisations (NZ Wool Board, UGC HK, CAS, etc.). He established a spin-off company that won the 1999 ATP Vice-Chancellor New Business Creation Award. He is a consultant to companies such as Motorola, Computer Associates, ScanWorld, Proteome Systems, and HyperSoft.

Ashraf A. Kassim (M'81) received his BEng (First Class Honors) and MEng degrees in electrical engineering from the National University of Singapore (NUS) (1985 and 1987, respectively). From 1986 to 1988, he worked on the design and development of machine vision systems at Texas Instruments. He went on to obtain his PhD in electrical and computer engineering from Carnegie Mellon University, Pittsburgh (1993). Since 1993, he has been with the Electrical & Computer Engineering Department at NUS, where he is currently an associate professor and deputy head of the department. Dr. Kassim's research interests include computer vision, video/image processing and compression.

Wolfgang Klas is professor at the Department of Computer Science and Business Informatics at the University of Vienna, Austria, heading the multimedia information systems group. Until 2000, he was professor with the Computer Science Department at the University of Ulm, Germany. Until 1996, he was head of the Distributed Multimedia Systems Research Division (DIMSYS) at GMD-IPSI, Darmstadt, Germany. From 1991 to 1992, Dr. Klas was a visiting fellow at the International Computer Science Institute (ICSI), University of California at Berkeley, USA. His research interests are in multimedia information systems and Internet-based applications. He currently serves on the editorial board of the Very Large Data Bases Journal and has been a member and chair of program committees of many conferences.

Shonali Krishnaswamy is a research fellow at the School of Computer Science and Software Engineering at Monash University, Melbourne, Australia. Her research interests include service-oriented computing, distributed and ubiquitous data mining, software agents and rough sets. She received her master's and PhD in computer science from Monash University. She is a member of IEEE and ACM.

Neal Lesh's research efforts currently focus on human-computer collaborative interface agents and interactive data exploration. He has recently published papers on a range of topics including computational biology, data mining, information visualization, human-robot interaction, planning, combinatorial optimization, story-sharing systems, and
intelligent tutoring. Before joining MERL, Neal completed a PhD at the University of Washington and worked briefly as a post-doctoral student at the University of Rochester.

Joo-Hwee Lim received his BSc (Hons I) and MSc (by research) in computer science from the National University of Singapore (1989 and 1991, respectively). He joined the Institute for Infocomm Research, Singapore, in October 1990. He has conducted research in connectionist expert systems, neural-fuzzy systems, handwriting recognition, multi-agent systems, and content-based retrieval. He was a key researcher in two international research collaborations, namely the Real World Computing Partnership funded by METI, Japan, and the Digital Image/Video Album project with CNRS, France, and the School of Computing, National University of Singapore. He has published more than 50 refereed international journal and conference papers in his research areas, including content-based processing, pattern recognition, and neural networks.

Namunu C. Maddage is currently pursuing a PhD in computer science in the School of Computing, National University of Singapore, Singapore. His research interests are in the areas of music modeling, music structure analysis and audio/music data mining. He received a BE in 2000 from the Department of Electrical & Electronic Engineering at Birla Institute of Technology (BIT), Mesra, India.

Ankush Mittal received his BTech and master's (by research) degrees in computer science and engineering from the Indian Institute of Technology, Delhi. He received a PhD from the National University of Singapore (2001). Since October 2003, he has been working as assistant professor at the Indian Institute of Technology - Roorkee. Prior to this, he was serving as a faculty member in the Department of Computer Science, National University of Singapore. His research interests are in multimedia indexing, machine learning, and motion analysis.

Baback Moghaddam is a senior research scientist of MERL Research Lab at Mitsubishi Electric Research Labs, Cambridge, MA, USA. His research interests are in computational vision with a focus on probabilistic visual learning, statistical modeling and pattern recognition, with application in biometrics and computer-human interfaces. He obtained his PhD in electrical engineering and computer science (EECS) from the Massachusetts Institute of Technology (MIT) in 1997. There, he was a member of the Vision and Modeling Group at the MIT Media Laboratory, where he developed a fully-automatic vision system which won DARPA's 1996 FERET Face Recognition Competition. Dr. Moghaddam was the winner of the 2001 Pierre Devijver Prize from the International Association of Pattern Recognition for his innovative approach to face recognition and received the Pattern Recognition Society Award for exceptional outstanding quality for his journal paper "Bayesian Face Recognition". He currently serves on the editorial board of the journal Pattern Recognition and has contributed to numerous textbooks on image processing and computer vision (including the core chapter in Springer Verlag's latest biometric series, Handbook of Face Recognition).


Anne H.H. Ngu is an associate professor with the Department of Computer Science, Texas State University, San Marcos, Texas. Ngu received her PhD in 1990 from the University of Western Australia. She has more than 15 years of experience in research and development in IT, with expertise in integrating data and applications on the Web, multimedia databases, Web services, and object-oriented technologies. She has worked as a researcher in different countries, including at the Institute of Systems Science in Singapore; Tilburg University, The Netherlands; and Telcordia Technologies and MCC in Austin, Texas. Prior to moving to the United States, she worked as a senior lecturer in the School of Computer Science and Engineering, University of New South Wales (UNSW). Currently, she also holds an adjunct associate professor position at UNSW and a summer faculty scholar position at Lawrence Livermore National Laboratory, California.

Krishnan V. Pagalthivarthi is associate professor in the Department of Applied Mechanics, Indian Institute of Technology Delhi, India. Dr. Krishnan received his BTech from IIT Delhi (1979) and obtained his MSME (1984) and PhD (1988) from Georgia Institute of Technology. He has supervised several students studying for their MTech, MS (R), and PhD degrees and has published numerous research papers in various journals.

Silvia Pfeiffer received her master's degree in computer science and business management from the University of Mannheim, Germany (1993). She returned to that university in 1994 to pursue a PhD within the MoCA (Movie Content Analysis) project, exploring novel extraction methods for audio-visual content and novel applications using these. Her 1999 thesis was on audio content analysis of digital video. Next, she moved to Australia to work as a research scientist in digital media at the CSIRO in Sydney. She has explored several projects involving automated content analysis in the compressed domain, focusing on segmentation applications. She has also made active submissions to MPEG-7. In January 2001, she had the initial ideas for a web of continuous media, the specifications of which were worked out within the continuous media web research group that she is heading.

Conrad Parker works as a senior software engineer at CSIRO, Australia. He is actively involved in various open source multimedia projects, including development of the Linux and Unix sound editor Sweep. With Dr. Pfeiffer, he developed the mechanisms for streamable metadata encapsulation used in the Annodex format, and he is responsible for development of the core software libraries, content creation tools and server modules of the reference implementation. His research focuses on interesting applications of dynamic media generation and improved TCP congestion control for efficient delivery of media resources.

André Pang received his Bachelor of Science (Honors) at the University of New South Wales, Sydney, Australia (2003). He has been involved with the Continuous Media Web project since 2001, helping to develop the first specifications and implementations of the Annodex technology and implementing the first Annodex Browser under Mac OS X. André is involved in integrating Annodex support into several media frameworks, such as the VideoLAN media player, DirectShow, xine, and QuickTime. In his spare time, he enjoys researching compilers and programming languages, and he also codes on many different open-source projects.

Viranga Ratnaike is a PhD candidate in the School of Computer Science and Software Engineering, Faculty of Information Technology, Monash University, Melbourne, Australia. He holds a Bachelor of Applied Science (Honors) in computer science. After being a programmer for several years, he decided to return to full-time study and pursue a career in research. His research interests are in emergence, artificial intelligence and nonverbal knowledge representation.

Olga Sayenko received her BS from the University of Illinois at Chicago (2001). She is working toward her MS under the direction of Dr. Cruz, with an expected graduation date of July 2004.

Karin Schellner studied computer science at the Technical University of Vienna and received her diploma degree in 2000. From 1995 to 2001, she worked at IBM Austria. From 2001 to 2003, she worked at the Department of Computer Science and Business Informatics at the University of Vienna. Since 2003, she has been a member of Research Studios Austria Digital Memory Engineering. She has been responsible for the concept, design and implementation of the data model developed in CULTOS.

Ansgar Scherp received his diploma degree in computer science at the Carl von Ossietzky University of Oldenburg, Germany (2001), with the diploma thesis "process model and development methodology for virtual laboratories". Afterwards, he worked for two years at the University of Oldenburg, where he developed methods and tools for virtual laboratories. Since 2003 he has been working as a scientific assistant at the research institute OFFIS on the MM4U (Multimedia for you) project. The aim of this project is the development of a component-based, object-oriented software framework that offers extensive support for the dynamic generation of personalized multimedia content.

Xi Shao received a BS and MS in computer science from Nanjing University of Posts and Telecommunications, Nanjing, PR China (1999 and 2002, respectively). He is currently pursuing a PhD in computer science in the School of Computing, National University of Singapore, Singapore. His research interests include content-based audio/music analysis, music information retrieval, and multimedia communications.

Chia Shen is associate director and senior research scientist of MERL Research Lab at Mitsubishi Electric Research Labs, Cambridge, MA, USA. Dr. Shen's research investigates HCI issues in our understanding of multi-user, computationally augmented interactive surfaces, such as digital tabletops and walls. Her research probes new ways of thinking in terms of UI design and interaction technique development, and entails the reexamination of the conventional metaphor and underlying system infrastructure, which have traditionally been geared towards mice- and keyboard-based, single-user desktop computers and devices. Her current research projects include DiamondSpin, UbiTable and PDH (see www.merl.com/projects for details).

Jialie Shen received his BSc in applied physics from Shenzhen University, China. He is now a PhD candidate and associate lecturer in the School of Computer Science and

Engineering at the University of New South Wales (Sydney, Australia). His research interests include database systems, indexing, multimedia databases and data mining.

Bala Srinivasan is a professor of information technology in the School of Computer Science and Software Engineering at the Faculty of Information Technology, Monash University, Melbourne, Australia. He was formerly an academic staff member of the Department of Computer Science and Information Systems at the National University of Singapore and the Indian Institute of Technology, Kanpur, India. He has authored and jointly edited six technical books and authored and co-authored more than 150 international refereed publications in journals and conferences in the areas of multimedia databases, data communications, data mining and distributed systems. He is a founding chairman of the Australasian Database Conference. He was awarded the Monash Vice-Chancellor's medal for post-graduate supervision. He holds a Bachelor of Engineering (Honors) in electronics and communication engineering, and a master's and PhD, both in computer science.

Qi Tian received his PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC), Illinois (2002). He received his MS in electrical and computer engineering from Drexel University, Philadelphia, Pennsylvania (1996), and a BE in electronic engineering from Tsinghua University, China (1992). He has been an assistant professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA) since 2002 and an adjunct assistant professor in the Department of Radiation Oncology at the University of Texas Health Science Center at San Antonio (UTHSCSA) since 2003. Before he joined UTSA, he was a research assistant at the Image Formation and Processing (IFP) Group of the Beckman Institute for Advanced Science and Technology and a teaching assistant in the Department of Electrical and Computer Engineering at UIUC (1997-2002). During the summers of 2000 and 2001, he was an intern researcher with Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA. In the summer of 2003, he was a visiting professor at NEC Laboratories America, Inc., Cupertino, CA, in the Video Media Understanding Group. His current research interests include multimedia, computer vision, machine learning, and image and video processing. He has published about 40 technical papers in these areas and has served on the program committees of several conferences in the area of content-based image retrieval. He is a senior member of IEEE.

Svetha Venkatesh is a professor at the School of Computing at the Curtin University of Technology, Perth, Western Australia. Her research is in the areas of large-scale pattern recognition, image understanding and applications of computer vision to image and video indexing and retrieval. She is the author of about 200 research papers in these areas and is currently co-director of the Center of Excellence in Intelligent Operations Management.

Utz Westermann is a member of the Department of Computer Science and Business Informatics at the University of Vienna. He received his diploma degree in computer science at the University of Ulm, Germany (1998), and his doctoral degree in technical sciences at the Technical University of Vienna, Austria (2004). His research interests

lie in the area of context-aware multimedia information systems. This includes metadata standards for multimedia content, metadata management, XML, XML databases, and multimedia databases. Utz Westermann has participated in several third-party-funded projects in this domain.

Campbell Wilson received his master's degree and PhD in computer science from Monash University. His research interests include multimedia retrieval techniques, probabilistic reasoning, virtual reality interfaces and adaptive user profiling. He is a member of the IEEE.

Ying Wu received his PhD in electrical and computer engineering from the University of Illinois at Urbana-Champaign (UIUC), Urbana, Illinois (2001). From 1997 to 2001, he was a research assistant at the Beckman Institute at UIUC. During 1999 and 2000, he was with Microsoft Research, Redmond, Washington. Since 2001, he has been an assistant professor in the Department of Electrical and Computer Engineering of Northwestern University, Evanston, Illinois. His current research interests include computer vision, machine learning, multimedia, and human-computer interaction. He received the Robert T. Chien Award at UIUC and is a recipient of the NSF CAREER award.

Lonce Wyse heads the Multimedia Modeling Lab at the Institute for Infocomm Research (I2R). He also holds an adjunct position at the National University of Singapore, where he teaches a course in sonic arts and sciences. He received his PhD in 1994 in cognitive and neural systems from Boston University, specializing in vision and hearing systems, and then spent a year as a Fulbright Scholar in Taiwan before joining I2R. His current research focus is applications and techniques for developing sound models.

Changsheng Xu received his PhD from Tsinghua University, China (1996). From 1996 to 1998, he was a research associate professor in the National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. He joined the Institute for Infocomm Research (I2R) of Singapore in March 1998. Currently, he is head of the Media Analysis Lab in I2R. His research interests include multimedia content analysis/indexing/retrieval, digital watermarking, computer vision and pattern recognition. He is a senior member of IEEE.

Jie Yu is a PhD candidate in computer science at the University of Texas at San Antonio (UTSA). He received his bachelor's degree in telecommunication engineering from Dong Hua University, China (2000). He has been a research assistant and teaching assistant in the Department of Computer Science at UTSA since 2002. His current research in image processing is concerned with the study of efficient algorithms in content-based image retrieval.

Sonja Zillner studied mathematics at the University of Freiburg, Germany, and received her diploma degree in 1999. Since 2000, she has been a member of the scientific staff of the Department of Computer Science and Business Informatics at the University of Vienna. Her research interests lie in the areas of semantic multimedia content modeling and e-commerce. She has participated in the EU project CULTOS (Cultural Units of Learning Tools and Services).

Samar Zutshi received his master's degree in information technology from Monash University, Melbourne, Australia, during which he did some work on agent communication. After a stint in the software industry he is back at Monash doing what he enjoys: research and teaching. He is working in the area of relevance feedback in multimedia retrieval for his PhD.


Index

A abstraction 188 acoustical music signal 117 ADSR 100 amplitude envelope 100 annodex 161 artificial life 355 attributes ranking method 107 audio model 365 morph 376 production 368 B beat space segmentation 122 blending network 370 theory 372 C censor 93 Cepstrum 102 class relative indexing 40 classical approach 290 clips 173

clustering 137 CMWeb 161 collection DS 186 combining multiple visual features (CMVF) 3 composite image features 6 computational theory 374 constrained generating procedures (CGP) 352 content-based multimedia retrieval 289 music classification 99 music summarization 99 context information extraction 82 continuous media Web 160 D data layer 340 DelaunayView 334 description scheme (DS) 183 digital cameras 33 digital item 164 distance-based access methods 6 domain 235 dynamic authoring 252 dynamic Bayesian network (DBN) 77


E emergent semantics 351 EMMO 305 enhanced multimedia metaobjects 305, 311 Enterprise Java Beans (EJBs) 310 F facial images 35 film semiotics 138 functional aspect 306 G genotype 352 granularity 187 graphical user interfaces (GUI) 247 ground truth based method 107 H human visual perception 10 human-computer partnership 223 hybrid dimension reducer 6 hybrid training algorithm 13 I IETF 161 image classification 35 databases 1 distortion 19 feature dimension reduction 4 feature vectors 2 similarity measurement 5 iMovie 226 indexing 1 instrument detection 119 identification 119 integration lLayer 341 Internet 162 Internet engineering task force 161 interpretation models 150 J just-in-time (JIT) 228

K keyword integration approach 292 Kuleshov experiments 78 L LayLab 336 logical media parts 315 low energy component 102 M machine learning approach 106, 292 mathematical-conceptual approach 293 media aspect 306 media blending 363 content 248 semantics 351 medical images 35 metric trees 2 MM4U 246 MPEG-21 163 MPEG-7 165, 182, 335 multimedia authoring 223, 251 composition 262 content 183, 247 descriptions scheme (MDS) 183 research 161 semantics 187, 288, 294, 333 software engineering 272 multimedia-extended entity relationship 137 multimodal operators 149 music genre classification 114 representation 100 structure analysis 106 summarization 104 video 109 video alignment 111 video structure 109 video summarization 109 N narrative theory 227 network training 22


neural network Initialization 20 O object recognition 30 ontology objects 315 P pattern classification 30 classifiers 30 discovery indexing (PDI) 36, 41 discovery scheme 41 matching approach 106 recognition 30 personalization engine 253 personalized multimedia content 246 phenotype 352 pitch content features 103 presentation layer 345 mining 367 presentation-oriented modeling 306 probabilistic-statistical approach 291 Q query by example (QBE) 38 R relevance feedback 288 resource description framework (RDF) 335 rhetorical structure theory (RST) 227 rhythm extraction 122 rhythmic content features 103 S search engines 161 engine support 177 segment DS 186 self-organization 355 semantic aspect 306 DS 186 gap 289

indexing 37 region detection 117 region indexing (SRI) 34 Web 306 semantics 30, 136, 333 Shockwave file format 249 Sightseeing4U 277 SMIL 162 song structure 121 spatial access methods (SAMs) 2 relationship 140 relationship operators 148 spectral centroid 101 contrast feature 103 spectrum flux 101 rolloff 101 speech 118 Sports4U 278 synchronized multimedia interaction language 162 system architecture 340 T temporal grouping 137 motion activity 80 ordering 88 relationship 140 relationship operators 148 URI 166 timbral textural features 100 time segments 173 time-to-collision (TTC) 80 U usage information 186 user interaction 187 user interfaces 255 user response 295 user-centric modeling 298 V video


content 135 data 77 data model 142 edit 381 editing 367 metamodel framework 136 object (VO) 142 retrieval systems 77 semantics 135 VIMET 135 visual keywords 34 W World Wide Web 160 Z zero crossing rates 102



